Data Science Bootstrap

Use Jupyter as an experimentation playground

What are the use cases for Jupyter?

I use Jupyter notebooks in the following ways.

Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)

Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.

Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.

How do you get Jupyter?

Get Jupyter installed in each of your environments, by including it in your environment.yml file. (see: Create one conda environment per project)

Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.

How do you get Jupyter to recognize your environment's Python?

To get Jupyter to recognize the Python interpreter that is defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following commands:

export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME
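
To confirm that the kernel was registered, one option is to ask Jupyter for its kernel list in JSON form (`jupyter kernelspec list --json`) and read the names out of it. The sketch below only parses that JSON; the helper name `kernel_names` is hypothetical, and the sample text in the usage note is made up for illustration.

```python
import json


def kernel_names(kernelspec_json: str) -> list[str]:
    """Return the sorted kernel names found in `jupyter kernelspec list --json` output.

    Assumes the output is a JSON object with a top-level "kernelspecs" mapping
    whose keys are the kernel names.
    """
    payload = json.loads(kernelspec_json)
    return sorted(payload.get("kernelspecs", {}))
```

For example, after saving the command's output to a string, `"my-env" in kernel_names(output)` tells you whether the environment named my-env is visible to Jupyter.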

How do you launch Jupyter?

Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)

To launch Jupyter:

Open your shell
Navigate to your project directory
Activate your conda environment
Then launch Jupyter Lab: jupyter lab

In shell terms:

cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab
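
Once Jupyter Lab is running, you can check from a notebook cell that it inherited the environment you expected. Here is a minimal sketch; the function name `describe_environment` is hypothetical, and it relies on conda setting the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import os
import sys


def describe_environment(environ=os.environ) -> dict:
    """Summarize which Python and which shell environment this process inherited."""
    path_entries = environ.get("PATH", "").split(os.pathsep)
    return {
        "python": sys.executable,                       # interpreter running this code
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),  # None if no conda env is active
        "path_head": path_entries[:3],                  # first few PATH entries
    }


print(describe_environment())
```

If `conda_env` is not the project's environment, Jupyter was probably launched from a shell where the environment had not been activated.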

Take full control of your shell environment variables

Why control your environment variables

If you're not sure what environment variables are, I have an essay on them that you can reference. Mastering environment variables is crucial for data scientists!

Your shell environment, whether it is zsh or bash or fish or something else, is supremely important. It determines the runtime environment, which in turn determines which Python you're using, whether you have proxies set correctly, and more. Rather than leave this to chance, I would recommend instead gaining full control over your environment variables.

How do I control my environment variables

The simplest way is to set them explicitly in your shell initialization script. For bash shells, it's either .bashrc or .bash_profile. For the Z shell, it'll be the .zshrc file. In there, step by step, set the environment variables that you need system-wide.

For example, explicitly set your PATH environment variable with explainers that tell future you why you ordered the PATH in a certain way.

# Start with an explicit minimal PATH
export PATH=/bin:/usr/bin:/usr/local/bin

# Add in my custom binaries that I want available across projects
export PATH=$HOME/bin:$PATH

# Add in anaconda installation path
export PATH=$HOME/anaconda/bin:$PATH

# Add more stuff below...
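
The ordering matters because the shell searches PATH from left to right and uses the first match. A minimal Python sketch of that lookup follows. The helper name `first_on_path` is hypothetical, and it simplifies by only checking that a file exists, not that it is executable; the standard library's `shutil.which` does the real lookup.

```python
import os


def first_on_path(executable, path, exists=os.path.exists):
    """Return the first PATH directory (left to right) that contains `executable`."""
    for directory in path.split(os.pathsep):
        if exists(os.path.join(directory, executable)):
            return directory
    return None


# Which directory would supply `python` under the current PATH?
print(first_on_path("python", os.environ.get("PATH", "")))
```

With the PATH built above, the anaconda directory is prepended last, so it sits first in the search order, and its python is found before any system python.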

If you want your shell initialization script to be cleaner, you can refactor it out into a second bash script called env_vars.sh, which lives either inside your home directory or your dotfiles repository (see: Leverage dotfiles to get your machine configured quickly). Then, source the env_vars.sh script from the shell initialization script:

source ~/env_vars.sh

Other tools, like the Anaconda installer, may give you an option to modify your shell initialization script. If so, keep this in mind, because those changes affect the environment variables that you set by hand. At the end of your shell initialization script, you can echo the final state of your environment variables to help you debug.

Environment variables that need to be set on a per-project basis are handled slightly differently. See Create runtime environment variable configuration files for each of your projects.

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, right when you need to put in new versions of figures that were made before. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project-name-goes-here  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver. Even so, staying consistent is preferable as far as possible, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin that particular package to at least that version. (See below for the syntax needed to pin a version.) At the same time, a new package version may break compatibility; in this case, you will want to pin the package to a maximum version.

It's not always obvious which pins you need, though, so be sure to keep your environment.yml file under version control.

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on API changes that broke compatibility with older versions.
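
To make the modifiers concrete, here is a small sketch that checks a version against a pin such as ">=1.5,<2.0". It is an illustration only, not how conda or pip are implemented: the function name `satisfies` is hypothetical, it handles purely numeric versions with pip-style operators, and it ignores pre-releases, wildcards, and conda-specific matching rules.

```python
import re

# One clause of a pin: an operator followed by a dotted numeric version.
_CLAUSE = re.compile(r"^(>=|<=|==|>|<)\s*([0-9]+(?:\.[0-9]+)*)$")


def _as_tuple(version: str, width: int) -> tuple:
    """Turn "1.5" into (1, 5, 0, ...) padded with zeros to `width` parts."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version: str, spec: str) -> bool:
    """True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        match = _CLAUSE.match(clause.strip())
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        operator, bound = match.groups()
        width = max(version.count("."), bound.count(".")) + 1
        have, want = _as_tuple(version, width), _as_tuple(bound, width)
        checks = {
            ">=": have >= want,
            "<=": have <= want,
            "==": have == want,
            ">": have > want,
            "<": have < want,
        }
        if not checks[operator]:
            return False
    return True


print(satisfies("1.5.2", ">=1.5,<2.0"))
print(satisfies("2.0.0", ">=1.5,<2.0"))
```

The first call passes both bounds, while the second is rejected by the upper bound. That is the behaviour you want when version 2.0 introduces breaking API changes.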

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to upgrading packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!
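
If you go the hacky route, you need to know which version actually got installed before you record it in environment.yml. One way to look that up from Python is the standard library's importlib.metadata. The sketch below assumes the package ships standard Python metadata, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if not found."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions(["numpy", "pandas"]))
```

Run it inside the activated environment, then copy the versions you care about into environment.yml.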

Ensure your environment kernels are available to Jupyter

If you practice "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable in Jupyter, first check that the environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)
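
After selecting the kernel in a notebook, a quick sanity check is to look at which interpreter is actually running. The sketch below assumes conda's default layout, in which a named environment lives in a directory named after it (for example, under envs/). The helper name `interpreter_belongs_to` is hypothetical.

```python
import sys
from pathlib import Path


def interpreter_belongs_to(env_name: str, prefix: str = sys.prefix) -> bool:
    """True if the interpreter's prefix directory is named after the environment."""
    return Path(prefix).name == env_name


print("Interpreter:", sys.executable)
print("Prefix:", sys.prefix)
```

If the printed prefix points at the base installation rather than your project's environment, the notebook is running on a different kernel than you intended.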

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Create runtime environment variable configuration files for each of your projects

Why configure environment variables per project

When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)

How to configure environment variables for your project

Here, I'm assuming that you follow the practice of

and that you Use pyprojroot to define relative paths to the project root.

To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as such:

export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"

We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.
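
To illustrate what each line of such a file carries, here is a tiny parser sketch that turns one line into a key/value pair. It is an illustration of the format only, not python-dotenv's implementation. The function name `parse_env_line` is hypothetical, and it ignores shell features such as variable expansion and escaped quotes.

```python
def parse_env_line(line: str):
    """Parse one line of a .env file into (key, value), or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if line.startswith("export "):
        line = line[len("export "):]  # the export keyword only matters to the shell
    key, separator, value = line.partition("=")
    if not separator:
        return None
    return key.strip(), value.strip().strip('"').strip("'")


print(parse_env_line('export ENV_VAR_1="some_value"'))
```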

Now, if you're working on a Python project, make sure you have the package python-dotenv installed in the conda environment. Then, in your Python .py source files:

from dotenv import load_dotenv
from pyprojroot import here
import os

dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path)  # this will load the .env file in your project directory root.

# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")

In this way, your environment variables get loaded into the Python process at runtime. If you instead source the .env file in your shell, they become available to all child processes started from within that shell (e.g. Jupyter Lab, or Python, etc.).
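
Because os.getenv quietly returns None when a variable is missing, it can help to fail early with a clear message. A minimal sketch, using a hypothetical helper named `require_env`:

```python
import os


def require_env(name: str, environ=os.environ) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

You would call it as `require_env("DATABASE_CONNECTION_STRING")` after the .env file has been loaded, so that a missing variable surfaces at startup rather than deep inside a database call.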

Always gitignore your .env file

Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
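
As a rough automated check, the sketch below looks for a line in the .gitignore text that names .env directly. It does not understand gitignore glob patterns, so treat `git check-ignore -v .env` as the authoritative answer. The function name `ignores_dotenv` is hypothetical.

```python
def ignores_dotenv(gitignore_text: str) -> bool:
    """True if some line of the .gitignore text is exactly `.env` or `/.env`."""
    patterns = {line.strip() for line in gitignore_text.splitlines()}
    return bool(patterns & {".env", "/.env"})


print(ignores_dotenv("__pycache__/\n.env\n"))
```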

Keep your notebooks organized with logical categories

In my experience, there are three types of notebooks that get written.

Prototyping notebooks go under `notebooks/`

These notebooks are drafting grounds for "production" code. We use Jupyter notebooks as an experimentation playground (see: Use Jupyter as an experimentation playground). They do not need to be kept running reliably/reproducibly, and are essentially considered "disposable".

If you are collaborating with colleagues on a project, you can categorize notebooks by their primary author. For example, if I am working with Lily and Arkadij on a project, we can each get our own "user spaces" in there while agreeing not to touch each other's notebooks:

project/
- notebooks/
  - lily/     # lily's notebooks go here
  - arkadij/  # arkadij's notebooks go here
  - eric/     # eric's notebooks go here

Documentation notebooks go under `docs/`

These notebooks are written in the original spirit of Jupyter notebooks. They combine prose, code and code-generated figures. They contain a narrative, a data story. One may say they are "production", in that someone will read them and act on them. They need to be reliably executed from top-to-bottom, usually in a continuous integration system (see: Build a continuous integration pipeline for your source), and built into the documentation using MkDocs and mknotebooks.

For these notebooks, we might choose to keep them in the docs/ directory:

project/
- docs/
  - some_notebook.ipynb

Application notebooks go under `app/`

Sometimes you might opt to use voila to build front-end applications for those whom you serve. This is a convenient option because you don't have to jump out of a Jupyter context if you're already in there. These notebooks are considered "production" as well. However, because they are code embedded in JSON, they are more difficult to diff with git.

For these notebooks, you probably want to keep them in a directory named app, where anything that becomes front-facing to the clients we serve is stored:

project/
- app/
  - notebook_app.ipynb
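
The three categories above can be scaffolded in one go. Here is a minimal sketch using pathlib. The function name `scaffold_notebook_dirs` is hypothetical, and the directory names follow the notebooks/, docs/ and app/ convention from the headings.

```python
from pathlib import Path


def scaffold_notebook_dirs(project_root, authors):
    """Create notebooks/<author>/, docs/ and app/ under the project root."""
    root = Path(project_root)
    directories = [root / "notebooks" / author for author in authors]
    directories += [root / "docs", root / "app"]
    for directory in directories:
        # exist_ok=True makes the function safe to re-run on an existing project
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

For example, `scaffold_notebook_dirs("project", ["lily", "arkadij", "eric"])` creates one prototyping space per collaborator alongside the docs/ and app/ directories.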

Install Anaconda on your machine

What is anaconda

Anaconda is a way to get a Python installed on your system.

One of the neat but oftentimes confusing things about Python is that you can have multiple Python executables living around on your system. Anaconda makes it easy for you to:

Obtain Python
Manage different Python versions in isolated environments using a consistent interface
Install packages into these environments

Why use anaconda (or one of its variants)?

Why is this a good thing? Primarily because you might have individual projects that need different versions of Python and different versions of packages that are built for Python. Also, default Python installations, such as the ones shipped with older versions of macOS, tend to be versions behind the latest, which is to the detriment of your projects. Some built-in apps in an operating system may depend on that old version of Python (such as iPhoto), which means if you mess up the installation, you might break those built-in apps. Hence, you will want a tool that lets you easily create isolated Python environments.

The Anaconda Python distribution fulfills the following key needs:

You'll be able to create isolated environments on a per-project basis. (see: Follow the rule of one-to-one in managing your projects)
You'll be able to install packages into those isolated environments, and evolve them over time. (see: Create one conda environment per project)

Installing Anaconda on your local machine thus helps you get easy access to Python, Jupyter (see: Use Jupyter as an experimentation playground), and other tools for modelling and analysis.

How to get anaconda?

To install the Miniforge variant of Anaconda, which is lighter-weight than the full Anaconda distribution, use the following commands:

cd ~
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -O anaconda.sh

This will send you to your home directory, and then download the Miniforge bash script installer from the conda-forge Miniforge releases page on GitHub as anaconda.sh.
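
The `$(uname)` and `$(uname -m)` parts of the URL pick the installer for your operating system and CPU architecture. The sketch below builds the same URL in Python, assuming that `platform.system()` and `platform.machine()` report the same values as those two commands on Linux and macOS. The function name `miniforge_installer_url` is hypothetical, and the .sh pattern does not apply on Windows.

```python
import platform

BASE_URL = "https://github.com/conda-forge/miniforge/releases/latest/download"


def miniforge_installer_url(system=None, machine=None):
    """Build the installer URL for the given (or current) OS and architecture."""
    system = system or platform.system()     # e.g. "Linux" or "Darwin"
    machine = machine or platform.machine()  # e.g. "x86_64" or "arm64"
    return f"{BASE_URL}/Miniforge3-{system}-{machine}.sh"


print(miniforge_installer_url())
```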

Now, install Anaconda:

bash anaconda.sh -b -p $HOME/anaconda/

This will install the Anaconda distribution of Python onto your system inside your home directory. You can now install packages at will, without needing sudo privileges!
