Create runtime environment variable configuration files for each of your projects
When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)
Here, I'm assuming that you follow the practice of One project should get one git repository and that you Use pyprojroot to define relative paths to the project root.
To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as such:
export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"
We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.
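For example, in a shell session at the project root (using the variable from the sample .env above):
source .env
echo $DATABASE_CONNECTION_STRING  # should print the value defined in .env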
Now, if you're using a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the conda environment. Then, in your Python .py source files:
from dotenv import load_dotenv
from pyprojroot import here
import os
dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path) # this will load the .env file in your project directory root.
# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")
In this way, your runtime environment variables get loaded into the runtime environment, and become available to all child processes started from within the shell (e.g. Jupyter Lab, or Python, etc.).
Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
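If you want a quick, idempotent way to do that from the project root, a shell one-liner like this works (it creates .gitignore if it doesn't exist yet):
# Append .env to .gitignore only if it isn't already listed there
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore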
See also: Set up an awesome default gitignore for your projects
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Take full control of your shell environment variables
If you're not sure what environment variables are, I have an essay on them that you can reference. Mastering environment variables is crucial for data scientists!
Your shell environment, whether it is zsh or bash or fish or something else, is supremely important. It determines the runtime environment, which in turn determines which Python you're using, whether you have proxies set correctly, and more. Rather than leave this to chance, I would recommend instead gaining full control over your environment variables.
The simplest way is to set them explicitly in your shell initialization script. For bash shells, it's either .bashrc or .bash_profile. For the Z shell, it'll be the .zshrc file. In there, step by step, set the environment variables that you need system-wide.
For example, explicitly set your PATH environment variable with explainers that tell future you why you ordered the PATH in a certain way.
# Start with an explicit minimal PATH
export PATH=/bin:/usr/bin:/usr/local/bin
# Add in my custom binaries that I want available across projects
export PATH=$HOME/bin:$PATH
# Add in anaconda installation path
export PATH=$HOME/anaconda/bin:$PATH
# Add more stuff below...
If you want your shell initialization script to be cleaner, you can refactor it out into a second bash script called env_vars.sh, which lives either inside your home directory or your dotfiles repository (see: Leverage dotfiles to get your machine configured quickly). Then, source the env_vars.sh script from the shell initialization script:
source ~/env_vars.sh
Other installers, such as the Anaconda installer, may offer to modify your shell initialization script for you. If so, be sure to keep this in the back of your mind. At the end of your shell initialization script, you can echo the final state of your environment variables to help you debug.
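For example, a minimal sketch of what the tail end of your .bashrc or .zshrc might look like (swap in whichever variables you actually care about):
# Debugging aid: print the final state of important environment variables
echo "PATH=$PATH"
echo "CONDA_PREFIX=$CONDA_PREFIX"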
Environment variables that need to be set on a per-project basis are handled slightly differently. See Create runtime environment variable configuration files for each of your projects.
Use Jupyter as an experimentation playground
I use Jupyter notebooks in the following ways.
Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)
Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.
Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.
Get Jupyter installed in each of your environments by including it in your environment.yml file. (see: Create one conda environment per project)
Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.
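As a sketch, a per-project environment.yml that bundles JupyterLab (and ipykernel, which we'll need below) might look like the following; the environment name and Python version are placeholders:
name: my_project_env  # placeholder name
channels:
  - conda-forge
dependencies:
  - python=3.10
  - jupyterlab
  - ipykernel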
To get Jupyter to recognize the Python interpreter that is defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following command:
export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME
Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)
To launch Jupyter:
jupyter lab
In shell terms:
cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab
Use docker containers for system-level packages
If conda environments are such a great environment isolation tool, why would we need Docker?
That's because sometimes, your project might have an unavoidable dependency on system-level packages. I have seen some projects that use spatial mapping tooling require system-level packages. Others that depend on audio processing might require packages that can only be obtained outside of conda. In these cases, yes, installing them locally on your machine can be handy (see Install homebrew on your machine), but if you're also interested in building an app, then you'll need them packaged up inside a Docker container.
What is a Docker container? The best way to think about it is as a fully-fledged operating system completely insulated from its host (i.e. your computer). It has no knowledge of your runtime environment variables (see: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables). It's like having a completely clean operating system, without the cost of buying new hardware.
I'm assuming you've already obtained Docker on your system. (see: Install Docker on your machine).
The core thing you need to know how to write is a Dockerfile. This file specifies exactly how a Docker container is to be built. The easiest way to think about the Dockerfile syntax is that it's almost bash, with a bit of additional syntax. The Docker docs give an extremely thorough tutorial. For those who are more hands-on, I recommend pair coding with another more experienced individual who is willing to teach you the ropes, to build a Docker container when it becomes relevant to your problem.
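To make the shape of a Dockerfile concrete, here is a hypothetical sketch for a project that needs a system-level audio library; the base image, package, and file names are placeholders for whatever your project actually requires:
# Start from a Debian-based image that ships with conda pre-installed
FROM continuumio/miniconda3:latest

# Install a system-level package that conda/pip cannot provide (hypothetical example)
RUN apt-get update && apt-get install -y libsndfile1 && rm -rf /var/lib/apt/lists/*

# Recreate the project's conda environment inside the container
WORKDIR /app
COPY environment.yml .
RUN conda env create -f environment.yml

# Copy in the rest of the project source
COPY . .
CMD ["bash"]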
Use pyprojroot to define relative paths to the project root
If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.
Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:
from pyprojroot import here
root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"
here() returns a Python pathlib.Path object.
You can go as granular or as coarse-grained as you want.
Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.
from source_package.paths import timeseries_data_dir
import pandas as pd
data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")
Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and you're able to reference the data file in all of your scripts!
See also: Define single sources of truth for your data sources.
Set up an awesome default gitignore for your projects
There will be some files you'll never want to commit to Git. Some include:
If you commit them, then:
Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits you. If you want an awesome default one for Python:
cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python -o .gitignore
It will have .env available in there too! (see: Create runtime environment variable configuration files for each of your projects)
How is the .gitignore file parsed?
A .gitignore file is parsed according to the rules on its documentation page. It essentially follows the unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented:
.DS_Store files
These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:
*.DS_Store
The site/ directory
If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory by appending the following line:
site/
.ipynb_checkpoints directories
If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints.
.ipynb_checkpoints
Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.
Also, Set up an awesome default gitignore for your projects!
Adhere to best git practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit files to Git that you have had to create manually. That usually means version controlling:
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on preventing yourself from committing them automatically.
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear them out using nbstripout. That can be automated before committing them through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits)
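As a sketch, the relevant entry in a .pre-commit-config.yaml might look like this; pin rev to whichever nbstripout release you actually use:
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1  # placeholder: pin to the release you use
    hooks:
      - id: nbstripout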