Data Science Bootstrap

Install Anaconda on your machine

What is anaconda

Anaconda is a way to get Python installed on your system.

One of the neat but oftentimes confusing things about Python is that you can have multiple Python executables living around on your system. Anaconda makes it easy for you to:

Obtain Python
Manage different Python versions in isolated environments using a consistent interface
Install packages into these environments

Why use anaconda (or one of its variants)?

Why is this a good thing? Primarily because you might have individual projects that need different versions of Python and different versions of packages that are built for Python. Also, default Python installations, such as the ones shipped with older versions of macOS, tend to lag several versions behind the latest release, which is to the detriment of your projects. Some built-in apps in an operating system may depend on that old version of Python (such as iPhoto), which means that if you mess up the installation, you might break those built-in apps. Hence, you will want a tool that lets you easily create isolated Python environments.

The Anaconda Python distribution fulfills the following key needs:

You'll be able to create isolated environments on a per-project basis. (see: Follow the rule of one-to-one in managing your projects)
You'll be able to install packages into those isolated environments, and evolve them over time. (see: Create one conda environment per project)

Installing Anaconda on your local machine thus helps you get easy access to Python, Jupyter (see: Use Jupyter as an experimentation playground), and other tools for modelling and analysis.

How to get anaconda?

To install the Miniforge variant of Anaconda, which is lighter-weight than the full Anaconda distribution, use the following commands:

cd ~
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -O anaconda.sh

This will send you to your home directory, and then download the Miniforge bash script installer from the conda-forge Miniforge releases page on GitHub, saving it as anaconda.sh.
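To see what the download URL expands to on your machine, here is a minimal sketch. The download_installer helper is hypothetical (it is not part of the original notes); it assumes you may want to fall back to curl on machines where wget is not installed.

```shell
# Build the same installer URL that the wget command above uses.
# `uname` prints the operating system name and `uname -m` prints the
# CPU architecture; together they select the matching installer file.
installer_url="https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
echo "Installer URL: $installer_url"

# Hypothetical helper: download with wget when it is available,
# otherwise fall back to curl (-L follows GitHub's redirects).
download_installer() {
    if command -v wget >/dev/null 2>&1; then
        wget "$installer_url" -O "$1"
    else
        curl -fsSL "$installer_url" -o "$1"
    fi
}

# Usage (uncomment to actually download into your home directory):
# download_installer "$HOME/anaconda.sh"
```

Printing the URL first is a cheap way to check that the operating system and architecture were detected as you expect before downloading anything.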

Now, install Anaconda:

bash anaconda.sh -b -p $HOME/anaconda/

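This will install the Miniforge distribution of Python into the anaconda directory inside your home directory. The -b flag runs the installer non-interactively, and -p sets the installation location. You can now install packages at will, without needing sudo privileges!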
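Because a non-interactive install does not touch your shell startup files, you will likely want to hook conda into your shell afterwards. The following is a sketch, assuming the installation location used above ($HOME/anaconda); the CONDA_BIN and conda_status variable names are made up for illustration.

```shell
# Assumption: the installer was run with -p $HOME/anaconda/, as above.
CONDA_BIN="$HOME/anaconda/bin/conda"

if [ -x "$CONDA_BIN" ]; then
    # `conda init` edits your shell startup file so that the `conda`
    # command and `conda activate` work in new terminal sessions.
    # Pass a shell name (e.g. `conda init zsh`) if you need a specific one.
    "$CONDA_BIN" init
    conda_status="initialized"
else
    conda_status="missing"
    echo "No conda executable found at $CONDA_BIN; run the installer first."
fi
```

Open a new terminal after running this so that the updated startup file is picked up.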

Next steps

Level-up your conda skills

Use Jupyter as an experimentation playground

What are the use cases for Jupyter?

I use Jupyter notebooks in the following ways.

Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)

Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.

Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.

How do you get Jupyter?

Get Jupyter installed in each of your environments, by including it in your environment.yml file. (see: Create one conda environment per project)

Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.

How do you get Jupyter to recognize your environment's Python?

To get Jupyter to recognize the Python interpreter that is defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following commands:

export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME

How do you launch Jupyter?

Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)

To launch Jupyter:

Open your shell
Navigate to your project directory
Activate your conda environment
Then launch Jupyter Lab: jupyter lab

In shell terms:

cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab

Use Mamba as a faster drop-in replacement for conda

What is mamba

Mamba is a project originally developed by the QuantStack team. They went in and solved some of the annoyances with the conda package manager, specifically the problem of how long it takes to solve an environment specification.

How do you get mamba

Mamba is available on conda-forge. Follow the instructions on the mamba repo to install it.

Alias mamba to conda

If you have muscle memory and want to make the switch from conda to mamba as easy as possible, you can use a shell alias inside your sourced .aliases file:

alias conda="mamba"
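If your .aliases file is shared across machines, some of which may not have mamba, a guarded version of the alias can be safer. This is a sketch; the conda_alias_target variable is a made-up name used only to report which tool the conda command will resolve to.

```shell
# Only alias conda to mamba when mamba is actually installed, so that
# machines without mamba keep using plain conda.
if command -v mamba >/dev/null 2>&1; then
    alias conda="mamba"
    conda_alias_target="mamba"
else
    conda_alias_target="conda"
fi
echo "The conda command will run: $conda_alias_target"
```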

See the page Create shell command aliases for your commonly used commands for more information on shell aliases.

Follow the rule of one-to-one in managing your projects

What is this rule all about

The one-to-one rule essentially means this. Each project that we work on gets:

One Git repository (see: One project should get one git repository)
One conda environment (see: Create one conda environment per project)
One custom source package inside the repo (see: Place custom source code inside a lightweight package)
One documentation source inside the repo (see: Write effective documentation for your projects), including a well-maintained README file
One continuous integration pipeline (see: Build a continuous integration pipeline for your source)

In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).

Why is this important

Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.

Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.

When can we break this rule

A few guidelines can help you decide.

When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.

When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.

Quickly remove a conda environment without using conda env remove

Why would you want to do this

Removing conda environments is usually done with the following command at the terminal:

conda env remove -n some_environment_name

Sometimes, however, the command is slow.
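One faster alternative, sketched here under assumptions, is to delete the environment's directory directly, since a named conda environment is a folder on disk. The sketch assumes your environments live in the envs folder of the installation location used in these notes ($HOME/anaconda/envs); the remove_env_dir function and the CONDA_ENVS_DIR variable are hypothetical names, and the checks are there to avoid deleting the wrong directory.

```shell
# Assumption: environments live under the envs/ folder of the install
# location used earlier in these notes. Override CONDA_ENVS_DIR if not.
CONDA_ENVS_DIR="${CONDA_ENVS_DIR:-$HOME/anaconda/envs}"

# Hypothetical helper: remove an environment by deleting its directory.
remove_env_dir() {
    env_name="$1"
    # Refuse empty names and anything that could escape the envs folder.
    case "$env_name" in
        ""|*/*|.|..)
            echo "Refusing to remove '$env_name'" >&2
            return 1
            ;;
    esac
    if [ -d "$CONDA_ENVS_DIR/$env_name" ]; then
        rm -rf "$CONDA_ENVS_DIR/$env_name"
        echo "Removed $CONDA_ENVS_DIR/$env_name"
    else
        echo "No environment directory at $CONDA_ENVS_DIR/$env_name" >&2
        return 1
    fi
}

# Usage:
# remove_env_dir some_environment_name
```

Double-check the path before running anything built on rm -rf, and make sure the environment is not currently activated.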

Install homebrew on your machine

Why install Homebrew?

Your Mac comes with a lot of neat apps, but it's a bit crippled when it comes to shell utilities. (Linux machines can use Homebrew too! Read on to see when you might need it.)

As claimed, Homebrew is the missing package manager for the Mac. From it, you can get shell utilities and apps that don't come pre-installed on your computer, such as wget. Installing these shell utilities can give you a leg-up as you strive to gain mastery over your machine. (see: Install a suite of really cool utilities on your machine using homebrew)

How do we install Homebrew?

Follow the instructions on the Homebrew website, but essentially, it's a single bash command. Usually, you would copy/paste it from the Homebrew website, but I've copied it over so you don't have to context-switch:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

It can be executed anywhere, but if you're feeling superstitious, you can always move to your home directory first (cd ~) before executing the command.

Once you're done...

If you're planning to install Anaconda (see: Install Anaconda on your machine), then make sure you install wget, as my bootstrap step for installing Anaconda relies on using wget to pull the installer from the internet.

brew install wget

You can also install some other cool utilities using brew! (see: Install a suite of really cool utilities on your machine using homebrew)

What about Linux machines?

Linux machines usually come with their own package manager, such as yum on CentOS and apt on Ubuntu. If you have the necessary privileges to install packages, which usually means having sudo privileges on your machine, then you probably don't need to install Homebrew on Linux.

However, if you do not have sudo privileges on your machine, then you should consider installing Homebrew inside your home directory. This enables you to use brew to install Linux utilities that might not be built-in to your system. It's a pretty neat hack to have when you're working on a managed system, such as a high performance computing system.

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, right when you need to put in new versions of figures that were made before. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project-name-goes-here  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver. Even so, staying consistent is preferable as far as possible, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin that particular package to at least that version. (See below for the syntax needed to pin a version.) At the same time, a new package version may break compatibility; in this case, you will want to pin the package to a maximum version.

It's not always obvious when to pin, though, so be sure to use version control.

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on API changes that are incompatible with older versions.
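To make the modifiers concrete, here is a small sketch that writes an example environment file to a temporary location and prints it. The environment name and all version numbers are illustrative assumptions, not recommendations.

```shell
# Write an illustrative environment file to a temporary path so that
# nothing in your project directory is overwritten.
pins_file="$(mktemp)"
cat > "$pins_file" <<'EOF'
name: pinning-demo
channels:
- conda-forge
dependencies:
- python=3.10      # conda: single equals pins a specific version
- pandas>=1.5      # conda: at least this version
- numpy<2          # conda: stay below this version
- pip
- pip:
  - mkdocs==1.5.3        # pip: double equals pins a specific version
  - mkdocs-material>=9   # pip: at least this version
EOF
cat "$pins_file"
```

Note that the conda lines use a single equals sign while the pip lines nested under pip use a double equals sign.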

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two ways to upgrade packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml
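If you are not sure whether the environment already exists on a given machine, a small wrapper can create it on first use and update it afterwards. This is a sketch with hypothetical helper names (env_name_from_file, env_in_list, sync_env); it assumes the environment name is on a top-level name: line in the file and that conda env list prints the environment name in its first column.

```shell
# Hypothetical helper: read the environment name from the `name:` line.
env_name_from_file() {
    sed -n 's/^name:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed if the name given as $1 appears in the
# first column of `conda env list` output supplied on standard input.
env_in_list() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Create the environment if it is missing, otherwise update it.
sync_env() {
    env_file="${1:-environment.yml}"
    env_name="$(env_name_from_file "$env_file")"
    if conda env list | env_in_list "$env_name"; then
        conda env update -f "$env_file"
    else
        conda env create -f "$env_file"
    fi
}

# Usage, from your project directory:
# sync_env environment.yml
```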

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!

Ensure your environment kernels are available to Jupyter

If you practice "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If your project's environment does not show up in Jupyter, first check that the environment has the package ipykernel installed. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Configure your conda installation

Why you would want to configure your conda installation

Configuring some things with conda can help lubricate your interactions with the conda package manager. It will save you keystrokes at the terminal, primarily, thus saving you time. The place to do this configuration is in the .condarc file, which the conda package manager searches for by default in your user's home directory.

The condarc docs are your best bet for the full configuration, but I have some favourites that I'm more than happy to share below.

How to configure your condarc

Firstly, you create a file in your home directory called .condarc. Then edit it to have the following contents:

channels:
  - conda-forge
  - defaults

auto_update_conda: True

always_yes: True

The whys

auto_update_conda saves me from having to update conda all the time.
always_yes lets me always answer y to the conda installation and update prompts.
Setting conda-forge as the default channel above the defaults channel allows me to type conda install some_package rather than conda install -c conda-forge some_package each time I want to install a package, as conda will prioritize channels according to their order under the channels section.
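If you would rather not edit .condarc by hand, the same settings can be applied with the conda config command. This is a sketch: the configure_conda function and conda_config_status variable are made-up names, and it relies on conda config --add placing a channel at the top of the list, so adding defaults first and conda-forge second leaves conda-forge with the higher priority.

```shell
# Hypothetical helper: apply the settings described above via the CLI.
configure_conda() {
    # --add puts the channel at the top of the channel list.
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --set auto_update_conda true
    conda config --set always_yes true
}

if command -v conda >/dev/null 2>&1; then
    configure_conda
    # Print the resulting channel order to confirm the priority.
    conda config --show channels
    conda_config_status="configured"
else
    conda_config_status="conda-not-found"
    echo "conda is not on your PATH; install it first."
fi
```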

About channel priorities

If you prefer, you can set the channel priorities in a different order and/or expand the list. For example, bioinformatics users may want to add in the bioconda channel, while R users may want to add in the r channel. Users who prefer stability may want to prioritize defaults ahead of conda-forge.

What this affects is how conda will look for packages when you execute the conda install command. However, it doesn't affect the channel priority in your per-project environment.yml file (see: Create one conda environment per project).

Bootstrap a scratch conda environment

A scratch environment is your playground

In a pinch, you might want to muck around on your system with some quick-and-dirty experiment. Having a suite of packages inside a scratch environment can be handy. Your scratch environment can be your base environment if you'd like, but I would strongly recommend creating a separate scratch environment instead.

How to bootstrap a scratch environment

I would recommend that you bootstrap a scratch conda environment with some basic data science packages.

mamba activate base
mamba install -c conda-forge \
    scipy numpy pandas matplotlib \
    jupyter jupyterlab \
    scikit-learn ipython ipykernel \
    ipywidgets mamba

(Replace mamba with conda if you don't have mamba installed on your system.)
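The commands above install into the base environment. If you prefer the separate scratch environment recommended earlier, a sketch along these lines may help. The environment name scratch and the helper names pick_installer and create_scratch_env are assumptions for illustration.

```shell
# Hypothetical helper: prefer mamba when it is installed, else use conda.
pick_installer() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}

SCRATCH_ENV_NAME="scratch"  # assumed name; pick whatever you like

# Create a dedicated scratch environment with the same packages.
create_scratch_env() {
    "$(pick_installer)" create -n "$SCRATCH_ENV_NAME" -c conda-forge \
        python scipy numpy pandas matplotlib \
        jupyter jupyterlab scikit-learn \
        ipython ipykernel ipywidgets
}

echo "Would create '$SCRATCH_ENV_NAME' using: $(pick_installer)"

# Usage (uncomment to actually create and activate the environment):
# create_scratch_env
# conda activate "$SCRATCH_ENV_NAME"
```

Keeping the scratch packages out of base means a broken experiment can be thrown away without touching the installation itself.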

Doing so gives you an environment where you can quickly prototype new things without necessarily going through the overhead of creating an entirely new project (and with it, a full conda environment).

Installing mamba can be helpful if you want a faster drop-in replacement for conda. (see: Use Mamba as a faster drop-in replacement for conda for more information.)

Variants of Anaconda

If you're a conda user, you may have heard of the Anaconda distribution of Python. In this set of notes, however, I've also referenced the Miniforge distribution of Python. What's the difference here? How do you pick which one to use? To answer those questions, we must first understand what a distribution of Python is.

Python distributions

Python can get distributed to users in many ways. You can download it directly from the official Python Software Foundation's (PSF) website. Or you can install it onto your system using the official Anaconda installer, through Homebrew, or through your official Linux package manager. Each way of installing Python can be thought of as a distribution of Python. Each distribution of Python differs ever so slightly. Official Python from the PSF comes with just the standard library. Anaconda, however, ships with the standard library and many other packages that are relevant for data science.

What is common across all Python distributions, however, is that each ships with a Python executable that, at the end of installation, should be discoverable on your PATH environment variable.

Most commonly, there will be a Python package installer that ships with the distribution as well. This can be pip, the official tool for installing Python packages, or it could be conda, which was developed by the company Anaconda.

As such, the anatomy of a distribution is essentially nothing more than:

A Python interpreter that can be discovered on your PATH,
A Python package manager, and
Any other default Python packages that the distributor thinks you might want

With that aside, let's look at three distributions of Python that are relevant to this set of notes.

Anaconda Python

The Anaconda distribution of Python is the official distribution from Anaconda. It ships with a modern version of Python, both pip and conda package managers, and a whole slew of default data science packages (pandas, numpy, scikit-learn, scipy, matplotlib, for example). With the Anaconda distribution, conda is configured such that packages are installed from the anaconda repository of packages, hosted by Anaconda itself. Its default installation location is ~/anaconda or ~/anaconda3.

Miniconda Python

The Miniconda Python distribution also comes from Anaconda. It looks like Anaconda except it ships with fewer packages in the base environment. You wouldn't, for example, find pandas installed for you. This was mostly intended to keep the base environment small for use within Docker containers.

Its default installation location is ~/miniconda or ~/miniconda3.

Miniforge Python

This distribution of Python comes from the open-source developer team behind conda-forge. Miniforge looks like Miniconda, but instead of configuring conda to pull packages from the anaconda repository, conda packages are instead pulled from the conda-forge repository of packages by default. This has the advantage of being able to pull more bleeding-edge versions of packages that you may use. Additionally, Miniforge Python ships with mamba as well. (See: Use Mamba as a faster drop-in replacement for conda)

Which to use?

Depends on your persona! If you're an indie hacker type, I would strongly recommend the Miniforge Python as it is lightweight and fast to get set up with and fully open source. On the other hand, if you're more inclined to want enterprise support, vetting of packages, and wish to support a company that backs so much of the Python open source world, then I would recommend reaching out to Anaconda and talking with their sales reps.