Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, and you have to put out figures. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.8
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering over other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver, though as far as possible, staying consistent is preferable, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin the version of that particular package to be at the minimum that version. (See below for the syntax needed to pin a version.) At the same time, the new package version may break compatibility -- in this case, you will want to pin it to a maximum package version.

It's not always obvious, though, so be sure to use version control

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

  • For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
  • For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

  • Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
  • Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
  • Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on breaking API changes from older versions.

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to do upgrade packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!

Ensure your environment kernels are available to Jupyter

By practicing "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable, then you'll need to ensure that it's available. To do so, ensure that the Python environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Configure your conda installation

Why you would want to configure your conda installation

Configuring some things with conda can help lubricate your interactions with the conda package manager. It will save you keystrokes at the terminal, primarily, thus saving you time. The place to do this configuration is in the .condarc file, which the conda package manager searches for by default in your user's home directory.

The condarc docs are your best bet for the full configuration, but I have some favourites that I'm more than happy to share below.

How to configure your condarc

Firstly, you create a file in your home directory called .condarc. Then edit it to have the following contents:

channels:
  - conda-forge
  - defaults

auto_update_conda: True

always_yes: True

The whys

  • The auto_update_conda saves me from having to update conda all the time,
  • always_yes lets me always answer y to the conda installation and update prompts.
  • Setting conda-forge as the default channel above the defaults channel allows me to type conda install some_package rather than conda install -c conda-forge some_package each time I want to install a package, as conda will prioritize channels according to their order under the channels section.

About channel priorities

If you prefer, you can set the channel priorities in a different order and/or expand the list. For example, bioinformatics users may want to add in the bioconda channel, while R users may want to add in the r channel. Users who prefer stability may want to prioritize defaults ahead of conda-forge.

What this affects is how conda will look for packages when you execute the conda install command. However, it doesn't affect the channel priority in your per-project environment.yml file (see: Create one conda environment per project).

Other conda-related pages to look at

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and provide additional sub-categories inside there.

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

  • If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
  • If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is project root directory. This will simplify how you execute the scripts. You'll save on injecting "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:

build:
    bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get setup by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME = ______________  # replace with your env name
conda env create -f environment.yml || mamba env create -f $PROJECT_ENV_NAME
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

  1. Create the conda environment. (see: Create one conda environment per project)
  2. Install the custom source
  3. Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
  4. Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independent of computationally cheap ones that depend on them. One example, provided by one of my reviewers Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.

Install code checking tools to help write better code

Why install code checking tools?

If you're writing code, then having code checking tools that automatically check for potential issues while you write the code is like having a spell and grammar checker always on. It's also like having Grammarly check your spelling, grammar, and style all the time.

Code style can drift as a project proceeds. If you work with colleagues, code style nitpicks can become a source of frustration in interactions with them. Having code style checkers that automatically flag code style that deviates from a pre-defined norm can go a long way to easing these potential conflicts.

What kind of things should I check for?

Firstly, code style formatting tools. For Python projects:

  • Use black, the uncompromising code formatting tool and don't ask questions.
  • Use isort and don't ask questions. It will sort your imports for you. You'll love the magic, I guarantee it!

You can configure black and isort using a pyproject.toml config file.

Secondly, code problems. For Python projects:

  • Use interrogate to identify which functions don't have any docstrings attached to them.
  • Use pylint to find potential code errors, like dangling or unused variables, or variables used without declaring them beforehand.
  • If starting a new codebase, get MyPy in your project ASAP. This is also a great form of documentation, which is related to Write effective documentation for your projects.

You can configure some of these tools to work with Python source files in VSCode! (see: Use VSCode to help you with software development and collaboration) For example, you can configure VSCode to format your code using black on every save, so you don't have to keep running black before committing your code.

Many authors have written tons of enthusiastic blog posts on Python code style tooling. Here's a sampling of them:

Code style and code format tooling can be a rabbit hole that you run down. Those that I have listed above should give you a great starting point.

Configure Jupyter and Jupyter Lab

Project Jupyter is a cornerstone of the data science ecosystem. When getting set up to work with Jupyter, here are some settings that you may wish to configure if you are running your own Jupyter server (and don't have a managed one provided).

Jupyter configuration files

The best way to learn about Jupyter's configuration system is to read the official documentation. However, I will strive to provide an up-to-date overview here.

You can find Jupyter configuration files in your home directory at the directory ~/.jupyter/. Jupyter's server, which is what you execute when you execute jupyter lab (as of Jan 2021, the lab interface is the future of Jupyter).

To configure Jupyter's behaviour, you usually start by generating the configuration file:

jupyter lab --generate-config

Now, there will be a file created at ~/.jupyter/jupyter_lab_config.py. You can edit that file to configure Jupyter's behaviour.

Change IP address to 0.0.0.0

By convention, Jupyter's server runs on 127.0.0.1, which is aliased by localhost. Usually, this is not a problem if you run the server from your local machine and then access it by a browser on the same device. However, this configuration can get troublesome if you run Jupyter on a remote workstation or server.

Here's why.

When configured to run on 127.0.0.1, the Jupyter server will only accept browser requests originating from the local machine. It will deny browser requests that originate from another device will. However, when the Jupyter server runs on 0.0.0.0 instead, it will be able to accept browser requests that originate from outside the server itself (i.e. your laptop).

Are there risks to serving the notebook up on 0.0.0.0? Yes, the primary one is that your notebook server is now accessible to anybody on the same network as the machine that is serving up Jupyter. To mitigate that risk, Jupyter has "token-based authentication" turned on by default; you might also want to consider turning on password protection.

Add password protection

You can turn on password protection by running the following command at the terminal:

jupyter lab password

Jupyter will prompt you for a password and store a hash of the password in the file ~/.jupyter/jupyter_server_config.json. (You don't need to edit this file!)

When you enable password protection, you will be prompted for a password when you access the Jupyter lab interface.

Change the minimum port number

Sometimes you and your colleagues might share the same workstation and run your Jupyter servers on there. Because the default server configuration starts at port 8888, you might end up with the following scenario (primarily if you practice running one Jupyter server per project):

Notebook port Owner
8888 yourself
8889 yourself
8890 your colleague
8891 yourself

To avoid this scenario, you can agree with your colleague that they can keep their configuration, and you can start your port numbers at 9000 (or 10000). To do so, open up ~/.jupyter/jupyter_lab_config.py and set the config traitlet below:

c.ServerApp.port = 9000

Now, your port numbers will start from port 9000 upwards, helping to keep your port number range and your colleagues' port number range effectively separated.

Use Jupyter as an experimentation playground

What are the use cases for Jupyter?

I use Jupyter notebooks in the following ways.

Firstly, I use them as a prototyping environment. They are wonderful, because I can hold the state of a program in memory and interactively modify it until I get what I need out of the program. (This especially saves on time spent re-computing things.)

Secondly, I use Jupyter as an authoring environment for interactive computational teaching material. For example, I structured Network Analysis Made Simple as a series of Jupyter notebooks.

Finally, on occasion, I use Jupyter with ipywidgets and Voila to build out dashboards and interactive applications for my colleagues.

How do you get Jupyter?

Get Jupyter installed in each of your environments, by including it in your environment.yml file. (see: Create one conda environment per project)

Doing so is based on advice I received at SciPy 2016, in which one of the Jupyter developers strongly advised against "global" installations of Jupyter, to avoid package conflicts.

How do you get Jupyter to recognize your environment's Python?

To get Jupyter to recognize the Python interpreter that defined by your conda environment (see: Create one conda environment per project), you need to make sure you have ipykernel installed inside your environment. Then, use the following command:

export ENV_NAME="put_your_environment_name_here"
conda activate $ENV_NAME
python -m ipykernel install --user --name $ENV_NAME

How do you launch Jupyter?

Newcomers to Anaconda are usually spoonfed the GUI, but I am a proponent of launching Jupyter from the terminal because doing so makes us fully aware of our environment, including the environment variables. (see the related: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables)

To launch Jupyter:

  1. Open your shell
  2. Navigate to your project directory
  3. Activate your conda environment
  4. Then launch Jupyter Lab: jupyter lab

In shell terms:

cd /path/to/project/directory
conda activate $ENV_NAME
jupyter lab

Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As you as you did that, you created two sources of truth for that one function.

Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project src/ directory, ensure you have a few files:

|- src/
   |- setup.py
   |- source_package/  # rename this to the same name as the conda environment
      |- data/         # for all data-related functions
         |- loaders.py # convenience functions for loading data
         |- schemas.py # this is for pandera schemas
      |- __init__.py   # this is necessary
      |- paths.py      # this is for path definitionsme
      |- utils.py      # utiity functions that you might need
      |- ...
   |- tests/
      |- test_utils.py # tests for utility functions
      |- ...

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

setup.py should look like this:

import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="source_package", # Replace with your environment name
    version="0.1",         # Replace with anything that you need
    packages=setuptools.find_packages(),
)

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
cd src
pip install -e .

This will install the source package in development mode. As you continue to add more code into the custom source package, they will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried to organize a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Install Anaconda on your machine

What is anaconda

Anaconda is a way to get a Python installed on your system.

One of the neat but oftentimes confusing things about Python is that you can have multiple Python executables living around on your system. Anaconda makes it easy for you to:

  1. Obtain Python
  2. Manage different Python versions into isolated environments using a consistent interface
  3. Install packages into these environments

Why use anaconda?

Why is this a good thing? Primarily because you might have individual projects that need different version of Python and different versions of packages that are built for Python. Also, default Python installations, such as the ones shipped with older versions of macOS, tend to be versions behind the latest, which is to the detriment of your projects. Some built-in apps in an operating system may depend on that old version of Python (such as iPhoto), which means if you mess up the installation, you might break those built-in apps. Hence, you will want a tool that lets you easily create isolated Python environments.

The Anaconda Python distribution fulfills the following key needs:

  1. You'll be able to create isolated environments on a per-project basis. (see: Follow the rule of one-to-one in managing your projects)
  2. You'll be able to install packages into those isolated environments, and evolve them over time. (see: Create one conda environment per project)

Installing Anaconda on your local machine thus helps you get easy access to Python, Jupyter (see: Use Jupyter as an experimentation playground), and other tools for modelling and analysis.

How to get anaconda?

If you're on macOS: I'm assuming you have installed homebrew (see: Install homebrew on your machine) and wget. Then, install Miniconda, which will be a lighter-weight installer, using the following command:

cd ~
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O anaconda.sh

This will send you to your home directory, and then download the Miniconda bash script installer from Anaconda's download page.

If you're on Linux: Make sure you have wget available on your system. Then:

cd ~
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O anaconda.sh

This will download the Miniconda installer for Linux operating sytems onto your home directory.

If you don't have wget: You can head over to the Miniconda docs and download the bash installer to whatever location you want (the home directory is a convenient place). Rename it to anaconda.sh to stay compatible with the instructions below.

Now, install Anaconda:

bash anaconda.sh -b -p $HOME/anaconda/

This will install the Anaconda distribution of Python onto your system inside your home directory. You can now install packages at will, without needing sudo privileges!

Next steps

Level-up your conda skills

Follow the rule of one-to-one in managing your projects

What is this rule all about

The one-to-one rule essentially means this. Each project that we work on gets:

In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).

Why is this important

Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.

Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.

When can we break this rule

A few guidelines can help you decide.

When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.

When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.

One project should get one git repository

Why one project should get one Git repository

This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:

In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.

How to get this implemented

Easy! Create your Git repo for the project, and then start putting stuff in there :).

Enough said here!

What should you name the Git repo? See the page: Sanely name things consistently

After you have set up your Git repo, make sure to Set up your project with a sane directory structure.

Also, Set up an awesome default gitignore for your projects!

Sanely name things consistently

Why should you name things consistently

Think about the following scenario:

  • Your project is called sales-forecast
  • It lives in a Git repository hosted on GitHub called forecast-2020
  • Your conda environment is named something you copied and pasted from a tutorial, say my_env
  • Your custom source code is named my_source.

Are you going to be able to ever mentally map them to one another? Probably not, though maybe if you did put in the effort to do so, you might be able to. That said, if you work with someone else on the project, you're only going to increase the amount of mental work they need to do to keep things straight.

Now, consider a different scenario:

  • Your project is called Sales Forecast 2020
  • Your Git repository is called sales-forecast-2020
  • Your conda environment is called sales-forecast-2020-env
  • And your custom source code package is called sales_forecast_2020.

Does the latter seem saner? I think so too :).

What constitutes a "sane" name?

I think the following guidelines help:

  1. 2 words are preferred, 3 words are okay, 4 is bordering on verbose; 5 or more words is not really acceptable.
  2. Explicit, precise, and well-defined for a "local" scope, where "local" depends on your definition.

I would add that learning how to name things precisely in English, and hence provide precise variable names in Python code, is a great way for English second language speakers to practice and expand their language vocabulary.

As one of my reviewers (Logan Thomas) pointed out, leveraging the name to help newcomers distinguish between entities is helpful too. For this reason, your environment can be suffixed with a consistent noun; for example, I have -dev as a suffix to make for software package-oriented projects; above, we used -env as a suffix (making sales-forecast-2020-env) to indicate to a newcomer that we're activating an environment when we conda activate sales-forecast-2020-env. As long as you're consistent, that's not a problem!