Use Docker containers for system-level packages

Why you might need to use Docker

If conda environments are such a great environment isolation tool, why would we need Docker?

That's because sometimes your project might have an unavoidable dependency on system-level packages. I have seen projects that use spatial mapping tooling require system-level libraries, and others that depend on audio processing need packages that can only be obtained outside of conda. In these cases, yes, installing them locally on your machine can be handy (see: Install homebrew on your machine), but if you're also interested in building an app, then you'll need them packaged up inside a Docker container.

What is a Docker container? The best way to anchor your thinking is to treat it as a fully-fledged operating system completely insulated from its host (i.e. your computer). It has no knowledge of your runtime environment variables (see: Create runtime environment variable configuration files for each of your projects and Take full control of your shell environment variables). It's like having a completely clean operating system, without the cost of buying new hardware.

How do we use Docker

I'm assuming you've already installed Docker on your system (see: Install Docker on your machine).

The core thing you need to know how to write is a Dockerfile. This file specifies exactly how a Docker container is to be built. The easiest way to think about the Dockerfile syntax is that it's almost bash, with a bit of additional syntax. The Docker docs give an extremely thorough tutorial. If you're more hands-on, I recommend pair coding with a more experienced individual who is willing to teach you the ropes, and building a Docker container together when it becomes relevant to your problem.
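
Once you have a Dockerfile in your project root, building and running the container is a two-command affair (the image name below is just a placeholder):

# Build an image from the Dockerfile in the current directory, tagging it with a name of your choice
docker build -t my-project-image .

# Run a container from that image interactively (what you see depends on the image's default command)
docker run -it my-project-image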

Create runtime environment variable configuration files for each of your projects

Why configure environment variables per project

When you work on your projects, you will usually assume that your development environment looks like your project's runtime environment, including its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)

How to configure environment variables for your project

Here, I'm assuming that you Use pyprojroot to define relative paths to the project root.

To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as follows:

export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"

We use the export syntax here so that, in our shells, we can run the command source .env and have the environment variables defined there applied to our environment.
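
For example, from the project root, and using the variable names from the example above:

# Load the variables into the current shell session
source .env

# Verify that one of them is now set
echo $DATABASE_CONNECTION_STRING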

Now, if you're working on a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the project's conda environment. Then, in your Python .py source files:

from dotenv import load_dotenv
from pyprojroot import here
import os

dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path)  # this will load the .env file in your project directory root.

# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")

In this way, your project's environment variables get loaded into the running Python process. If you source the .env file in your shell instead, they become available to all child processes started from within that shell (e.g. Jupyter Lab, or Python, etc.).

Always gitignore your .env file

Your .env file might contain sensitive secrets. You should always ensure that your .gitignore file contains an entry for .env.
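
A quick way to do this and verify it (a small sketch; adapt it to your setup):

# Append .env to the project's .gitignore if it isn't there already
echo ".env" >> .gitignore

# Ask Git whether the file is ignored; it prints the matching rule if so
git check-ignore -v .env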

See also: Set up an awesome default gitignore for your projects

Install homebrew on your machine

Why install Homebrew?

Your Mac comes with a lot of neat apps, but it's a bit crippled when it comes to shell utilities. (Linux machines can use Homebrew too! Read on to see when you might need it.)

As its tagline claims, Homebrew is "the missing package manager" for the Mac. From it, you can get shell utilities and apps that don't come pre-installed on your computer, such as wget. Installing these shell utilities can give you a leg up as you strive to gain mastery over your machine. (see: Install a suite of really cool utilities on your machine using homebrew)

How do we install Homebrew?

Follow the instructions on the Homebrew website, but essentially, it's a one-line bash command. Usually, you would copy/paste it from the Homebrew website, but I've copied it over so you don't have to context-switch:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

It can be executed anywhere, but if you're feeling superstitious, you can always move to your home directory first (cd ~) before executing the command.
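
Once the installer finishes, a quick sanity check never hurts (brew doctor may print warnings that are worth skimming):

# Confirm that brew is on your PATH and report its version
brew --version

# Check for common configuration problems
brew doctor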

Once you're done...

If you're planning to install Anaconda (see: Install Anaconda on your machine), then make sure you install wget, as my bootstrap step for installing Anaconda relies on using wget to pull the installer from the internet.

brew install wget

You can also install some other cool utilities using brew! (see: Install a suite of really cool utilities on your machine using homebrew)

What about Linux machines?

Linux machines usually come with their own package manager, such as yum on CentOS and apt on Ubuntu. If you have the necessary privileges to install packages, which usually means having sudo privileges on your machine, then you probably don't need to install Homebrew on Linux.

However, if you do not have sudo privileges on your machine, then you should consider installing Homebrew inside your home directory. This enables you to use brew to install Linux utilities that might not be built-in to your system. It's a pretty neat hack to have when you're working on a managed system, such as a high performance computing system.
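
One common way to do this, sketched below, is to clone Homebrew into your home directory and add it to your PATH; the exact steps change over time, so double-check the Homebrew documentation for the current instructions:

# Clone Homebrew into your home directory (no sudo required)
git clone https://github.com/Homebrew/brew ~/homebrew

# Have your shell initialization script set up Homebrew's environment
echo 'eval "$(~/homebrew/bin/brew shellenv)"' >> ~/.bashrc

# Reload your shell configuration and confirm that brew works
source ~/.bashrc
brew --version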

Take full control of your shell environment variables

Why control your environment variables

If you're not sure what environment variables are, I have an essay on them that you can reference. Mastering environment variables is crucial for data scientists!

Your shell environment, whether it is zsh or bash or fish or something else, is supremely important. It determines the runtime environment, which in turn determines which Python you're using, whether you have proxies set correctly, and more. Rather than leave this to chance, I would recommend instead gaining full control over your environment variables.

How do I control my environment variables

The simplest way is to set them explicitly in your shell initialization script. For bash shells, it's either .bashrc or .bash_profile. For the Z shell, it'll be the .zshrc file. In there, step by step, set the environment variables that you need system-wide.

For example, explicitly set your PATH environment variable with explainers that tell future you why you ordered the PATH in a certain way.

# Start with an explicit minimal PATH
export PATH=/bin:/usr/bin:/usr/local/bin

# Add in my custom binaries that I want available across projects
export PATH=$HOME/bin:$PATH

# Add in anaconda installation path
export PATH=$HOME/anaconda/bin:$PATH

# Add more stuff below...
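
After reloading your shell, you can check that the ordering took effect; assuming the anaconda entry above was prepended last, it should win:

# Show which python the shell resolves to first on the PATH
which python

# Inspect the final ordering of the PATH
echo $PATH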

If you want your shell initialization script to be cleaner, you can refactor it out into a second bash script called env_vars.sh, which lives either inside your home directory or your dotfiles repository (see: Leverage dotfiles to get your machine configured quickly). Then, source the env_vars.sh script from the shell initialization script:

source ~/env_vars.sh

Other things, like the Anaconda installer, may offer to modify your shell initializer script for you. If so, be sure to keep this in the back of your mind. At the end of your shell initializer script, you can echo the final state of your environment variables to help you debug.
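
For example, you might temporarily add something like this at the very bottom of the script while debugging:

# Print the final PATH so you can see what every new shell ends up with
echo "PATH is: $PATH"

# Or dump all environment variables, sorted, for the fuller picture
env | sort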

Environment variables that need to be set on a per-project basis are handled slightly differently. See Create runtime environment variable configuration files for each of your projects.

Install Docker on your machine

Why you will need Docker

Docker is used in a myriad of ways, but here are the main reasons I see for a data scientist to want to use Docker:

  1. It solves the "it works on my machine" headache.
  2. Those of us who operate as full-stack data scientists might sometimes be involved in deploying models as apps with front-ends for those whom we serve.
  3. Vicki Boykis said so.

While conda environments give you everything you need for the practice of data science on your local (or remote) machine, Docker containers give you the ultimate portability. From a technical standpoint, conda environments package up data science libraries but stop short of the system-level stuff that ships with the operating system, which your project might unavoidably depend on. Docker lets you ship an entire operating system plus anything else you install in it.

Docker (the company) has a few good reasons why you would want to use Docker (the tool), and you can read about them here. Those reasons are likely also applicable to your own work.

How do you install Docker

Install the Docker Desktop client first. Don't worry about registering for an account on Docker Hub, though that can be useful later.

If you're on Linux, there are a few guides you can follow, my favourite being the curated guides from DigitalOcean.

If you do a quick Google search for "install docker" with your operating system name tagged on, look out first for the DigitalOcean tutorials, which are in my opinion the best maintained.
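
Whichever route you take, you can confirm that Docker is working by running the standard hello-world image:

# Pulls a tiny test image and runs it; prints a confirmation message if Docker is set up correctly
docker run hello-world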

Navigate the packaging world

Where do we get our software from? Most commonly, it comes from package repositories that we interact with using package managers. Package managers come in many flavours. In the Python data science world alone, there are 2-3 package managers that one needs to be aware of, as the ecosystem is big and wide. Oftentimes we have to compose them together.

The overview is that there are "conda" packages, which share a large overlap with "pip" packages, and both share very little overlap with "system" packages.

These are the general "levels" of abstraction at which packages can be installed:

  • "Project-specific" - at which environment.yml comes into play
  • "User-specific" - at which Homebrew comes into play
  • "System-wide" - for which your system package manager comes into play (if applicable)

Be sure to know the level of abstraction at which each tool needs to be installed, so that you can compose together the toolset that your project requires!
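
To make those levels concrete, here is a sketch of what an install at each level might look like (the package names are illustrative only):

# Project-specific: packages declared in the project's environment.yml
conda env update -f environment.yml

# User-specific: shell utilities installed into your own Homebrew prefix
brew install wget

# System-wide: libraries installed with the system package manager (requires sudo)
sudo apt-get install libsndfile1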