Prerequisite Knowledge

Introduction

In this notebook, I will go over some fundamental concepts that are super important for data scientists to know: why they matter, and how ignoring them will hurt your workflow.

Virtual Environments

Virtual environments are the first foundational concept you should know. A virtual environment is an isolated software stack for a given project, sharing as little as possible with other projects.

Docker containers

At one extreme of virtual environments are Docker containers. Here, the isolated software stack encompasses an entire Linux operating system, everything short of "which CPU it runs on". System libraries, Python packages, and C libraries are all packaged into a single logical "unit", which can be distributed to others easily as well.

Docker containers are canonically declared in a Dockerfile (though you can name it whatever you want), which contains bash-like instructions (mixed with Docker-specific syntax) to build the environment.
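
As a quick sketch, a minimal Dockerfile for a data science project might look like the following. (The base image and file paths here are illustrative assumptions, not part of this tutorial's setup.)

# Start from a publicly available base image that ships with conda.
FROM continuumio/miniconda3

# Copy the environment spec into the image and build the conda environment.
COPY environment.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml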

conda environments

For data scientists, a more constrained type of environment is the conda environment. Here, we leave system-level libraries alone, and instead focus on the packages that we need for our data science work. The line between system-level and project-level is admittedly blurry; a good rule of thumb is to assign "packages I import" to the category of data science packages, and everything else to system packages.

conda environments are canonically declared in an environment.yml file (though you can name it whatever you want), and use YAML syntax to declare the environment name, what packages you want in the environment, and where to pull them from.
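
As a sketch, a minimal environment.yml might look like this (the name and package list are illustrative; channels tells conda where to pull packages from):

name: project_name
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - ipykernel  # lets Jupyter recognize the environment (more on this below)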

venv environments

For Python developers specifically, one level more constrained is venv, which ships with Python's standard library. Here, only the Python interpreter and Python packages are packaged into an isolated software stack that no other projects should touch. venv is quite lightweight, though it is restricted to Python packages only.

venv environments depend only on having Python installed on your system (venv is part of the standard library), and conventionally use a requirements.txt file to declare the packages to install into the environment.
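
As a sketch, a typical venv workflow looks like this (the directory name .venv is just a common convention):

python -m venv .venv                 # create the environment in a .venv/ directory
source .venv/bin/activate            # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt      # install the packages listed in requirements.txt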

What happens if you don't use a custom environment?

Initially, you might not notice much. But over time, if all of your projects share the same environment, you might end up with conflicting package requirements that cannot be resolved.

I once thought that environments were super troublesome, and installed everything into the macOS system environment. That caused me a ton of trouble a few months later, when I was trying to update my compute stack. I ended up with conflicting packages and couldn't execute my code reliably, following which I decided to nuke the environment... and promptly caused iPhoto to crash on every launch. (Turns out iPhoto depends on macOS' system Python.)

It was then that I knew: I had to use environments. I chose conda because it was (and still is) the lightest-weight way to create very well-isolated environments on a single compute system.

Once I got the hang of it, every project got its own conda environment, no exceptions. Even this tutorial follows that practice.

Take it from someone who's learned his lesson the hard way: Use conda environments!

Creating conda environments

For those of you who did a manual setup, you've already got experience with this!

For others, the generic instructions are as follows. Firstly, cd to the project directory, and ensure that there is an environment.yml file in there. If you want the conda environment to be recognized by an identifiable name in Jupyter, then make sure that the environment spec contains ipykernel under the dependencies section.

Next, create the conda environment specified in the environment.yml file by running the following command at the terminal:

conda env create -f environment.yml
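
Once the environment is built, activate it so that subsequent commands run inside it (substituting your own environment's name):

conda activate environment_name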

To ensure that the conda environment is recognized by Jupyter, set it up with ipykernel:

export ENVIRONMENT_NAME=".........."  # put your environment's name here.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Custom Packages

If you didn't know already, it's totally possible for you to build custom packages of your own! I'm going to show you the simplest way of handling a custom source package here.

Assumed project structure

Firstly, let's assume that you have the following project structure:

./
|- docs/
|- notebooks/
|- README.md

Create custom source package

The first thing you'll want to do is make a src directory in your project root. (src is the lazy programmer's way of indicating "source".)

mkdir src 

Next, you'll want to navigate to your src/ directory and create a setup.py file.

cd src 
touch setup.py

Now, edit setup.py using your favourite text editor. You'll want the bare minimum to look like this:

"""Setup script."""
from setuptools import setup, find_packages

setup(
    # Give it any name you want, but I recommend it being short and easy to type.
    name="project_name",
    # Put a dummy version in place
    version="0.1",
    # Make sure all packages can be found.
    packages=find_packages(),
)

You don't need much else beyond that. Save the text file.

Populate your source code package

Now, you can start creating your library of functions in a package. Let's call the package project_name, which will make it consistent with the name="project_name" kwarg that you put in setup.py:

mkdir project_name
cd project_name 
touch __init__.py

The __init__.py is the "magic" file that enables find_packages() to discover that project_name/ is a Python package to be installed! Go ahead and edit __init__.py, and add in any kind of function you want. If this is your first time making a Python package, then follow me and add the following function inside __init__.py:

def hello():
    print("Hello!")

Now, your project directory should look something like this:

./
|- docs/
|- notebooks/
|- README.md
|- src/
   |- setup.py
   |- project_name/
      |- __init__.py

You might be wondering now: So yeah, I've defined the function there, but how am I ever going to use this function?

Install custom source package into environment

Well, now we're going to install the custom package inside your environment! I'm assuming that you've already activated your environment here, and that you have pip installed inside there (every conda environment should have it).

Firstly, navigate to the src/ directory. (The one that contains setup.py.)

Next, tell pip to install your custom package in "editable" mode.

pip install -e .

"Editable" mode will give you the ability to change the source on-the-fly, and have those changes reflected on every new invocation of Python!

Verify that the package is installed

That's it! If you want to verify that the package has been installed correctly, navigate to the project root directory, i.e. the one with notebooks/ and README.md in it. Then, run Python from there:

python 

Now, try importing the hello function.

>>> from project_name import hello
>>> hello()
"Hello!"

If you see "Hello!" printed, then everything is installed correctly.

Populate more source files!

You can add more source .py files in the package, to logically organize things. For example, if you have data loaders, you can add them to a project_name/data.py file, and they will then be accessible from the project_name.data namespace.
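
As a sketch, a hypothetical project_name/data.py might look like this (load_data and the CSV path are illustrative assumptions):

"""Data loaders for project_name."""
import pandas as pd


def load_data(path="data/raw.csv"):
    """Load a raw CSV file into a DataFrame."""
    return pd.read_csv(path)

In a notebook, you would then write: from project_name.data import load_data.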

What happens if you don't use a custom project source?

Well, you end up with notebooks that are populated with lots of functions, which might need to be copied and pasted from notebook to notebook... which will lead to confusion later as you try to untangle, "which exactly is the source-of-truth version of this function?"

Or you might resort to cloning entire notebooks, and suffixing them with _v1, _v2, or _some_other_purpose... still not ideal at all.

By instead using a custom source package, you get a single source of truth for custom functions that you might have written, which you can use from notebook to notebook. (Doing so also helps with testing your code, which I hear is the purpose of this tutorial!)

You might be tempted to use your Jupyter notebooks as a source file, say, using some of fast.ai's tooling. I'd encourage you to avoid doing that too, as the simple act of hand-curating the functions that you need nudges you to think very clearly about how your project should be structured. That discipline pays compound interest in time saved later on, especially if your project is a medium- to long-term one (on the order of months to years), or if you have a good hunch that it may make it to "production" (whatever that looks like for you). Your engineering colleagues will thank you for giving them a starting point that already includes a custom source library.