Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

Define project-wide constants inside your custom package

Why you would want to define project-wide constants

There are some "basic facts" about a project that you might want to be able to leverage project-wide. One example of this might be data source files (CSVs, Excel spreadsheets) that you might want convenient paths to (see: Use pyprojroot to define relative paths to the project root).

How do you define project-wide constants

Assuming you have a custom source package defined (see: Place custom source code inside a lightweight package), this is not difficult at all.

Ensure that you have a constants.py, or else something named sanely, and place all of your constants in there as variables. (Paths should probably go in a paths.py file.)

Then, import the constants (or paths) into your source project anywhere you need it!

Adhere to best git practices

Why adhere to best Git practices?

Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.

What best practices should we adhere to?

The most significant point to keep in mind: only commit to Git files that you have had to create manually. That usually means version controlling:

  1. Source code. See: Place custom source code inside a lightweight package)
  2. Configuration files. See:
    1. Create runtime environment variable configuration files for each of your projects
    2. Create configuration files for code checking tools
  3. Documentation. See: Write effective documentation for your projects

There are also things you should actively avoid committing.

For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on preventing yourself from committing them automatically.

For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It would be best if you cleared them out using nbstripout. If you need a way to automate the cleaning of cell outputs before committing Jupyter notebooks, consider setting up pre-commit hooks using pre-commit. (See: Set up pre-commit hooks to automate checks before making Git commits)

Install code checking tools to help write better code

Why install code checking tools?

If you're writing code, then having code checking tools that automatically check for potential issues while you write the code is like having a spell and grammar checker always on. It's also like having Grammarly check your spelling, grammar, and style all the time.

Code style can drift as a project proceeds. If you work with colleagues, code style nitpicks can become a source of frustration in interactions with them. Having code style checkers that automatically flag code style that deviates from a pre-defined norm can go a long way to easing these potential conflicts.

What kind of things should I check for?

Firstly, code style formatting tools. For Python projects:

  • Use black, the uncompromising code formatting tool and don't ask questions.
  • Use isort and don't ask questions. It will sort your imports for you. You'll love the magic, I guarantee it!

You can configure black and isort using a pyproject.toml config file.

Secondly, code problems. For Python projects:

  • Use interrogate to identify which functions don't have any docstrings attached to them.
  • Use pylint to find potential code errors, like dangling or unused variables, or variables used without declaring them beforehand.
  • If starting a new codebase, get MyPy in your project ASAP. This is also a great form of documentation, which is related to Write effective documentation for your projects.

You can configure some of these tools to work with Python source files in VSCode! (see: Use VSCode to help you with software development and collaboration) For example, you can configure VSCode to format your code using black on every save, so you don't have to keep running black before committing your code.

Many authors have written tons of enthusiastic blog posts on Python code style tooling. Here's a sampling of them:

Code style and code format tooling can be a rabbit hole that you run down. Those that I have listed above should give you a great starting point.

Set up an awesome default gitignore for your projects

Why setup a "gitignore" file?

There will be some files you'll never want to commit to Git. Some include:

  • Files that contain passwords and other secrets.
  • Files that contain runtime environment variables (which themselves might be secrets).
  • Large files, such as images and binaries, unless they are essential assets. (A rule of thumb is anything >500 kb is "large" by Git standards.)
  • Jupyter notebooks that contain outputs.
  • Data file directories. (see: Never commit data into version control repositories)

If you commit them, then:

  1. Secrets and other sensitive runtime information may linger in your repository and become exposed to the world.
  2. Your repository will explode in history as changes happen to the large binary files.

How do I set up an awesome "gitignore" file?

Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits you. If you want an awesome default one for Python:

cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python

It will have .env available in there too! (see: Create runtime environment variable configuration files for each of your projects)

How is a .gitignore file parsed?

A .gitignore file is parsed according to the rules on its documentation page. It essentially follows the unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented:

Example 1: Ignore all .DS_Store files

These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:

*.DS_Store

Example 2: Ignore all files under site/

If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory appending the following line:

site/

Example 3: Ignore all .ipynb_checkpoints directories

If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints.

.ipynb_checkpoints

Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.

Use pyprojroot to define relative paths to the project root

Why you should use pyprojroot

If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.

How do you use pyprojroot effectively

Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:

from pyprojroot import here

root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"

here() returns a Python pathlib.Path object.

You can go as granular or as coarse-grained as you want.

Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.

from source_package.paths import timeseries_data_dir
import pandas as pd

data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")

Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and you're able to reference the data file in all of your scripts!

See also: Define single sources of truth for your data sources.

Keep your notebooks organized with logical categories

In my experience, there are three types of notebooks that get written.

Prototyping notebooks go under notebooks/

These notebooks are drafting grounds for "production" code. We use Jupyter notebooks as an experimentation playground. (see: Use Jupyter as an experimentation playground). They do not need to be kept running reliably/reproducibly, and essentially are considered "disposable".

If you are collaborating with colleagues on a project, you can categorize notebooks by their primary author. For example, if I am working with Lily and Arkadij on a project, we can each get our own "user spaces" in there while agreeing not to touch each other's notebooks:

project/
- notebooks/
  - lily/     # lily's notebooks go here
  - arkadij/  # arkadij's notebooks go here
  - eric/     # eric's notebooks go here

Documentation notebooks go under docs/

These notebooks are written in the original spirit of Jupyter notebooks. They combine prose, code and code-generated figures. They contain a narrative, a data story. One may say they are "production", in that someone will read them and act on them. They need to be reliably executed from top-to-bottom, usually in a continuous integration system. (see: Build a continuous integration pipeline for your source) using MkDocs and mknotebooks.

For these notebooks, we might choose to keep them in the docs/ directory:

project/
- docs/
  - some_notebook.ipynb

Application notebooks go under app/

Sometimes you might opt to use voila to build front-end applications for those whom you serve. This is a convenient option because you don't have to jump out of a Jupyter context if you're already in there. These notebooks are considered "production" as well, however because they are code embedded in JSON, they are more difficult to diff with git.

For these notebooks, you probably want to keep them in a directory named app, where anything that becomes front-facing to the clients we serve are stored:

project/
- apps/
  - notebook_app.ipynb

Build a continuous integration pipeline for your source

What is a continuous integration pipeline

If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code). However, the execution of the tests still needs to be triggered by you.

A continuous integration (CI) pipeline solves that problem for you. When configured correctly, on every commit you make to your codebase, it will automatically:

  1. Build an environment that you configure
  2. Execute all tests associated with your source code inside that environment

You can think of a continuous integration pipeline as a programmable bot that runs commands that you've configured it to run, except it does so automatically on every single commit.

Why write a continuous integration pipeline

You can configure a CI pipeline to automatically run code checks, thus preventing you from breaking something that you previously wrote on which you also depend.

You can also configure a CI pipeline to continuously run analyses that are crucial to the project. You essentially feed the CI pipeline the commands needed to re-run analyses that are important and deposit the results in a location that you get to configure.

If you don't build a CI pipeline, then you'll miss out on the benefits of automatically having a bot check your work for breakages.

How to build a CI pipeline

There's a myriad of CI providers. Here are a few examples:

  • Travis CI
  • Azure Pipelines
  • GitHub Actions
  • CircleCI

Because of the myriad of options available, it'd be futile to give you a tutorial. Instead, I'll show you what's common between them.

Firstly, you begin by writing a configuration file that lists out all of the build steps. Typically it's a YAML file (Travis CI, Azure Pipelines, and GitHub Actions all use this), but sometimes you'll have other formats, such as a Jenkinsfile for Jenkins. This file is, by convention, usually placed in the root of your project repository, but you can also opt to put it in another location if that helps with file organization.

Most commonly, the build steps will be nothing more than bash commands. For example, in Travis CI, each build step in the YAML file is a bash command used to execute the pipeline. Sometimes, to take advantage of the user-friendly UI elements provided by the CI provider, you'll be asked to supply a slightly more complex YAML file. There, you can group build steps into logical higher-order steps and provide human-readable descriptions for them; these get paired with a web UI that lets you easily debug a step when something goes wrong.

Secondly, there'll be a website (also known as a "control plane" in cloud jargon) where you go to configure the continuous integration bot. There, you'll typically configure:

  1. The location of the Git repository
  2. The exact configuration file(s) that contains the build steps.

If your company has set up internal systems slightly differently, you'll probably have to ask your IT department's DevOps team for help to accomplish your task. Ask nicely; they invest tons of time building out something usable, but sometimes the data scientist's level of expertise with these systems, which is usually beginner, is out of their radars.

Follow the rule of one-to-one in managing your projects

What is this rule all about

The one-to-one rule essentially means this. Each project that we work on gets:

In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).

Why is this important

Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.

Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.

When can we break this rule

A few guidelines can help you decide.

When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.

When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.

One project should get one git repository

Why one project should get one Git repository

This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:

In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.

How to get this implemented

Easy! Create your Git repo for the project, and then start putting stuff in there :).

Enough said here!

What should you name the Git repo? See the page: Sanely name things consistently

After you have set up your Git repo, make sure to Set up your project with a sane directory structure.

Also, Set up an awesome default gitignore for your projects!

Write effective documentation for your projects

Why write documentation

As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.

Useful documentation helps you quickly onboard collaborators to the project. By reading your documentation, you will help them get oriented and know how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.

How do you write useful documentation

There are four primary types of documentation, which you can leverage at different points in time.

The first is custom source code docstrings. We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.

The second is how-to guides for newcomers to the project. These guides help your newcomers get up to speed on the commands needed to set up a local development environment for their project. Essentially the sequence of terminal incantations that they would need to type to start hacking on the project. As always, for non-obvious steps, always document the why!

The third involves crafting and telling a story of the project's progress. For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your progress will be granular, but your progress will gain momentum as the project progresses. Doing this is important because the act of reflecting on prior work, summarizing, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.

The final one is the project README! The README usually exists as README.md or README.txt in the project root directory and serves a few purposes:

  1. Giving an overview of why the project exists.
  2. Providing an overview of the "rules of engagement" with the project.
  3. Serving up a "Quickstart" or "Installation" section to guide users on how to get set up.
  4. Showing an example of what they can do with the project.

What tools should we use to write documentation?

On this matter, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.

Sphinx

At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.

MkDocs

However, if you're not already familiar with Sphinx, I would recommend getting started using MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDoc's most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of super excellent features that easily enhance MkDocs!)

What principles should we keep in mind when writing docs?

Single source of truth

Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.

Write to the audience

Secondly, you'll want to pick from several styles of writing. One effective way is to think of it in terms of answering critical questions for a project. An example list of questions that commonly show up in data projects mirror that of a scientific research paper and include (but are not limited to):

  • What question does this project answer? What problem are you solving through this project? What is the bigger context of this project?
  • What are the data backing the project, and from where do they come? Where is the data description? (see also: Write data descriptor files for your data sources)
  • What methods were used in the project?
  • What key insights should be gained from this project?

If your project also encompasses a tool that helps routinize the project in a production setting:

  • What is the deployment strategy for the project? What pre-requisites are needed before we can "deploy" the project?
  • What code/commands need to be executed at the command line/REPL/Jupyter notebook to use the tools built in this project?
  • What are the tools available for the visualization of model results, and how ought they be interpreted?

As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.

Use semantic line breaks

Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).

Resources

I strongly recommend reading the Write The Docs guide to writing technical documentation.

Additionally, Admond Lee has additional reasons for writing documentation.

Configuration file overview

What configuration files go with which tools?

The following table should help with disambiguating the question.

Tool Configuration File Version Control?
black pyproject.toml
isort pyproject.toml
interrogate pyproject.toml
pylint pyproject.toml
conda environment.yml
VSCode .vscode/settings.json ⛔️
pip requirements.txt
MkDocs mkdocs.yml
git .gitignore

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and provide additional sub-categories inside there.

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

  • If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
  • If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is project root directory. This will simplify how you execute the scripts. You'll save on injecting "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:

build:
    bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get setup by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME = ______________  # replace with your env name
conda env create -f environment.yml || mamba env create -f $PROJECT_ENV_NAME
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

  1. Create the conda environment. (see: Create one conda environment per project)
  2. Install the custom source
  3. Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
  4. Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independent of computationally cheap ones that depend on them. One example, provided by one of my reviewers Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.

Write tests that test your custom code

Why write tests for your code

Writing tests for your code is a great practice. If you depend on a chunk of code, you should write tests for it.

As you develop a codebase, you might inadvertently modify an existing piece of code on which your project depends. This modification will break other analyses that rely on that piece of code. Writing tests that get automatically executed on every commit (see: Build a continuous integration pipeline for your source) will help you catch these changes before you merge them into your codebase.

How do you write tests

I could write a full-fledged testing tutorial, but because the intent here is to provide you with the "why"s followed by a quick guide, I would recommend reading an essay I wrote on this.

The general pattern to look out for is that:

  1. You first write a function that doesn't merely wrap another function but does a single unit of substantial work.
  2. Then, you test the function using examples that you provide to the test runner, i.e. the program that automatically finds all tests to run and then executes them.

In terms of test runners, I find pytest to be the fastest to get up and running with; through experience, I have also found it well-equipped to grow in complexity if my codebase necessitates it.

Get bootstrapped on your data science projects

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. These are practices that, through years of practice in developing data projects and open source software, I have come to see the importance of.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on my seven years (as of 2020) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps, starting at configuring your laptop or workstation for data science development up to some practices that help you organize your projects, regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. That's available on the eBook version on LeanPub. If you purchase a copy (when it's released), you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make. Including a free copy of the eBook, which has bonus content in there!(Special thanks goes to my Supporters!)

Finally, if you have a question regarding the content, please feel free to reach out on Shortwhale. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

index

Why this knowledge base exists

I'm super glad you made it to my knowledge base on bootstrapping your data science machine - otherwise known as getting set up for success with great organization and sane structure. The content inside here has been battle-tested through real-world experience with colleagues and others skilled in their computing domains, but a bit new to the modern tooling offered to us in the data science world.

This knowledge base exists because I want to encourage more data scientists to adopt sane practices and conventions that promote collaboration and reproducibility in our data work. These are practices that, through years of practice in developing data projects and open source software, I have come to see the importance of.

Where I think you, the reader, are coming from

The most important thing I'm assuming about you, the reader, is that you have experienced the same challenges I encountered when structure and workflow were absent from my work. I wrote down this knowledge base for your benefit. Based on my seven years (as of 2020) of continual refinement, you'll learn how to:

  1. structure your computer for data analysis projects, and
  2. structure a data analysis project for maximum effectiveness.

Because I'm a Pythonista who uses Jupyter and VSCode, some tips are specific to the language and these tools. However, being a Python programmer isn't a hard requirement. More than the specifics, I hope this knowledge base imparts to you a particular philosophy of how to work. That philosophy should be portable across languages and tooling, though having specific tooling can sometimes help you adhere to the philosophy. To read more about the philosophies behind this knowledge base, check out the page: The philosophies that ground the bootstrap.

For the beginner

As you grow in your knowledge and skillsets, this knowledge base should help you keep an eye out for critical topics you might want to learn.

For the moderately experienced

If you're looking to refine your skillsets, this knowledge graph should give you the base from which you dive deeper into specific tools.

For the seasoned data scientist

If you're a seasoned data practitioner, this guide should be able to serve you the way it helps me: as a cookbook/recipe guide to remind you of things when you forget them.

Things you'll learn

The things you'll learn here cover the first steps, starting at configuring your laptop or workstation for data science development up to some practices that help you organize your projects, regardless of where you do your computing.

I have a recommended order below, based on my experience with colleagues and other newcomers to a project:

  1. Configure your machine
  2. Get prepped per project
  3. Navigate the packaging world
  4. Handling data
  5. Choose and customize your development environment

However, you may wish to follow the guide differently and not read it in the way I prescribed above. That's not a problem! The online version is intentionally structured as a knowledge graph and not a book so that you can explore it on your own terms.

Apply these ideas just-in-time

As you go through this content, I would also encourage you to keep in mind: Time will distill the best practices in your context. Don't feel pressured to apply every single thing you see here to your project. Incrementally adopt these practices as they make sense. They're all quite composable with one another.

Not everything written here is applicable to every single project. Indeed, rarely do I use 100% of everything I've written here. Sometimes, my projects end up being more software tool development oriented, and hence I use a lot of the software-oriented ideas. Sometimes my projects are one-off, and so I ease off on the reproducibility aspect. Most of the time, my projects require a lot of exploratory work beyond simple exploratory data analysis, and imposing structure early on can be stifling for the project.

So rather than see this collection of notes as something that we must impose on every project, I would encourage you to be picky and choosy, and use only what helps you, in a just-in-time fashion, to increase your effectiveness in a project. Just-in-time adoption of a practice or a tool is preferable, because doing so eases the pressure to be rigidly complete from the get-go. In my own work, I incorporate a practice into the project just-in-time as I sense the need for it.

Moreover, as my colleague Zachary Barry would say, none of these practices can be mastered overnight. It takes running into walls to appreciate why these practices are important. For an individual who has not yet encountered problems with disorganized code, multiple versions of the same dataset, and other issues I describe here, it is difficult to deeply appreciate why it matters to apply simple and basic software development practices to your data science work. So I would encourage you to use this knowledge base as a reference tool that helps you find out, in a just-in-time fashion, a practice or tool that helps you solve a problem.

Ways to support the project

If you wish to support the project, there are a few ways:

Firstly, I spent some time linearizing this content based on my experience guiding skilled newcomers to the DS world. That's available on the eBook version on LeanPub. If you purchase a copy (when it's released), you will get instructions to access the repository that houses the book source and automation to bootstrap each of your Python data science projects easily!

Secondly, you can support my data science education work on Patreon! My supporters get early access to the data science content that I make. Including a free copy of the eBook, which has bonus content in there!(Special thanks goes to my Supporters!)

Finally, if you have a question regarding the content, please feel free to reach out on Shortwhale. (If I make substantial edits on the basis of your comments or questions, I might reach out to you to offer a free copy of the eBook!)

Create runtime environment variable configuration files for each of your projects

Why configure environment variables per project

When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)

How to configure environment variables for your project

Here, I'm assuming that you follow the practice of One project should get one git repository and that you Use pyprojroot to define relative paths to the project root.

To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as such:

export ENV_VAR_1 = "some_value"
export DATABASE_CONNECTION_STRING = "some_database_connection_string"
export ENV_VAR_3 = "some_other_value"

We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.

Now, if you're using a Python project, make sure you have the package python-dotenv (Github repo here) installed in the conda environment. Then, in your Python .py source files:

from dotenv import load_dotenv
from pyprojroot import here
import os

dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path)  # this will load the .env file in your project directory root.

# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")

In this way, your runtime environment variables get loaded into the runtime environment, and become available to all child processes started from within the shell (e.g. Jupyter Lab, or Python, etc.).

Always gitignore your .env file

Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.

See also: Set up an awesome default gitignore for your projects

Set up your project with a sane directory structure

Why setup your project with a sane directory structure

Doing so will help you quickly and easily find things. This is crucial when navigating your data project. If you don't do so, you will likely end up being utterly confused as to where things are located.

What does a sane directory look like

I am going to show you one particular example, but you can adapt it to however you like.

|- informative-project-name-here/
   |- data/          # never add anything here into source control
   |- notebooks/     # divide by usernames if needed
   |- scripts/       # basically for automation
   |- src/           # custom source code
      |- importable_name/
         |- __init__.py
         |-...
      |- tests/      # test suite
   |- README.md
   |-...

The purpose of each directory is annotated in each line. That said, you can find relevant information in the following pages:

Create configuration files for code checking tools

Why configure code checking tools using configuration files?

Configuration files give you the ability to declare your project's preferred configuration and distribute it to all participants in the project. It smooths out the practice of data science (and software development too), as these configuration represent the declared normative state of a project, answering questions such as:

  • What code style rules ought we adhere to?
  • How much docstring coverage ought to be present?
  • What software tests should be run?

Without these configuration files declaring how code checkers ought to behave, we leave it up to collaborators and contributors to manually configure their local systems, and without sufficient documentation, they may bug you over and over on how things ought to be configured. This increase in friction will inevitably lead to an increase in frustration with the project, and hence a decrease in engagement.

As such, you can think of these configuration files as part of your automation toolkit, thus satisfying the automation philosophy.

What configuration files belong with which code checking tools?

Because configuration files are so crucial to a project, I have collated them together on the Configuration file overview page.

When do I create these configuration files?

As always, just-in-time, at the moment that you need it.

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As you as you did that, you created two sources of truth for that one function.

Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project src/ directory, ensure you have a few files:

|- src/
   |- setup.py
   |- source_package/  # rename this to the same name as the conda environment
      |- data/         # for all data-related functions
         |- loaders.py # convenience functions for loading data
         |- schemas.py # this is for pandera schemas
      |- __init__.py   # this is necessary
      |- paths.py      # this is for path definitionsme
      |- utils.py      # utiity functions that you might need
      |- ...
   |- tests/
      |- test_utils.py # tests for utility functions
      |- ...

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

setup.py should look like this:

import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="source_package", # Replace with your environment name
    version="0.1",         # Replace with anything that you need
    packages=setuptools.find_packages(),
)

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
cd src
pip install -e .

This will install the source package in development mode. As you continue to add more code into the custom source package, they will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried to organize a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, and you have to put out figures. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.8
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering over other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver, though as far as possible, staying consistent is preferable, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin the version of that particular package to be at the minimum that version. (See below for the syntax needed to pin a version.) At the same time, the new package version may break compatibility -- in this case, you will want to pin it to a maximum package version.

It's not always obvious, though, so be sure to use version control

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

  • For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
  • For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

  • Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
  • Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
  • Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on breaking API changes from older versions.

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to do upgrade packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!

Ensure your environment kernels are available to Jupyter

By practicing "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable, then you'll need to ensure that it's available. To do so, ensure that the Python environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Sanely name things consistently

Why should you name things consistently

Think about the following scenario:

  • Your project is called sales-forecast
  • It lives in a Git repository hosted on GitHub called forecast-2020
  • Your conda environment is named something you copied and pasted from a tutorial, say my_env
  • Your custom source code is named my_source.

Are you going to be able to ever mentally map them to one another? Probably not, though maybe if you did put in the effort to do so, you might be able to. That said, if you work with someone else on the project, you're only going to increase the amount of mental work they need to do to keep things straight.

Now, consider a different scenario:

  • Your project is called Sales Forecast 2020
  • Your Git repository is called sales-forecast-2020
  • Your conda environment is called sales-forecast-2020-env
  • And your custom source code package is called sales_forecast_2020.

Does the latter seem saner? I think so too :).

What constitutes a "sane" name?

I think the following guidelines help:

  1. 2 words are preferred, 3 words are okay, 4 is bordering on verbose; 5 or more words is not really acceptable.
  2. Explicit, precise, and well-defined for a "local" scope, where "local" depends on your definition.

I would add that learning how to name things precisely in English, and hence provide precise variable names in Python code, is a great way for English second language speakers to practice and expand their language vocabulary.

As one of my reviewers (Logan Thomas) pointed out, leveraging the name to help newcomers distinguish between entities is helpful too. For this reason, your environment can be suffixed with a consistent noun; for example, I have -dev as a suffix to make for software package-oriented projects; above, we used -env as a suffix (making sales-forecast-2020-env) to indicate to a newcomer that we're activating an environment when we conda activate sales-forecast-2020-env. As long as you're consistent, that's not a problem!

Set up pre-commit hooks to automate checks before making git commits

Why use pre-commit hooks?

One way to prevent yourself from committing code that is not properly checked is to use pre-commit hooks. This is a feature of Git that allows you to automatically run checks before they are committed to the repository history. Because they are automatically run, you set them up once, usually when you first download the repository, and no longer need to think about them again.

How do I set up pre-commit hooks?

You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run.

The gist of the installation steps are in the bash commands below, but you should read the website for a fuller understanding.

conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml

Now, go and edit .pre-commit-config.yaml -- add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:

pre-commit install
pre-commit run --all-files   # run the checks against all of your files

What pre-commit hooks are good to install?

a.k.a. what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter things that I usually include, taken from my Network Analysis Made Simple repository.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"

nbstripout is a super important one -- it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)

How does this relate to continuous integration pipeline checks?

(For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.)

CI pipeline checks are also a form of automated checks that you can put into your workflow. Ideally, everything that is checked for in your pre-commit hooks should be checked for in your CI pipeline.

So what's the difference, then? Here's my thoughts on this:

In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but also execute very quickly. Things like code style checks, for example, or those that ensure there are only single trailing lines in text files.

In the CI system, you run those checks in addition to the longer-running test suite. (see: Write tests that test your custom code). So the CI system behaves as a backup to the pre-commit hooks.