Data Science Bootstrap

Install code checking tools to help write better code

Why install code checking tools?

If you're writing code, then having code checking tools that automatically check for potential issues while you write the code is like having a spell and grammar checker always on. It's also like having Grammarly check your spelling, grammar, and style all the time.

Code style can drift as a project proceeds. If you work with colleagues, code style nitpicks can become a source of frustration in interactions with them. Having code style checkers that automatically flag code style that deviates from a pre-defined norm can go a long way to easing these potential conflicts.

What kind of things should I check for?

Firstly, code style formatting tools. For Python projects:

Use black, the uncompromising code formatting tool and don't ask questions.
Use isort and don't ask questions. It will sort your imports for you. You'll love the magic, I guarantee it!

You can configure black and isort using a pyproject.toml config file.

Secondly, code problems. For Python projects:

Use interrogate to identify which functions don't have any docstrings attached to them.
Use pylint to find potential code errors, like dangling or unused variables, or variables used without declaring them beforehand.
If starting a new codebase, get MyPy in your project ASAP. This is also a great form of documentation, which is related to Write effective documentation for your projects.

You can configure some of these tools to work with Python source files in VSCode! (see: Use VSCode to help you with software development and collaboration) For example, you can configure VSCode to format your code using black on every save, so you don't have to keep running black before committing your code.

Many authors have written tons of enthusiastic blog posts on Python code style tooling. Here's a sampling of them:

My essay on code formatting
My colleague Richard Quinn has a terrific essay on his favourite Python code style/format tools
Tips from Itamar Turner-Trauring on using pylint correctly

Code style and code format tooling can be a rabbit hole that you run down. Those that I have listed above should give you a great starting point.

Pages that link here

Create one conda environment per project
Why use one conda environment per project If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things

Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, right when you need to put in new versions of figures that were made before. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project-name-goes-here  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering over other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver, though as far as possible, staying consistent is preferable, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin the version of that particular package to be at the minimum that version. (See below for the syntax needed to pin a version.) At the same time, the new package version may break compatibility -- in this case, you will want to pin it to a maximum package version.

It's not always obvious, though, so be sure to use version control

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on breaking API changes from older versions.

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to do upgrade packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!

Ensure your environment kernels are available to Jupyter

By practicing "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable, then you'll need to ensure that it's available. To do so, ensure that the Python environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Write effective documentation for your projects

Why write documentation

As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.

Useful documentation helps you quickly onboard collaborators to the project. By reading your documentation, you will help them get oriented and know how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.

How do you write useful documentation

To write effective documentation, we first need to recognize that there are actually four types of documentation. They are, respectively:

Tutorials
How-to Guides
Explanations
References

This is not a new concept, it is actually well-documented (ahem!) in the Diataxis Framework.

Concretely, here are some kinds of documentation that you will want to focus on.

The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.

The second is how-to guides for newcomers to the project (obviously under the How-to Guides category). These guides help your newcomers get up to speed on the commands needed to set up a local development environment for their project. Essentially the sequence of terminal incantations that they would need to type to start hacking on the project. As always, for non-obvious steps, always document the why!

The third involves crafting and telling a story of the project's progress. (We may consider this to be an Explanation-style documentation). For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your progress will be granular, but your progress will gain momentum as the project progresses. Doing this is important because the act of reflecting on prior work, summarizing, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.

The final one is the project README! The README usually exists as README.md or README.txt in the project root directory and serves a few purposes:

Giving an overview of why the project exists.
Providing an overview of the "rules of engagement" with the project.
Serving up a "Quickstart" or "Installation" section to guide users on how to get set up.
Showing an example of what they can do with the project.

The README file usually serves dual-purposes, both as a quick Tutorial and How-to Guide.

What tools should we use to write documentation?

On this matter, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.

Sphinx

At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.

MkDocs

However, if you're not already familiar with Sphinx, I would recommend getting started using MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDoc's most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of super excellent features that easily enhance MkDocs!)

What principles should we keep in mind when writing docs?

Single source of truth

Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.

Write to the audience

Secondly, you'll want to pick from several styles of writing. One effective way is to think of it in terms of answering critical questions for a project. An example list of questions that commonly show up in data projects mirror that of a scientific research paper and include (but are not limited to):

What question does this project answer? What problem are you solving through this project? What is the bigger context of this project?
What are the data backing the project, and from where do they come? Where is the data description? (see also: Write data descriptor files for your data sources)
What methods were used in the project?
What key insights should be gained from this project?

If your project also encompasses a tool that helps routinize the project in a production setting:

What is the deployment strategy for the project? What pre-requisites are needed before we can "deploy" the project?
What code/commands need to be executed at the command line/REPL/Jupyter notebook to use the tools built in this project?
What are the tools available for the visualization of model results, and how ought they be interpreted?

As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.

Use semantic line breaks

Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).

Resources

I strongly recommend reading the Write The Docs guide to writing technical documentation.

Additionally, Admond Lee has additional reasons for writing documentation.