Follow the rule of one-to-one in managing your projects
The one-to-one rule essentially means this: each project that we work on gets exactly one project directory, one Git repository, one conda environment, and one custom source package.
In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).
Conventions act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.
Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.
A few guidelines can help you decide when it's time to deviate from the one-to-one rule.
When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.
When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Create one conda environment per project
If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.
"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences: code in one project stops working even as the same code in another project carries on fine, and you find out one day before an important presentation, right when you need to produce updated versions of figures you made earlier. The horror!
You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.
Here is a baseline that you can copy and modify at any time.
name: project-name-goes-here ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels: ## Add any other channels below if necessary
- conda-forge
dependencies: ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip: ## Add in pip packages if necessary
- mkdocs
- mkdocs-material
- mkdocstrings
- mknotebooks
If a package exists in both conda-forge and pip, and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without pip clobbering other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver too; even so, stay consistent as far as possible, though mixing and matching is alright if you know what you're doing.)
This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.
Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.
However, there may come a time when a new package version brings a new capability that you need. That is when you may wish to pin that package to, at minimum, that version. (See below for the syntax needed to pin a version.) At the same time, a new package version may break compatibility with your existing code; in that case, you will want to pin it to a maximum version instead. It's not always obvious ahead of time which situation you will face, though, so be sure to keep environment.yml under version control so that you can roll back if an upgrade breaks something.
If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

- For conda, the version modifiers are >, >=, =, <=, and <. (You should be able to grok which is which!)
- For pip, the version modifiers are >, >=, ==, <=, and <. (Note: for pip, it is a double equals == and not a single equals =.)

So when do you use each of the modifiers?

- Use = / == sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages alongside it.
- Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
- Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on API changes introduced in newer versions.

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to upgrade packages that I have found:
The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:
conda env update -f environment.yml
The hacky way to do the upgrade is to directly conda install or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!
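Putting the version modifiers together, here is a sketch of what pinned dependencies can look like in environment.yml (the packages and version numbers are purely illustrative):
dependencies:
  - python=3.10  # pin to the 3.10 series
  - pandas>=2.0  # suppose we rely on an API introduced in 2.0
  - matplotlib<3.8  # suppose a hypothetical breaking change arrived in 3.8
  - pip:
      - mkdocs==1.5  # note the double equals for pip packages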
Because we practice "one project gets one environment", ensuring that each environment's Python interpreter is available to Jupyter becomes crucial. If your project environment's Python interpreter doesn't show up as an option in Jupyter, you'll need to register it. To do so, make sure the environment has the ipykernel package installed. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:
# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME
Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)
Now, how should you name your conda environment? See the page: Sanely name things consistently!
Sanely name things consistently
Think about the following scenario:

- your project is named sales-forecast,
- your Git repository is named forecast-2020,
- your conda environment is named my_env,
- and your custom source package is named my_source.

Are you ever going to be able to mentally map them onto one another? Probably not, or at least only with considerable, sustained effort. And if you work with someone else on the project, you're only going to increase the amount of mental work they need to do to keep things straight.
Now, consider a different scenario:

- your project is named "Sales Forecast 2020",
- your Git repository is named sales-forecast-2020,
- your conda environment is named sales-forecast-2020-env,
- and your custom source package is named sales_forecast_2020.

Does the latter seem saner? I think so too :).
I think the following guidelines help:
I would add that learning how to name things precisely in English, and hence provide precise variable names in Python code, is a great way for English second language speakers to practice and expand their language vocabulary.
As one of my reviewers (Logan Thomas) pointed out, leveraging the name to help newcomers distinguish between entities is helpful too. For this reason, your environment name can carry a consistent suffix: for example, I use -dev as a suffix for software package-oriented projects, and above we used -env as a suffix (making sales-forecast-2020-env) to signal to a newcomer that they are activating an environment when they run conda activate sales-forecast-2020-env. As long as you're consistent, that's not a problem!
Configure VSCode for maximum productivity
VSCode has a built-in settings editor that you can access using Ctrl/Cmd followed by , (comma).
At the same time, VSCode is extensible using its Marketplace of extensions.
Save some keystrokes by configuring VSCode to autosave your files after 10 ms. This is useful if your workflow doesn't involve running a live server that reloads code on every save.
When you explicitly command VSCode to save your file, it can run code formatters (such as black) on the file being saved.
This option inserts a final newline into plain text files, which makes viewing them in the terminal much easier.
This one cleans out trailing whitespace. Again, it makes files a bit easier to view in the terminal.
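For reference, here is a sketch of how these options can be expressed in VSCode's settings.json (the values are illustrative; tune them to taste):
{
    // save files automatically after a short delay
    "files.autoSave": "afterDelay",
    "files.autoSaveDelay": 10,
    // run formatters (e.g. black, via the Python tooling) on explicit saves
    "editor.formatOnSave": true,
    // keep plain-text files terminal-friendly
    "files.insertFinalNewline": true,
    "files.trimTrailingWhitespace": true
}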
The official Python extension for VSCode contains a ton of goodies that are configurable in the settings. It can load conda environments in the shell directly, automatically lint source files that are open, debug running programs, automatically import resolvable functions and modules, and provide many, many more goodies for Python developers and data scientists.
In fact, you can write a .py file and execute it interactively as if it were a Jupyter notebook, without the overhead of the Jupyter front-end interface. Steven Mortimer even has an R-bloggers post on how to configure VSCode to behave like RStudio for Python, for those who prefer the RStudio interface!
With the Python extension being as powerful as it is, the Pylance extension supercharges it even more. Using an extremely performant code analyzer, it highlights nearly all of the problems it detects within a source .py file within a few hundred milliseconds (at most) of the file being saved. When I have had to do tool development as part of my data science work, Pylance has been essential for my productivity.
One good practice we have written about here is to Follow the rule of one-to-one in managing your projects. When one project has one directory, that directory acts as a "workspace", which in turn gets opened in one VSCode window. If you have 3-4 windows open, then figuring out which one corresponds to which project can take a few seconds.
This extension gives you superpowers. It will enable you to edit code on a remote server while still having all of the goodies of a locally-running VSCode session. Leverage this extension to help you develop on a powerful remote machine without ever leaving your local editor.
This extension highlights indentation in different colours, making it easier to read Python (and other indentation-friendly languages') source files.
Rainbow CSV highlights the columns in a CSV file when you open it up in VSCode, making it easier to view your data. This one was suggested by one of my book reviewers Simon Eng.
This extension checks your Markdown files for formatting issues, such as headers containing punctuation or missing line breaks after a header. These checks are based on the Node.js markdownlint package by David Anson.
Also suggested by Simon, Markdown Table Prettifier helps you format your Markdown tables such that they are easily readable in plain text mode.
Polacode is "polaroid for code", giving you the ability to create "screenshots" of your code to share with others. If you've ever used carbon.now.sh, you'll enjoy this one.
Place custom source code inside a lightweight package
Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?
As soon as you did that, you created two sources of truth for that one function.
Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.
A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.
Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.
Inside that src/ directory, ensure you have the following files and directories:
|- project_name/ # should be the same name as the conda environment
|- data/ # for all data-related functions
|- loaders.py # convenience functions for loading data
|- schemas.py # this is for pandera schemas
|- __init__.py # this is necessary
|- paths.py # this is for path definitions
|- utils.py # utility functions that you might need
|- ...
|- tests/
|- test_utils.py # tests for utility functions
|- ...
|- pyproject.toml # replaces setup.py
If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)
If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root.
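As a quick sketch (the subdirectory names here are illustrative), paths.py can be as simple as:
# paths.py: a single place to define important project paths.
from pyprojroot import here

ROOT = here()  # walks upward from the current working directory to locate the project root
DATA_DIR = ROOT / "data"  # illustrative subdirectory
NOTEBOOKS_DIR = ROOT / "notebooks"  # illustrative subdirectory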
Your pyproject.toml should look like this:
[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
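Depending on your packaging setup, you may also want to declare a build backend explicitly; a common (though not the only) choice is setuptools:
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"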
Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:
conda activate project_environment
pip install -e .
This will install the source package in development mode. As you continue to add more code to the custom source package, it will be instantly available to you project-wide.
Now, in your projects, you can import anything from the custom source package.
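For example, assuming the illustrative layout above (the specific function names here are hypothetical):
# from any notebook or script in the project,
# after running `pip install -e .` in the active environment:
from project_name.data.loaders import load_raw_data  # hypothetical function in loaders.py
from project_name.paths import ROOT  # path constant, as sketched earlier
from project_name.utils import clean_column_names  # hypothetical helper in utils.py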
Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried a few ways of organizing things, though, I think having src/ is better for DS projects than having the setup.py file and the source_package/ directory sit in the top-level project directory. Keeping those two inside src/ isolates them from the rest of the project and eliminates clutter from the top-level directory.
How often should you update the custom source package? As often as you need it!
Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!
It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!
Build a continuous integration pipeline for your source
If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code). However, the execution of the tests still needs to be triggered by you.
A continuous integration (CI) pipeline solves that problem for you. When configured correctly, on every commit you make to your codebase, it will automatically:
You can think of a continuous integration pipeline as a programmable bot that runs commands that you've configured it to run, except it does so automatically on every single commit.
You can configure a CI pipeline to automatically run code checks, thus preventing you from breaking something that you previously wrote on which you also depend.
You can also configure a CI pipeline to continuously run analyses that are crucial to the project. You essentially feed the CI pipeline the commands needed to re-run analyses that are important and deposit the results in a location that you get to configure.
If you don't build a CI pipeline, then you'll miss out on the benefits of automatically having a bot check your work for breakages.
There's a myriad of CI providers. Here are a few examples:
Because of the myriad of options available, it'd be futile to give you a tutorial on each one. Instead, I'll show you what's common between them.
Firstly, you begin by writing a configuration file that lists out all of the build steps. Typically it's a YAML file (Travis CI, Azure Pipelines, and GitHub Actions all use this), but sometimes you'll have other formats, such as a Jenkinsfile for Jenkins. This file is, by convention, usually placed in the root of your project repository, but you can also opt to put it in another location if that helps with file organization.
Most commonly, the build steps will be nothing more than bash commands. For example, in Travis CI, each build step in the YAML file is a bash command used to execute the pipeline. Sometimes, to take advantage of the user-friendly UI elements provided by the CI provider, you'll be asked to supply a slightly more complex YAML file. There, you can group build steps into logical higher-order steps and provide human-readable descriptions for them; these get paired with a web UI that lets you easily debug a step when something goes wrong.
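For example, here is a minimal sketch of such a configuration file for GitHub Actions (the workflow name, branch, and commands are illustrative, not prescriptive):
# .github/workflows/ci.yaml
name: CI
on:
  push:
    branches: [main]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install the project and run its tests
        run: |
          pip install -e . pytest
          pytest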
Secondly, there'll be a website (sometimes called a "control plane" in cloud jargon) where you go to configure the continuous integration bot. There, you'll typically configure:
If your company has set up its internal systems slightly differently, you'll probably have to ask your IT department's DevOps team for help to accomplish your task. Ask nicely; they invest tons of time building out something usable, but the fact that most data scientists are beginners with these systems can sometimes fall off their radar.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the outset, with docs in one place that drift out of sync with the source code in another place... you get where we're going with this point.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
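For example, reusing the naming from earlier (the project name is illustrative):
# create the repository and make a first commit
mkdir sales-forecast-2020
cd sales-forecast-2020
git init
echo "# Sales Forecast 2020" > README.md  # seed the repo with a README
git add README.md
git commit -m "Initial commit"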
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.
Also, Set up an awesome default gitignore for your projects!
Write effective documentation for your projects
As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.
Useful documentation helps you quickly onboard collaborators to the project. By reading your documentation, you will help them get oriented and know how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.
To write effective documentation, we first need to recognize that there are actually four types of documentation. They are, respectively: tutorials, how-to guides, explanations, and reference material.
This is not a new concept; it is actually well-documented (ahem!) in the Diataxis Framework.
Concretely, here are some kinds of documentation that you will want to focus on.
The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.
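For example, here is a sketch of the kind of docstring I mean (the function and column names are hypothetical):
def impute_missing_prices(df, method="median"):
    """Impute missing unit prices before forecasting.

    Why this exists: rows with missing prices were silently dropped
    by downstream aggregations, which skewed revenue totals;
    imputing keeps those rows in play.

    :param df: Sales records containing a 'unit_price' column.
    :param method: Imputation strategy; 'median' is robust to outliers.
    :returns: A copy of df with 'unit_price' filled in.
    """
    ...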
The second is how-to guides for newcomers to the project (obviously under the How-to Guides category). These guides help newcomers get up to speed on the commands needed to set up a local development environment for the project, which is essentially the sequence of terminal incantations they need to type to start hacking on it. And for non-obvious steps, always document the why!
The third involves crafting and telling a story of the project's progress. (We may consider this to be Explanation-style documentation.) For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your updates will be granular, but they will gain momentum as the project progresses. Doing this is important because the act of reflecting on prior work, summarizing it, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.
The final one is the project README! The README usually exists as README.md or README.txt in the project root directory and serves a few purposes:
The README file usually serves dual purposes, both as a quick Tutorial and a How-to Guide.
On the matter of documentation tooling, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.
At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.
However, if you're not already familiar with Sphinx, I would recommend getting started using MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDocs's most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of super excellent features that easily enhance MkDocs!)
Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.
Secondly, you'll want to pick from several styles of writing. One effective way is to think of it in terms of answering critical questions for a project. The questions that commonly show up in data projects mirror those of a scientific research paper and include (but are not limited to):
If your project also encompasses a tool that helps routinize the project in a production setting:
As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.
Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).
I strongly recommend reading the Write The Docs guide to writing technical documentation.
Additionally, Admond Lee has additional reasons for writing documentation.
Install Anaconda on your machine
Anaconda is a way to get Python installed on your system.
One of the neat but oftentimes confusing things about Python is that you can have multiple Python executables living around on your system. Anaconda makes it easy for you to:
Why is this a good thing? Primarily because you might have individual projects that need different versions of Python and different versions of the packages built for Python. Also, default Python installations, such as the ones shipped with older versions of macOS, tend to be several versions behind the latest, to the detriment of your projects. Moreover, some built-in apps in an operating system may depend on that old version of Python (such as iPhoto), which means that if you mess up the installation, you might break those built-in apps. Hence, you will want a tool that lets you easily create isolated Python environments.
The Anaconda Python distribution fulfills the following key needs:
Installing Anaconda on your local machine thus helps you get easy access to Python, Jupyter (see: Use Jupyter as an experimentation playground), and other tools for modelling and analysis.
To install the Miniforge variant of Anaconda, which is lighter-weight than the full Anaconda distribution, use the following command:
cd ~
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -O anaconda.sh
This will send you to your home directory and then download the Miniforge bash installer from the Miniforge GitHub releases page, saving it as anaconda.sh.
Now, install Anaconda:
bash anaconda.sh -b -p $HOME/anaconda/
This will install the Miniforge distribution of Python (the lightweight Anaconda variant) onto your system inside your home directory. You can now install packages at will, without needing sudo privileges!