Write effective documentation for your projects
As your data science project progresses, you should be documenting your work within the project itself. Your future self and your colleagues will need that mental context to get up to speed with the project, and it can mean the difference between staying on course and veering off in unproductive directions.
Useful documentation helps you quickly onboard collaborators to the project. It helps them get oriented and shows them how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.
To write effective documentation, we first need to recognize that there are actually four types of documentation: tutorials, how-to guides, explanations, and reference material.
This is not a new concept; it is well-documented (ahem!) in the Diataxis Framework.
Concretely, here are some kinds of documentation that you will want to focus on.
The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.
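To make this concrete, here is a minimal sketch (the function, columns, and data-cleaning scenario are hypothetical) of a docstring that records the why behind the what:

```python
import pandas as pd


def deduplicate_patients(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate patient records, keeping only the most recent row.

    Why this exists: the upstream export re-sends a patient's row every
    time their record is touched, so the raw table contains multiple rows
    per patient. Downstream models assume one row per patient, hence we
    keep only the latest row as judged by the `updated_at` column.
    """
    return (
        df.sort_values("updated_at")
        .drop_duplicates(subset="patient_id", keep="last")
    )
```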
The second is how-to guides for newcomers to the project (these fall, obviously, under the How-to Guides category). These guides help newcomers get up to speed on the commands needed to set up a local development environment for the project: essentially the sequence of terminal incantations they need to type to start hacking on it. For non-obvious steps, always document the why!
The third involves crafting and telling a story of the project's progress. (We may consider this Explanation-style documentation.) For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your updates will be granular, but they will gain momentum as the project progresses. Doing this is important because reflecting on prior work, summarizing it, and linearizing it for yourself helps you catch logical gaps that need to be filled in, which in turn tells you where to focus your project efforts.
The final one is the project README! The README usually exists as README.md or README.txt in the project root directory and serves a few purposes, usually doubling as both a quick Tutorial and a How-to Guide.
When it comes to documentation tooling, I would advocate that we strive to be simple and automated at the same time. For Pythonistas, there are two predominant options: Sphinx and MkDocs.
At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.
However, if you're not already familiar with Sphinx, I would recommend getting started with MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDocs' most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge, so it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of excellent features that enhance MkDocs!)
Firstly, you should define a single source of truth for the statements you make in your docs. If you can, avoid copy/pasting anything. Related ideas are covered in Define single sources of truth for your data sources.
Secondly, you'll want to pick from several styles of writing. One effective approach is to think in terms of answering critical questions about the project. The questions that commonly show up in data projects mirror those of a scientific research paper and include (but are not limited to):
If your project also encompasses a tool that routinizes the work in a production setting, then consider these additional questions as well:
As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.
Finally, use semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).
I strongly recommend reading the Write The Docs guide to writing technical documentation.
Additionally, Admond Lee has written up more reasons for writing documentation.
Define single sources of truth for your data sources
Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before the days when collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then, at the same time, another person creates spreadsheet_TE_edits.xlsx.
Which version do you trust?
The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of raw data and derived data (i.e. columns calculated from other columns). The derived data are not documented with the why and how of their calculation; their provenance is unknown, in that we don't know who made those changes or whom to ask about those columns.
Rather than wrestling with multiple sources of truth, a data analysis workflow is much more streamlined when you define a single source of truth for raw data that contains nothing derived, and then compute the derived data in custom source code (see: Place custom source code inside a lightweight package), written so that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). That single source of truth can also be described by a ground-truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.
If your organization uses the cloud, then AWS S3 (or a compatible bucket store) might be available. A data source might be dumped there and referenced by a single URL. That URL is your "single source of truth".
Your organization might have the resources to build out a data store with proper access controls and the like. It might provide a unique key and a software API (RESTful, or a Python or R package) to download data easily. That unique key plus the API defines your single source of truth.
Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file plus access to the shared filesystem is your source of truth.
This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.
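As a hedged sketch of how this plays out in code (the URL and column names are hypothetical), the idea is to point your code at the one canonical raw file and compute every derived column in custom source code, never in the spreadsheet itself:

```python
import pandas as pd

# The single source of truth: one canonical URL for the raw data,
# containing no derived columns. (Hypothetical URL, for illustration only.)
RAW_SALES_URL = "https://data.example.com/raw/sales.csv"


def load_raw_sales() -> pd.DataFrame:
    """Load the raw sales table from its single source of truth."""
    return pd.read_csv(RAW_SALES_URL)


def add_derived_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Compute derived columns in code, so their provenance lives here."""
    df = raw.copy()
    # Why: finance reports revenue net of discounts; the raw export stores
    # gross revenue and discounts separately.
    df["net_revenue"] = df["gross_revenue"] - df["discount"]
    return df


if __name__ == "__main__":
    sales = add_derived_columns(load_raw_sales())
    print(sales.head())
```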
Follow the rule of one-to-one in managing your projects
The one-to-one rule essentially means that each project we work on gets:
In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).
Conventions act as a lubricant, a shortcut for interacting with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.
Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.
A few guidelines can help you decide when a project has outgrown these one-to-one mappings.
When a source repository matures to the point where you see a submodule that is generalizable beyond the project itself, it's time to engage the help of a real software developer to refactor that chunk of code out of the project and into a separate package.
When the project matures to the point where there is a natural bifurcation in work that needs more independence from the original repository, it's time to split the repository into two. At that point, apply the same principles to the new repository.
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, minus the stricter practices of software engineering, keeps us primed to think about the generalizability of what we do without the over-engineering that might constrain future flexibility.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns from the beginning, with docs in one place drifting out of sync with the source code in another... you get where we're going with this.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.
Also, Set up an awesome default gitignore for your projects!
Write data descriptor files for your data sources
When you get a new CSV file, how do you know what each column means semantically, which values represent nulls, and other background information about that file?
Usually, we'd go and ask another person. However, that's not scalable. Instead, if we provided a human-readable text file containing all of the aforementioned information, that would be awesome! Enter the data descriptor file. (In the clinical research world, these are also known as "data dictionaries".)
But beyond that, the data descriptor file has another benefit! It takes manual work to sit down, comb through each file, and describe each of its columns, where the data came from, and more. That work is all part of understanding the data-generating process, which is incredibly helpful for downstream modelling efforts. In essence, writing a data descriptor file per data file is a great first step in the exploratory data analysis (EDA) stage, because you are literally exploring the structure of the data.
These are two great reasons to write descriptor files, and they beat the single downside: it takes time.
At its most basic form, you can simply write a README file for each data source. Plain text, fully customizable.
That said, some lightweight structure can help. I have previously opted for a YAML file format, which is both human-readable and computer-parseable. In that YAML file, we can describe the table schema using the frictionless data TableSchema spec. (One can also go for the full JSON version of the spec, but it's not as easy to write by hand.) Choosing to follow a specification effectively gives us a checklist, helping us remember to describe everything that could be necessary!
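To make this concrete, here is a hedged sketch (the file paths, fields, and descriptions are all hypothetical) of writing such a descriptor from Python, following the TableSchema convention of listing each field's name, type, and description alongside the missing-value markers:

```python
import yaml  # requires the pyyaml package

# Hypothetical descriptor for a raw sales table, loosely following
# the frictionless TableSchema layout.
descriptor = {
    "name": "sales",
    "path": "data/raw/sales.csv",
    "description": "Daily sales exported from the point-of-sale system.",
    "schema": {
        "fields": [
            {"name": "date", "type": "date", "description": "Transaction date (UTC)."},
            {"name": "store_id", "type": "string", "description": "Internal store identifier."},
            {"name": "gross_revenue", "type": "number", "description": "Revenue before discounts, in USD."},
        ],
        "missingValues": ["", "NA"],
    },
}

with open("sales.descriptor.yaml", "w") as f:
    yaml.safe_dump(descriptor, f, sort_keys=False)
```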
If you primarily handle tabular data (which, if my understanding is correct, forms the vast majority of data science use cases), then I would strongly suggest using pandera not only to validate your data (see: Validate your data wherever practically possible) but also to generate dataframe schemas that you can store as code. Pandera can generate a starter dataframe schema that you can continually update as data arrive. Storing your data descriptor as code not only lets you annotate it with comments but also lets you use it for validation itself: a double win.
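As a minimal sketch (the column names and checks are hypothetical), a dataframe schema stored as code might look like this:

```python
import pandas as pd
import pandera as pa

# Schema-as-code: the data descriptor doubles as a validator.
sales_schema = pa.DataFrameSchema(
    {
        "store_id": pa.Column(str),
        # Why ge(0): revenue should never be negative in the raw export.
        "gross_revenue": pa.Column(float, checks=pa.Check.ge(0)),
        # Nullable because legacy rows predate discount tracking.
        "discount": pa.Column(float, nullable=True),
    },
    strict=False,  # tolerate extra columns while the project is still evolving
)

if __name__ == "__main__":
    df = pd.read_csv("data/raw/sales.csv")  # hypothetical path
    validated = sales_schema.validate(df)  # raises a SchemaError on violations
```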
Adhere to best git practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit files to Git that you have had to create manually. That usually means version controlling:
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on automatically preventing yourself from committing them.
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear them out using nbstripout. That can be automated before committing through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits)
Install code checking tools to help write better code
If you're writing code, then having code checking tools that automatically flag potential issues while you write is like having a spelling, grammar, and style checker (think Grammarly) always on for your code.
Code style can drift as a project proceeds. If you work with colleagues, code style nitpicks can become a source of frustration in interactions with them. Having code style checkers that automatically flag code style that deviates from a pre-defined norm can go a long way to easing these potential conflicts.
Firstly, code style formatting tools. For Python projects:
You can configure black and isort using a pyproject.toml config file.
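As a hedged sketch, the relevant sections of a pyproject.toml might look like this (the values shown are common defaults; adjust them to your team's preferences):

```toml
[tool.black]
line-length = 88  # black's default; set it explicitly so the whole team agrees

[tool.isort]
profile = "black"  # make isort's import formatting compatible with black
```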
Secondly, tools that catch code problems. For Python projects:
You can configure some of these tools to work with Python source files in VSCode! (see: Use VSCode to help you with software development and collaboration) For example, you can configure VSCode to format your code using black on every save, so you don't have to keep running black before committing your code.
Many authors have written tons of enthusiastic blog posts on Python code style tooling. Here's a sampling of them:
Code style and code format tooling can be a rabbit hole that you run down. Those that I have listed above should give you a great starting point.