Opinionated choices
Dependence on conda
The Anaconda distribution of Python has become the de facto distribution for data scientists, so PyDS CLI leans on conda for creating and managing project environments.
Embracing software development practices
I strongly believe that models, at their core, are software. Hence, workflows commonly associated with software development, such as writing tests and documentation, ought to be part of a data scientist's workflow as well. If we don't embrace software development ideas as data scientists, we set ourselves up for future headaches and erect barriers to current collaboration.
Software development workflows rely heavily on the concept of a single source of truth. This is the spirit behind the term refactoring, in which we extract code that might otherwise be copied and pasted. The extracted piece might be a function that, after refactoring, we can import and use elsewhere. It might be a Markdown document containing units of ideas that we refer to elsewhere. Or it might be a reference Jupyter notebook that teaches us the linearized logic of the analysis; in other words, an authoritative "report" that forms the basis of our knowledge-sharing efforts. Buying in to authoritative single sources of truth is what prevents us from sloppily duplicating notebooks, copying and pasting code, and creating a world of confusion for our future selves and collaborators.
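As a small, concrete illustration, here is a hypothetical helper that might otherwise be copied and pasted across notebooks; the module and function names are invented for this example:

```python
# my_project/preprocessing.py -- a hypothetical module in the project's
# source library. Instead of pasting this logic into every notebook,
# we define it once and import it everywhere.
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names and replace spaces and dashes with underscores."""
    return df.rename(
        columns=lambda name: name.strip().lower().replace(" ", "_").replace("-", "_")
    )
```

After refactoring, `from my_project.preprocessing import clean_column_names` becomes the single authoritative way to perform that cleaning step.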
Notebooks are best used for specific purposes
From personal experience and from reading about others' experiences, we've come to the conclusion that notebooks are best used as
- A space for experimentation and prototyping, and
- A medium for providing documentation to others.
Usage in any other form makes us more prone to sloppy workflows. (To be clear, I'm not saying sloppiness is guaranteed; I am merely saying that we end up with fewer mental guardrails against it.)
Custom source code library
Based on the philosophy of a single source of truth, PyDS CLI creates a custom source code library for your project and installs it into the automatically created conda environment, named so that you can import it from notebooks and other source code. By default, this is a local, editable install, so you can freely modify the source code and use the new code immediately, without reinstalling the package after each change.
This source code library is a great place to refactor out code! For example, if you write a custom PyMC3 or JAX model, you can stick it into models.py. If you have data preprocessing code, you can stick it into preprocessing.py. And if you want to automatically validate dataframes, write a pandera schema and stick it in schemas.py (see the sketch below)!
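As one concrete example, a schemas.py entry might look like this minimal pandera sketch; the column names and checks are assumptions made for illustration:

```python
# my_project/schemas.py -- a hypothetical pandera schema for a raw dataset.
import pandera as pa

raw_data_schema = pa.DataFrameSchema(
    {
        "age": pa.Column(int, checks=pa.Check.ge(0)),
        "income": pa.Column(float, checks=pa.Check.ge(0.0), nullable=True),
    }
)

# Elsewhere (preprocessing code or a notebook), validation is a single call:
# validated_df = raw_data_schema.validate(raw_df)
```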
Testing with pytest
Software testing gives you a contract between your current self, your future self, and your collaborators. By writing a test down, you're recording what you expect the behaviour of a function to be at the time of implementation. If in the future someone changes the function, or changes something that alters the function's expected inputs, a test will help you catch deviations from those expectations.
pytest is the modern way to handle software testing, abstracting away lots of boilerplate that would otherwise be written using the built-in unittest Python library.
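A minimal sketch of what such a test might look like, reusing the hypothetical clean_column_names helper from earlier (pytest automatically discovers any test_* function in a test_*.py file):

```python
# tests/test_preprocessing.py
import pandas as pd

from my_project.preprocessing import clean_column_names  # hypothetical module


def test_clean_column_names():
    df = pd.DataFrame({"First Name": ["Ada"], "Last-Name": ["Lovelace"]})
    result = clean_column_names(df)
    # Recorded expectation: lowercase, with spaces and dashes as underscores.
    assert list(result.columns) == ["first_name", "last_name"]
```

Running `pytest` at the project root picks this up and flags any future deviation from the recorded expectation.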
Docs with mkdocs
MkDocs is a growing and popular alternative to Sphinx, which has historically been the backbone of Python project documentation. Being able to use pure Markdown to write docs confers many advantages, the biggest being accessibility and ease of use: the syntax is widely known and easy for newcomers to pick up, one rarely needs to consult Markdown documentation to look up syntax, it is easily portable to other formats (PDF, LaTeX, HTML, etc.), and, for web-targeted documentation, advanced users can easily mix in custom HTML as necessary.
Building docs with Jupyter notebooks
Jupyter notebooks are awesome for prototyping and documenting.
By including mknotebooks automatically, we give ourselves the superpower of including Jupyter notebooks inside our docs.
If paired with a CI/CD bot that always executes the notebooks included in the docs, our docs become living docs and can effectively serve as an "integration test" for our analysis code.
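As a hedged sketch of how such a bot might re-execute a notebook, here is one possibility using papermill (one tool among several; PyDS CLI may wire this step up differently, and the paths here are hypothetical):

```python
# A sketch of a CI step that re-executes a docs notebook top-to-bottom.
# Any error raised inside the notebook fails the run, which is what makes
# the docs behave like an integration test.
import papermill as pm

pm.execute_notebook(
    "docs/notebooks/analysis.ipynb",           # source notebook in the docs
    "docs/notebooks/analysis_executed.ipynb",  # freshly executed copy to publish
)
```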
Data never live in the repo
Data should never live in the repository.
Data should never live in the repository.
Data should never, ever live in the repository.
In a cloud-first world, we should be able to reliably store files in a remote location and pull them onto our machines on demand.
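One minimal, stdlib-only sketch of that pattern (the URL and cache location are placeholders; in practice you might reach for a dedicated data-versioning or download tool instead):

```python
# A sketch of pulling data on demand rather than committing it to the repo.
from pathlib import Path
from urllib.request import urlretrieve


def fetch_raw_data(url: str = "https://example.com/data/raw.csv") -> Path:
    """Download the raw data into a local cache if it isn't already there."""
    cache_dir = Path.home() / ".cache" / "my_project"  # hypothetical cache dir
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / "raw.csv"
    if not target.exists():
        urlretrieve(url, target)  # pull from the remote location on demand
    return target
```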
Leverage CI/CD as a bot
CI/CD gets bandied about as a fancy term in some circles. I tend to think of CI/CD in a much, much simpler fashion: it's a robot. 😅 CI/CD systems, such as GitHub Actions, are robots that you can program using YAML (or, gasp! Jenkins) files rather than Bash scripts alone. We provide basic CI/CD configuration in the form of GitHub Actions workflow files that automatically run code checks, tests, and docs builds on GitHub.
GitHub Actions
We target GitHub Actions as an opinionated choice. GitHub Actions hits the right level of abstraction for CI/CD pipelining: the syntax is easy to learn, its components mix and match easily, and it is free for open source projects. PyDS CLI ships with GitHub Actions workflows that check your code style and automatically run your tests; you don't have to remember to add them in! (Feel free to delete these files if they're not necessary; they're generally harmless outside of GitHub, so rather than asking whether you want them, we leave them for you to remove.)
Pre-commit hooks
We primarily use pre-commit hooks to help you catch code quality issues before you git commit your changes. We use the pre-commit framework to manage the code checking tools. You can think of pre-commit as a bot that runs code checks before you're allowed to commit the code; if any check fails, you'll be alerted. We use checkers for docstring coverage, docstring argument coverage, code formatting, and ensuring notebook outputs are never committed.
Using pre-commit can be a bit jarring at first, because a single commit might surface a ton of issues to fix. That said, if you commit early and often, you'll catch code quality issues on a faster turnaround cycle, which makes fixing them much easier than otherwise.
Development Containers
Development containers are another innovation in the developer tooling space, giving us the ability to work in an even more isolated environment than conda environments alone. If we were to use Russian Matryoshka dolls (the nested dolls) as an analogy, Docker-based development containers are the largest doll, shipping an entire computer, while conda environments are the second-largest doll, shipping an isolated Python interpreter. Development containers work well with VSCode; we ship an opinionated VSCode dev container Dockerfile definition that VSCode can automatically discover and build for you. If dev containers and, more broadly, GitHub Codespaces are something you use, you'll have no barriers to starting them up.