Sanely name things consistently
Think about the following scenario:
Would you ever be able to mentally map them to one another? Probably not, though with enough effort you might. And if you work with someone else on the project, you only increase the amount of mental work they need to do to keep things straight.
Now, consider a different scenario:
Sales Forecast 2020
Does the latter seem saner? I think so too :).
I think the following guidelines help:
I would add that learning how to name things precisely in English, and hence provide precise variable names in Python code, is a great way for speakers of English as a second language to practice and expand their vocabulary.
As one of my reviewers (Logan Thomas) pointed out, leveraging the name to help newcomers distinguish between entities is helpful too. For this reason, your environment can be suffixed with a consistent noun. For example, I use -dev as a suffix for software package-oriented projects; above, we used -env as a suffix (making sales-forecast-2020-env) to indicate to a newcomer that we're activating an environment when we run conda activate sales-forecast-2020-env. As long as you're consistent, that's not a problem!
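To make the idea concrete, here is a minimal sketch of deriving every name from one project title, so that the names cannot drift apart. The helper names (slugify, project_names) are hypothetical, and the choice to drop the suffix and swap hyphens for underscores in the package name is my assumption, made because Python import names cannot contain hyphens.

```python
import re


def slugify(title: str) -> str:
    """Lowercase a human-readable title and join its words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def project_names(title: str) -> dict:
    """Derive one consistent name per project artifact from a single title."""
    slug = slugify(title)
    return {
        "repository": slug,
        # Suffix convention from the text: "-env" marks a conda environment.
        "environment": f"{slug}-env",
        # Assumption: import names cannot contain hyphens, so use underscores.
        "package": slug.replace("-", "_"),
    }


names = project_names("Sales Forecast 2020")
print(names)
```

Whatever convention you pick, generating the names from a single source is one way to stay consistent without relying on memory.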
Place custom source code inside a lightweight package
Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?
As soon as you did that, you created two sources of truth for that one function.
Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.
A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.
Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.
In your project's src/ directory, ensure you have a few files:
|- src/
   |- setup.py
   |- source_package/   # rename this to the same name as the conda environment
      |- data/          # for all data-related functions
         |- loaders.py  # convenience functions for loading data
         |- schemas.py  # this is for pandera schemas
      |- __init__.py    # this is necessary
      |- paths.py       # this is for path definitions
      |- utils.py       # utility functions that you might need
      |- ...
   |- tests/
      |- test_utils.py  # tests for utility functions
      |- ...
If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)
If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root
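For intuition only, here is a standard-library sketch of the kind of lookup that paths.py relies on: walking upward from a file until a marker of the project root is found. This is a stand-in that I wrote for illustration, not pyprojroot itself; the function name find_project_root and the choice of .git as the marker are assumptions.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")


# Hypothetical usage inside paths.py:
#     ROOT = find_project_root(Path(__file__).parent)
#     DATA_DIR = ROOT / "data"
```

Defining paths once, relative to the root, means notebooks in any subdirectory can import the same path objects instead of hard-coding their own.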
setup.py should look like this:
import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="source_package",  # Replace with your environment name
    version="0.1",  # Replace with anything that you need
    packages=setuptools.find_packages(),
)
Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:
conda activate project_environment
cd src
pip install -e .
This will install the source package in development mode. As you continue to add more code to the custom source package, it will be instantly available to you project-wide.
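If you want a quick sanity check that the install worked, one option is to ask the active interpreter whether it can find the package. This is only a sketch; is_importable is a hypothetical helper, and "source_package" is the placeholder name used above.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if the active interpreter can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


# Substitute your own package name for the placeholder.
print(is_importable("source_package"))
```

If this prints False, the most likely causes are that a different environment is active or that the install step was run outside src/.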
Now, in your projects, you can import anything from the custom source package.
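As an illustration, suppose a small helper that used to be copy-pasted between notebooks now lives in utils.py. The function clean_column_name below is hypothetical; I made it up only to show the shape of the workflow.

```python
# Hypothetical contents of src/source_package/utils.py
def clean_column_name(name: str) -> str:
    """Normalize a column name: trim whitespace, lowercase, spaces to underscores."""
    return name.strip().lower().replace(" ", "_")


# In any notebook, after `pip install -e .`, you would import the function
# instead of copy-pasting it:
#     from source_package.utils import clean_column_name
print(clean_column_name("  Units Sold "))
```

If the function later needs to change, you change it in one place, and every notebook that imports it picks up the new behavior.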
Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried organizing things a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep them inside src/, thus eliminating clutter from the top-level directory.
As often as you need it!
Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!
Follow the rule of one-to-one in managing your projects
The one-to-one rule essentially means this. Each project that we work on gets:
In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).
Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.
Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.
A few guidelines can help you decide.
When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.
When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.
Create one conda environment per project
If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.
"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. Imagine finding that out one day before an important presentation, when you have to put out figures. The horror!
You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.
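One lightweight guard, sketched below, is to check at the top of a script that the environment you expect is the one that is active. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper names are hypothetical. Note that this reflects the shell that launched the process, so treat it as a convenience check rather than a guarantee.

```python
import os
from typing import Mapping, Optional


def active_conda_env(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the active conda environment, or None if unset."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def assert_expected_env(expected: str, environ: Optional[Mapping[str, str]] = None) -> None:
    """Raise if the active conda environment is not the expected one."""
    active = active_conda_env(environ)
    if active != expected:
        raise RuntimeError(
            f"Expected conda environment {expected!r}, but {active!r} is active."
        )


print(active_conda_env())
```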
Here is a baseline that you can copy and modify at any time.
name: project  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
  - conda-forge
dependencies:  ## Prioritize conda packages
  - python=3.8
  - jupyter
  - conda
  - mamba
  - ipython
  - ipykernel
  - numpy
  - matplotlib
  - scipy
  - pandas
  - pip
  - pre-commit
  - black
  - nbstripout
  - mypy
  - flake8
  - pycodestyle
  - pydocstyle
  - pytest
  - pytest-cov
  - pytest-xdist
  - pip:       ## Add in pip packages if necessary
    - mkdocs
    - mkdocs-material
    - mkdocstrings
    - mknotebooks
If a package is available through both conda and pip, and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver. Even so, staying consistent is preferable as far as possible, though mixing and matching is alright if you know what you're doing.)
This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.
Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.
However, there may come a time when a new package version brings a new capability. That is when you may wish to pin that particular package to at least that version. (See below for the syntax needed to pin a version.) At the same time, a new package version may break compatibility; in this case, you will want to pin it to a maximum package version.
It's not always obvious, though, so be sure to use version control.
If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.
For conda, the modifiers are >, >=, =, <=, and <. (You should be able to grok what is what!)
For pip, the modifiers are >, >=, ==, <=, and <. (Note: for pip, it is double equals == and not single equals =.)
So when do you use each of the modifiers?
Use == (or = for conda) sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
Use <= and < to prevent conda and pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
Use >= and > to prevent conda and pip from installing a package below a certain version. This is helpful if you've come to depend on API changes that broke compatibility with older versions.
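To build intuition for what minimum, maximum, and exact pins admit, here is a deliberately simplified sketch. It uses pip-style == and handles only purely numeric versions; it is not how conda or pip actually resolve versions, and the helper names are hypothetical.

```python
import operator

OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(version: str, width: int = 3) -> tuple:
    """Turn a numeric version such as '1.4' into a zero-padded tuple like (1, 4, 0)."""
    parts = [int(part) for part in version.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version: str, pin: str) -> bool:
    """Check `version` against a comma-separated pin such as '>=1.2,<2.0'."""
    for clause in pin.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not OPERATORS[symbol](parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"Unrecognized pin clause: {clause!r}")
    return True


for candidate in ["1.1.9", "1.2", "1.5.0", "2.0"]:
    print(candidate, satisfies(candidate, ">=1.2,<2.0"))
```

A combined pin such as a minimum plus a maximum keeps you on versions you have tested while still leaving the solver some room to move.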
Upgrading and/or installing packages should be done on an as-needed basis. I have found two ways to upgrade packages:
The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:
conda env update -f environment.yml
The hacky way to do the upgrade is to directly pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!
Once you practice "one project gets one environment", ensuring that those environments' Python interpreters are available to Jupyter becomes crucial. If your project's environment does not show up in Jupyter, first ensure that the environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:
# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME
Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)
Now, how should you name your conda environment? See the page: Sanely name things consistently!
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.