Use pyprojroot to define relative paths to the project root
If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.
Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:
from pyprojroot import here
root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"
here() returns a Python pathlib.Path object.
You can go as granular or as coarse-grained as you want.
Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.
from source_package.paths import timeseries_data_dir
import pandas as pd
data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")
Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and all of your scripts can still reference the data file!
See also: Define single sources of truth for your data sources.
Define single sources of truth for your data sources
Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before the days when collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates a spreadsheet_TE_edits.xlsx.
Which version do you trust?
The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of both raw data and derived data (i.e. columns calculated from other columns). The derived data are not documented with why and how they were calculated; their provenance is unknown, meaning we don't know who made those changes or whom to ask about those columns.
Rather than wrestling with multiple sources of truth, a data analysis workflow becomes much more streamlined when you define a single source of truth for raw data that contains nothing derived, and then calculate the derived data in custom source code (see: Place custom source code inside a lightweight package), written so that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground-truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.
If your organization uses the cloud, then AWS S3 (or a compatible bucket store) might be available. A data source might be dumped there and referenced by a single URL. That URL is your single source of truth for that data.
Your organization might have the resources to build out a data store with proper access controls and the like. They might provide a unique key and a software API (RESTful, or a Python or R package) to download data easily. That unique key plus the API defines your single source of truth.
Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file plus access to the shared filesystem is your source of truth.
This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.
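Whichever of these situations you're in, it helps to encode the single source of truth in exactly one place in your custom source package. Here is a minimal sketch of what that could look like in a data/loaders.py module; the URL, file name, and function name are hypothetical placeholders for your own canonical location:

# data/loaders.py -- a minimal sketch; the URL and names below are hypothetical.
import pandas as pd

# The one and only place where the raw data's location is written down.
RAW_TIMESERIES_URL = "https://my-bucket.s3.amazonaws.com/raw/timeseries/2016-2019.csv"

def load_raw_timeseries() -> pd.DataFrame:
    """Load the raw (underived) time series from its canonical location."""
    return pd.read_csv(RAW_TIMESERIES_URL)

Every notebook or script then calls load_raw_timeseries() instead of re-typing a path or URL, so if the data's location ever changes, you only have to edit one line.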
Create runtime environment variable configuration files for each of your projects
When you work on your projects, one assumption you will usually make is that your development environment will look like your project's runtime environment, with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables.)
Here, I'm assuming that you follow the practice described in Use pyprojroot to define relative paths to the project root.
To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables like so:
export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"
We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.
Now, if you're working on a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the conda environment. Then, in your Python .py source files:
import os

from dotenv import load_dotenv
from pyprojroot import here
dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path) # this will load the .env file in your project directory root.
# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")
In this way, your environment variables get loaded into the runtime environment and become available to all child processes started from within that shell (e.g. Jupyter Lab or Python).
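One small defensive habit worth adopting: os.getenv returns None when a variable has never been set, so it can pay to fail loudly at startup rather than deep inside a pipeline. A minimal sketch of such a guard:

# Fail loudly if a required environment variable was never set.
import os

connection_string = os.getenv("DATABASE_CONNECTION_STRING")
if connection_string is None:
    raise RuntimeError(
        "DATABASE_CONNECTION_STRING is not set; "
        "did you create and load your .env file?"
    )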
Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
See also: Set up an awesome default gitignore for your projects
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, minus software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, without the over-engineering that might constrain future flexibility.
One project should get one git repository
This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place falling out of sync with the source code in another place... you get where we're going with this point.
Easy! Create your Git repo for the project, and then start putting stuff in there :).
Enough said here!
What should you name the Git repo? See the page: Sanely name things consistently
After you have set up your Git repo, make sure to Set up your project with a sane directory structure.
Also, Set up an awesome default gitignore for your projects!
Define project-wide constants inside your custom package
There are some "basic facts" about a project that you might want to be able to leverage project-wide. One example of this might be data source files (CSVs, Excel spreadsheets) that you might want convenient paths to (see: Use pyprojroot to define relative paths to the project root).
Assuming you have a custom source package defined (see: Place custom source code inside a lightweight package), this is not difficult at all.
Ensure that you have a constants.py, or something else named sanely, and place all of your constants in there as variables. (Paths should probably go in a paths.py file.) Then, import the constants (or paths) into your source project anywhere you need them!
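For example, a constants.py might look like the sketch below; the specific names and values here are purely illustrative:

# constants.py -- project-wide "basic facts"; names and values are illustrative.
TIMEZONE = "US/Eastern"              # timezone all timestamps are assumed to be in
RAW_DATA_FILENAME = "2016-2019.csv"  # canonical raw data file name
RANDOM_SEED = 42                     # seed used everywhere for reproducibility

Elsewhere in your project, a simple from project_name.constants import RANDOM_SEED is all you need.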
Place custom source code inside a lightweight package
Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?
As soon as you did that, you created two sources of truth for that one function.
Now... even if you intended to modify the function and test the effect of the modification on the rest of the code, you still could have done better.
A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.
Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial. In your src/ directory, ensure you have a few files:
|- project_name/          # should be the same name as the conda environment
   |- data/               # for all data-related functions
      |- loaders.py       # convenience functions for loading data
      |- schemas.py       # this is for pandera schemas
   |- __init__.py         # this is necessary
   |- paths.py            # this is for path definitions
   |- utils.py            # utility functions that you might need
   |- ...
|- tests/
   |- test_utils.py       # tests for utility functions
   |- ...
|- pyproject.toml         # replacement for setup.py
If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)
If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root.
Your pyproject.toml should look like this:
[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:
conda activate project_environment
pip install -e .
This will install the source package in development mode. As you continue to add more code to the custom source package, it will be instantly available to you project-wide.
Now, in your projects, you can import anything from the custom source package.
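For example, assuming the directory tree above and that paths.py defines a data_dir the way the pyprojroot note shows (the CSV file name here is just a placeholder):

# Usable from any notebook or script once the package is installed in editable mode.
import pandas as pd
from project_name.paths import data_dir

df = pd.read_csv(data_dir / "some_file.csv")  # placeholder file name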
Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried a few ways of organizing things, though, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.
As often as you need it!
Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!
It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!