written by Eric J. Ma on 2020-04-21 | tags: data science pathlib python packages tools
If you adopt a proper organizational structure for your data projects,
then each project gets its own directory (i.e. a clean and isolated "workspace")
and its own isolated analysis environment (e.g. a conda environment).
In that workspace, your directory structure might look like this:
project/
- data/
- notebooks/
- src/
- setup.py
- README
As such, your notebooks are all going to be in a different directory from your data.
This is one way of keeping the mind sane, but it comes with a catch:
you might have subdirectories in the notebooks/ directory
that you use to organize the notebooks further,
and multiple notebooks at different depths that use the same file,
leading to brittle path linking.
After all, in one notebook, you might do:
import pandas as pd

df = pd.read_csv("data.csv")
But in another notebook that lives in a different directory, to link to the dataset, you might have to do:
import pandas as pd

df = pd.read_csv("../other_dir/data.csv")
The potential for confusion is just immense here.
A better way is to provide one authoritative path to a particular dataset that you can use. For example:
import pandas as pd

df = pd.read_csv("../data/data.csv")
But even that is a bit tricky: if you move the notebook for whatever good reason, the path to the data might break. It’s still brittle. We need a better way to resolve paths.
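To make the brittleness concrete, here is a minimal sketch using only the standard library. The directory names (including the `exploration` subdirectory) are hypothetical and are created in a temporary directory purely for illustration. It shows the same relative string resolving to the real file from one notebook location and to a non-existent file from another.

```python
import tempfile
from pathlib import Path

# Hypothetical project layout, built in a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    (project / "data").mkdir(parents=True)
    (project / "notebooks" / "exploration").mkdir(parents=True)
    (project / "data" / "data.csv").write_text("a,b\n1,2\n")

    relative = "../data/data.csv"

    # Resolved from notebooks/, the relative path reaches project/data/data.csv.
    from_notebooks = (project / "notebooks" / relative).resolve()
    # Resolved from notebooks/exploration/, the same string lands on
    # project/notebooks/data/data.csv, which does not exist.
    from_subdir = (project / "notebooks" / "exploration" / relative).resolve()

    found_from_notebooks = from_notebooks.exists()
    found_from_subdir = from_subdir.exists()

print(found_from_notebooks)  # True
print(found_from_subdir)     # False
```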
Enter pyprojroot.
Written by my fellow PyData conference doppelgänger Daniel Chen,
it provides a here function that will resolve to your project root directory (hence the package name).
The original was written in R (rprojroot),
and it’s a wonderful tool for data scientists.
Let’s see it in action:
import pandas as pd
from pyprojroot import here

df = pd.read_csv(here() / "data/data.csv")
And voila!
No fragile relative paths,
and no perpetually long chains of ../../..!
Just nice and clean resolution to your project root.
How does it work?
What pyprojroot does underneath the hood is
recursively climb the file tree until it finds
one of a set of pre-specified files
that are commonly found in a project’s root directory.
For example, .git is a common one.
For Python packages, setup.py is another.
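The idea can be sketched in a few lines of standard-library Python. This is not pyprojroot's actual implementation: the function name `find_project_root` and the marker list `DEFAULT_MARKERS` are hypothetical, chosen only to illustrate the "walk upward until a marker file appears" strategy described above.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical marker list for illustration; the real package's defaults may differ.
DEFAULT_MARKERS = (".git", "setup.py", ".here")


def find_project_root(
    start: Path, markers: Iterable[str] = DEFAULT_MARKERS
) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains a marker."""
    start = start.resolve()
    markers = tuple(markers)
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return None  # no marker found anywhere above `start`


# Usage: build a throwaway project and search from a nested notebook directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "project"
    nested = root / "notebooks" / "exploration"
    nested.mkdir(parents=True)
    (root / "setup.py").touch()

    found = find_project_root(nested)

print(found == root)  # True
```

Because the search starts from wherever the code is running, a notebook can be moved to any depth inside the project and still find the same root.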
If your project doesn’t "fit" any of the conventions assumed,
or if you have a fancier structure,
you can always add an empty .here file to your project root,
and configure the project_files keyword argument
so that here only looks for that one authoritative file:
import pandas as pd
from pyprojroot import here

root = here(project_files=[".here"])
df = pd.read_csv(root / "data/data.csv")
And what exactly is the here function returning?
Well, it’s returning a pathlib.Path object,
which overloads the / operator
so that you can join path segments in native Python code!
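Here is a small sketch of that / behaviour using only the standard library. The root path is a made-up example, and `PurePosixPath` is used instead of `Path` only so that the printed output is the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical project root, for illustration only.
root = PurePosixPath("/home/user/project")

# The / operator joins path segments and returns a new path object.
data_file = root / "data" / "data.csv"
print(data_file)  # /home/user/project/data/data.csv

# Under the hood, / calls the __truediv__ method that pathlib defines.
same_file = root.__truediv__("data/data.csv")
print(same_file == data_file)  # True

# The result is a path object rather than a string, so it carries helpers.
print(data_file.name)         # data.csv
print(data_file.suffix)       # .csv
print(data_file.parent.name)  # data
```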
Now, let us all toast to cleaner path resolution in our data projects!
@article{
ericmjl-2020-use-paths,
author = {Eric J. Ma},
title = {Use pyprojroot and Python’s pathlib to manage your data paths!},
year = {2020},
month = {04},
day = {21},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2020/4/21/use-pyprojroot-and-pythons-pathlib-to-manage-your-data-paths},
}