PyDS: A wrapper for creating, configuring, and managing your data science projects
Why does this project exist?
We started by asking a few simple questions:
Why do I have to memorize 4 bash incantations in order to release a Python package?
Why do I have to remember so many sequences of commands to do anything?
What is the kind of tooling that we need to support making "good" workflows easy?
PyDS was born in response to these questions. We'd rather avoid the frustration of memorizing commands from a smattering of tools and repeatedly recalling a particular folder structure from memory in order to set up our projects and perform common tasks (such as Python package publishing).
PyDS follows the philosophy that in order for data scientists to be efficient, they must have tooling at hand that automates the mundane, reduces the number of commands that they need to remember, and makes the sane things easy to do (that's a riff on the security folks' mantra, "making the right things easy to do").
In the spirit of automation, my aim with this project is to bring sanity to project initialization.
Quickstart
Ensure that you have the Anaconda distribution of Python installed, and that conda can be found using your PATH environment variable.
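One way to confirm the prerequisite is a quick lookup from Python. This is a minimal sketch using only the standard library; the helper name conda_on_path is made up for illustration and is not part of PyDS.

```python
import shutil


def conda_on_path() -> bool:
    """Return True if an executable named `conda` is found via PATH.

    Hypothetical helper for illustration; PyDS does not necessarily
    perform this check in this way.
    """
    # shutil.which searches the directories listed in PATH and returns
    # the full path of the first match, or None when nothing matches.
    return shutil.which("conda") is not None


if __name__ == "__main__":
    if conda_on_path():
        print("conda was found on PATH")
    else:
        print("conda was not found on PATH; check your Anaconda installation")
```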
Then, install from PyPI:
pip install pyds
For more information, take a look at the CLI page to see what commands exist!
Design philosophy
PyDS wraps workflows. Workflows are verbs that, under the hood, are implemented by a chain of shell commands. For more details, see the Design Philosophy page.
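The idea of a workflow as a verb backed by a chain of shell commands can be sketched roughly as follows. This is an illustration only, not PyDS's actual implementation: the function run_workflow and the example workflow GREET_WORKFLOW are hypothetical names, and the steps are harmless placeholder commands rather than real project tasks.

```python
import subprocess
import sys
from typing import List, Sequence


def run_workflow(commands: Sequence[Sequence[str]]) -> List[str]:
    """Run each command in order and collect its standard output.

    Hypothetical sketch: one "verb" maps to an ordered chain of commands.
    """
    outputs = []
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run after an earlier step fails.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        outputs.append(result.stdout.strip())
    return outputs


# A placeholder workflow. A real one might chain environment creation,
# package installation, and so on; those commands are not shown here.
GREET_WORKFLOW = [
    [sys.executable, "-c", "print('step 1')"],
    [sys.executable, "-c", "print('step 2')"],
]

if __name__ == "__main__":
    print(run_workflow(GREET_WORKFLOW))
```

The user only has to remember the verb; the ordered list of commands behind it lives in one place and stops at the first failing step.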
Contributing
To learn how to contribute, head over to the Contributing page.
Inspirations
PyDS is inspired by many conversations and by reading others' work. I would like to acknowledge their ideas.
Cookiecutter Data Science
Cookiecutter Data Science (CDS) provided a great starting point for the directory structure. There are places where we deviate from CDS, such as omitting a data/ directory, because in the cloud age, we should be securely referencing single sources of truth for our data by way of URIs, S3 buckets (or compatible), database connections, and more. (My opinion is that data should not live in a project source repository.) Without CDS, the inspiration for automation would not have existed.
Data Science Bootstrap Notes
This is my online book in which I documented a lot of the workflows and best practices that I developed over my career as a data scientist. It has some deficiencies, however, including too much focus on tools and too little on workflows. With PyDS, my goal is to bring the focus back to workflows.
How to organize your data science project
Many years ago (in 2017, to be precise), I wrote down my first ideas on the theme of "good data science project organization". The result was a GitHub gist with a lot of ideas, but no automation provided.
The Good Research Code Handbook
This is an excellent resource that I got wind of in December 2021. It is a detailed handbook that lays out step-by-step instructions for structuring your data science and/or research project code.
Conversations with colleagues at Moderna
My conversations with colleagues on the DSAI team at Moderna were highly informative for this project.