Start with a sane repository structure
Each project should get its own Git repository. This helps a ton with organization: when one project maps to one Git repository, you can easily house everything related to that project in one place. I mean everything. This includes:
- source code
- documentation
- data descriptors
- environment/configuration files
In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the outset, with docs in one place falling out of sync with the source code in another place... you get where we're going with this point.
How to structure a standard repository
What should be part of the standard repository structure? I recommend the following minimum structure, assuming you're on GitHub:
├── .bumpversion.cfg
├── .devcontainer
│   ├── devcontainer.json
│   └── Dockerfile
├── .flake8
├── .github
│   ├── copilot-instructions.md
│   └── workflows
├── .gitignore
├── .pre-commit-config.yaml
├── docs
│   ├── api.md
│   ├── apidocs.css
│   ├── config.js
│   └── index.md
├── MANIFEST.in
├── mkdocs.yaml
├── pyproject.toml
├── README.md
├── tests
│   ├── test___init__.py
│   ├── test_cli.py
│   └── test_models.py
└── {{ cookiecutter.__module_name }}
    ├── __init__.py
    ├── cli.py
    ├── models.py
    ├── preprocessing.py
    ├── schemas.py
    └── utils.py
This is taken from the cookiecutter-python-project template, but you can customize it to your liking.
Here's the purpose of each of those directories and files.
GitHub Directory (.github/)
This is a GitHub-specific directory that houses workflows, configuration, and documentation. The workflows directory (.github/workflows) is where you put programmable bots that run on every commit or merge, or even on a cron schedule. You can also store GitHub Copilot instructions and other GitHub-specific configuration here.
Dev Container (.devcontainer/)
This directory contains configuration files for Visual Studio Code's Dev Containers feature. It allows you to define a consistent development environment that can be shared across your team, complete with all necessary dependencies, extensions, and settings.
Documentation (docs/)
This is the place where you would put documentation surrounding your repository. It is also where commonly-used notebooks live, making them easy to find and use as reference material.
Documentation Framework
The Diataxis framework is an excellent resource for structuring documentation. I would strongly recommend checking it out if you don't already have a framework that you use for structuring docs.
Source Code ({{ cookiecutter.__module_name }}/)
This is the source directory. In a Python-centric project, we would usually name it after the project, in snake_case format. This is where all your project's Python modules live.
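One common pattern worth showing here (a minimal sketch, not something the template mandates) is to keep the package version in __init__.py, so that version-bumping tools such as bumpversion, configured via .bumpversion.cfg, have one canonical string to update:

```python
# {{ cookiecutter.__module_name }}/__init__.py -- a minimal sketch.
# Keeping the version string here gives .bumpversion.cfg a single canonical
# place to bump, and lets downstream code introspect the installed version.
__version__ = "0.1.0"
```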
Tests (tests/)
This directory houses software tests. If you're new to testing, don't fret! We'll address the importance of software tests and how to start writing them in the testing chapter. You can mostly ignore this directory in the early stages of a project, but as the code stabilizes and you start to need guarantees that it is correct, it will become important.
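To give a flavour of what lives here, here is a minimal sketch of a first test, matching the tests/test___init__.py file in the tree above. The module name my_project is a stand-in for whatever {{ cookiecutter.__module_name }} renders to in your project:

```python
# tests/test___init__.py -- a minimal sketch of a smoke test.
# "my_project" stands in for the rendered {{ cookiecutter.__module_name }}.
import my_project


def test_package_exposes_a_version_string():
    """Smoke test: the package imports cleanly and advertises a version."""
    assert isinstance(my_project.__version__, str)
```

Running pytest from the repository root will discover and run anything in tests/ whose filename starts with test_.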
Configuration Files (Various)
Several configuration files live in the root directory:
- .bumpversion.cfg: Configures version bumping for your project
- .flake8: Configuration for the Flake8 linter
- .gitignore: Tells Git which files to ignore
- .pre-commit-config.yaml: Configures pre-commit hooks for code quality checks
- MANIFEST.in: Specifies additional files to include in Python package distributions
- mkdocs.yaml: Configuration for the MkDocs documentation generator
- pyproject.toml: Central configuration file for Python project settings, build system requirements, and tool configurations
Documentation Files
- README.md: The first document users see when visiting your repository. It should contain setup instructions and basic project information.
Automate the scaffolding of new projects
So a standard project structure is really awesome, but it's also tedious to scaffold. Who wants to dedicate precious working memory to creating each directory and configuration file by hand? Perhaps me, but (a) even this masochist has his limits, and (b) it's just much more efficient to automate the scaffolding of new project codebases.
To that end, there are three options you can consider.
Option 1 - Template Repositories: This option involves creating template repositories on services like GitHub. The advantage is that it's incredibly easy to bootstrap a new code repository from the web UI, making it an easy on-ramp for less seasoned data science teams. However, relying solely on template repositories means we miss out on automating the setup of the computational environment on development machines, which is equally important.
Option 2 - Cookiecutter Repositories: This option involves creating a cookiecutter repository. This has similar advantages to Option 1, but it also bakes in the ability to execute custom environment setup code through pre-generation and post-generation hooks, thereby automating not only the creation of a project scaffolded in a standard fashion but also a standardized computing environment (see the sketch below). The disadvantage of going this route, however, is that it takes some degree of expertise to set up.
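For a concrete sense of what a post-generation hook can do, here is a minimal sketch. Cookiecutter looks for hooks/pre_gen_project.py and hooks/post_gen_project.py inside the template and runs the latter from the root of the newly generated project; the specific setup commands below are illustrative, so swap in whatever environment tooling your team uses:

```python
# hooks/post_gen_project.py -- a minimal sketch of a cookiecutter post-generation hook.
# Cookiecutter executes this script with the newly generated project as the
# working directory, so the commands below act on the fresh repository.
import subprocess

# Put the new project under version control from the very first commit.
subprocess.run(["git", "init"], check=True)

# Install pre-commit hooks so the checks in .pre-commit-config.yaml run on every commit.
subprocess.run(["pre-commit", "install"], check=True)
```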
Option 3 - Custom CLIs: This option involves creating a custom command line tool that provides a unified interface to multiple tools and systems. Examples include my own pyds-cli, which uses cookiecutter underneath the hood but adds custom commands for managing data science projects, such as cleaning out environments and rebuilding them from scratch.
The real advantage of custom CLIs isn't just project scaffolding—cookiecutter with pre-generation and post-generation bash scripts can handle that perfectly well. Instead, custom CLIs shine when you need to:
- Provide a unified interface to multiple tools (project initialization, environment setup, system logins, etc.)
- Create branded tooling that combines project management with environment management
- Automate complex workflows that span multiple systems or require authentication
- Offer a consistent developer experience across different types of operations
The disadvantage is that a higher skill level is needed compared to the first two options to maintain the CLI and continuously evolve it. That said, with great documentation and the establishment of easily followable patterns within the CLI codebase, it is possible to lower the skill barrier to contribution!
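To make the idea concrete, here is a minimal sketch of what such a CLI might look like. It is written with Typer and shells out to cookiecutter and conda; the template URL, environment name, and commands are all illustrative stand-ins rather than pyds-cli's actual implementation:

```python
# cli.py -- a minimal sketch of a custom project CLI, in the spirit of (but not
# copied from) pyds-cli. Typer and cookiecutter are assumed to be installed.
import subprocess

import typer

app = typer.Typer(help="One entry point for project scaffolding and environment chores.")


@app.command()
def init(template: str = "gh:your-org/cookiecutter-python-project"):
    """Scaffold a new project from the team's standard cookiecutter template."""
    # Delegate the actual scaffolding to cookiecutter under the hood.
    subprocess.run(["cookiecutter", template], check=True)


@app.command()
def rebuild_env(name: str = "my-project"):
    """Tear down and recreate the project's conda environment from scratch."""
    subprocess.run(["conda", "env", "remove", "-n", name], check=True)
    subprocess.run(["conda", "env", "create", "-f", "environment.yml"], check=True)


if __name__ == "__main__":
    app()
```

Registered as a console script in pyproject.toml, this gives every team member the same commands for starting a project and resetting its environment, regardless of what happens underneath.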
So which option should you go with? The choice depends on the skill level of your organization. I view the three options as a progression: it is perfectly fine to introduce your organization to template repositories first, move on to more sophisticated automation with cookiecutter repositories, and only progress to custom CLIs once you need a unified interface to multiple tools and systems. At the risk of tooting my own horn too loudly, I would love it if you would try out pyds-cli and see how it can be further customized!