Categorize everything that you can
Categorize everything that you can in your projects.
This philosophy is best reflected in software development: the best software developers are masters of organization. If you browse a few well-structured projects on GitHub, you'll quickly see this. These projects keep things simple, are modular, have awesome documentation, and rely on single sources of truth for everything. Plain, unambiguous, and organized -- these are the best adjectives to describe them.
At the core of this philosophy is the fact that these developers have thought carefully about categories of things. You can think of a project as being composed of a series of categories of distinct entities: data, notebooks, scripts, source code, and more. They relate to each other in unique ways: data are consumed by notebooks, notebooks import source code, etc. If we're extremely clear about the categories of things that exist for our project, and strive to cleanly describe the relationships between these categories of things, then our projects will become very well-organized.
I believe data science projects ought to be organized the same way, especially if they are collaborative projects involving more than one person. As such, it should be possible for us to adopt a sane way of working that is heavily inspired by the software development world. We thus inject structure into our projects.
Now, structure for the sake of structure is pointless; structure should exist for our utilitarian benefit. We impose a particular file structure so that we can navigate through it and find what we want quickly. We structure our source code so that we can find what we need more easily. With clearly defined categories of things and their relationships, we can more cleanly collaborate with others.
See this philosophy in action
Good categorization is visible in well-organized projects I've worked on:
Project structure that scales
The Repository structure shows categorization in action: docs/, tests/, and src/, each with a clear purpose. No more "where does this file go?" decisions that slow down your thinking.
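As a minimal sketch of what "each directory has one purpose" can look like in practice, the snippet below creates the three top-level categories named above. The `scaffold` helper, the `CATEGORIES` mapping, and the purpose descriptions are hypothetical names chosen for illustration, not part of any particular tool.

```python
from pathlib import Path
import tempfile

# Top-level categories named in the text; the one-line purposes are
# illustrative assumptions, not a prescribed standard.
CATEGORIES = {
    "docs": "documentation",
    "tests": "test suite",
    "src": "importable source code",
}


def scaffold(root: Path) -> list[Path]:
    """Create one directory per category under `root` and return them."""
    created = []
    for name in CATEGORIES:
        path = root / name
        # exist_ok=True makes the helper safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp) / "my-project"):
            print(f"{path.name}/ -> {CATEGORIES[path.name]}")
```

Keeping the category list in one place means that adding a new category later, for example a notebooks directory, is a one-line change.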
Naming consistency across systems
Project naming demonstrates how one project name becomes consistent across git repo (sales-forecast-2020), conda environment (sales_forecast_2020), and Python package (sales_forecast_2020). Same concept, appropriate format for each system. The beauty of this approach is that you stop context-switching between naming conventions.
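The naming rule above can be sketched as a small function that derives all three names from a single project title. The function name `project_names` and the dictionary keys are hypothetical; the hyphen and underscore conventions are the ones shown in the example above.

```python
import re


def project_names(title: str) -> dict[str, str]:
    """Derive consistent names for each system from one project title."""
    # Lowercase the title and keep only alphanumeric words.
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("project title must contain letters or digits")

    hyphenated = "-".join(words)   # convention used for the git repo
    underscored = "_".join(words)  # convention used for env and package

    # A Python package must be importable, so its name must be a valid
    # identifier (for example, it cannot start with a digit).
    if not underscored.isidentifier():
        raise ValueError(f"{underscored!r} is not a valid Python package name")

    return {
        "git_repo": hyphenated,
        "conda_env": underscored,
        "python_package": underscored,
    }


if __name__ == "__main__":
    for system, name in project_names("Sales Forecast 2020").items():
        print(f"{system}: {name}")
```

Deriving every name from one source keeps the single-source-of-truth idea intact: rename the project once and each system's name follows.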
Code organization by function
Source code organization shows how I categorize code by purpose: data_loaders.py, preprocessing.py, models.py, utils.py. Each module has a clear, single responsibility. What are the advantages of this approach? You and your collaborators always know where to find functionality.
Environment categories
Pixi features demonstrates environment categorization: tests, docs, cuda, and default features that combine into purpose-built environments. No more monolithic environment files that take forever to resolve.
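To illustrate the idea of features combining into environments, here is a plain-Python model of it. This is not Pixi's configuration format or API. The package names, the `gpu-tests` environment, and the `resolve` helper are hypothetical, and the rule that every environment also includes the default feature is an assumption of this sketch.

```python
# Hypothetical feature -> dependency mapping; package names are placeholders.
FEATURES = {
    "default": {"python", "pandas"},
    "tests": {"pytest"},
    "docs": {"mkdocs"},
    "cuda": {"cuda-toolkit"},
}

# Each environment is just a list of the extra features it combines.
ENVIRONMENTS = {
    "default": [],
    "tests": ["tests"],
    "docs": ["docs"],
    "gpu-tests": ["tests", "cuda"],
}


def resolve(env: str) -> set[str]:
    """Return the union of dependencies for the features in `env`."""
    # Assumption of this sketch: the default feature is always included.
    deps = set(FEATURES["default"])
    for feature in ENVIRONMENTS[env]:
        deps |= FEATURES[feature]
    return deps


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(f"{env}: {sorted(resolve(env))}")
```

Because each feature is declared once, an environment only pulls in the categories it actually needs, which is the point of avoiding one monolithic environment file.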
Documentation hierarchy
Look at this book's table of contents: Philosophies, Machine Setup, Shell Configuration, Projects, Ways of Working. Each section has a clear scope and purpose. Time will distill the best practices for your specific context, but this structure gives you a starting point.
The result: You and your collaborators always know where everything belongs.