Structure your source code repository sanely
Almost every data science project starts in a Jupyter notebook. This is natural - you're exploring data, testing hypotheses, and building initial models. As the project matures, however, you'll need to transition from exploratory notebooks to maintainable code. Here's how to make that journey smooth and productive.
For the philosophical foundation behind this approach, see the single source of truth principle. This principle guides how we organize code to avoid confusion and conflicts.
For organizing your data access patterns, see how to create a data catalog. Centralizing data loading functions is a key part of maintaining a single source of truth.
For managing your project's version control, see repository organization best practices. A well-structured repository supports the evolution of your source code.
Phase 1: Initial Exploration
Start with your notebooks in a notebooks/ directory. Don't worry about structure yet - focus on understanding your data and problem space. When you find yourself copying code between notebooks, that's your first signal to start organizing your code.
Phase 2: Emerging Patterns
As patterns emerge, start moving reusable code into a proper Python package structure:
myproject/
├── __init__.py
├── data_loaders.py # Data loading functions from your notebooks
├── preprocessing.py # Data cleaning patterns you've developed
├── models.py # ML models you're experimenting with
└── utils.py # Helper functions used across notebooks
Don't aim for perfection - just move code that you're reusing. Keep your notebook cells focused on analysis rather than implementation details.
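For example, a data-loading cell that keeps getting copied between notebooks might become a small, documented function in data_loaders.py. This is just a sketch - the file path, column name, and function name are placeholders for whatever your notebooks actually load:

# myproject/data_loaders.py
from pathlib import Path

import pandas as pd


def load_transactions(path: Path = Path("data/transactions.csv")) -> pd.DataFrame:
    """Load the raw transactions table shared across notebooks.

    :param path: Location of the CSV export (placeholder path)
    :return: DataFrame with parsed dates, ready for preprocessing
    """
    return pd.read_csv(path, parse_dates=["created_at"])

Your notebooks then shrink to a single import, from myproject.data_loaders import load_transactions, and every analysis reads the data the same way.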
Phase 3: One-off Scripts
Sometimes you'll need to run a specific analysis or data processing task. Rather than burying it in a notebook, create a script in a scripts/ directory. These scripts use inline script metadata to define their own execution environment, making them completely self-contained.
Here's an example of a well-structured script:
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas>=2.0.0",
#     "scikit-learn>=1.0.0"
# ]
# ///
"""
Retrain model on latest data and export metrics.
This script loads the latest training data, retrains the model,
and exports performance metrics to a JSON file.
Usage:
uv run scripts/retrain_model.py
"""
from pathlib import Path
from myproject.data_loaders import load_latest_data
from myproject.models import train_model
from myproject.utils import export_metrics
# Load the latest data, retrain the model, and export its metrics
data = load_latest_data()
model = train_model(data)
export_metrics(model, Path("outputs/metrics.json"))
The key components of this script:

- Inline metadata: the # /// script block at the top defines the script's requirements - Python version and dependencies. This allows the script to be executed in its own isolated environment.
- Module-level documentation: a docstring that clearly explains what the script does and how to run it with uv run.
- Self-contained execution: you can run this script anywhere with just uv run scripts/retrain_model.py - no need to activate a specific environment or install dependencies manually.
This approach makes your one-off scripts reproducible and portable, as they carry their own environment requirements with them.
Phase 4: Production Structure
As your project matures, you'll want a more complete structure:
myproject/
├── __init__.py
├── cli.py # Command-line tools for routine tasks
├── data_loaders.py # All data ingestion logic
├── models.py # ML/Bayesian models
├── preprocessing.py # Data preprocessing pipelines
├── schemas.py # Data validation schemas
└── utils.py # Utility functions
scripts/ # One-off analysis scripts
tests/ # Unit tests for your package
notebooks/ # Original exploration notebooks
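To give a sense of what cli.py in the structure above might contain, here is a sketch built on the standard library's argparse, reusing the load_latest_data, train_model, and export_metrics functions from the script example earlier. Treat it as one possible shape, not a prescribed implementation:

# myproject/cli.py
import argparse
from pathlib import Path

from myproject.data_loaders import load_latest_data
from myproject.models import train_model
from myproject.utils import export_metrics


def main() -> None:
    """Run routine project tasks from the command line.

    :return: None
    """
    parser = argparse.ArgumentParser(prog="myproject")
    subparsers = parser.add_subparsers(dest="command", required=True)

    retrain = subparsers.add_parser("retrain", help="Retrain the model on the latest data")
    retrain.add_argument("--output", type=Path, default=Path("outputs/metrics.json"))

    args = parser.parse_args()
    if args.command == "retrain":
        # Same steps as the one-off script, now available as a repeatable command
        data = load_latest_data()
        model = train_model(data)
        export_metrics(model, args.output)


if __name__ == "__main__":
    main()

With this in place, collaborators can run python -m myproject.cli retrain instead of hunting through notebooks for the right cells.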
Leveraging AI Assistants
Modern AI tools can significantly accelerate your development. Set up AI assistant configurations to maintain consistency. I include examples of these below, starting with GitHub Copilot's .github/copilot-instructions.md file:
<!-- .github/copilot-instructions.md -->
# Coding Standards
- Always add type hints to function parameters
- Write docstrings in Sphinx format
- Create unit tests for new functions in tests/ directory
- Follow functional programming principles where possible
And then for Cursor's .cursorrules file:
<!-- .cursorrules -->
Docstrings should be in Sphinx format.
Testing framework should be pytest.
Type hints should be required.
These configurations help ensure that AI-generated code follows your project's standards.
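To make that concrete, here is the kind of test these rules nudge an assistant toward - a hypothetical tests/test_preprocessing.py exercising the preprocess_features function shown below, with pytest conventions, type hints, and a Sphinx-style docstring. The assertion that preprocessing preserves the number of rows is an assumption about your pipeline, not a rule:

# tests/test_preprocessing.py
import pandas as pd

from myproject.preprocessing import preprocess_features


def test_preprocess_features_returns_same_length_dataframe() -> None:
    """Preprocessing should return a DataFrame with one row per input row.

    :return: None
    """
    df = pd.DataFrame({"category": ["a", "b"], "value": [1.0, 2.0]})
    result = preprocess_features(df)  # assumed to preserve row count
    assert isinstance(result, pd.DataFrame)
    assert len(result) == len(df)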
Core Development Principles
Throughout this evolution, maintain these principles:
Write Everything Twice (WET): Don't over-engineer early. When you catch yourself copying code a second time, that's when you refactor.
Document As You Code: Write docstrings immediately. AI assistants can help, but you should review and adjust them:
def preprocess_features(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocess feature columns for model training.

    :param df: Raw input DataFrame
    :return: DataFrame with processed features
    """
    # Your implementation
Design Data Structures First: Good data structures simplify your algorithms. For example:
# With plain lists: every membership check scans the whole second list
def find_overlapping_categories(list1, list2):
    return [x for x in list1 if x in list2]

# With sets: the overlap is a single, fast intersection
def find_overlapping_categories(items1, items2):
    return set(items1) & set(items2)
Evolve Gradually: Let your project structure grow with your needs. Start simple and refactor as patterns emerge.
The goal isn't perfect structure from day one. Instead, you want to have a consistent, maintainable codebase that grows naturally with your project's needs. With the above principles in mind, you'll be able to write code that is both functional and easy to understand, even as your project evolves.