How to write software tests
I'll be honest with you: I used to think testing was a waste of time. "I'll just be careful," I told myself. "I know my code works."
Then came graduate school, and my thesis project changed everything. I had built this complex analysis pipeline that took a full week to run from start to finish. Every time I wanted to make even a small change - maybe fix a bug or try a different approach - I had to run the entire pipeline to make sure I hadn't broken anything. One week to see if my change worked. One week to discover I'd introduced a bug three functions away from where I made the change.
I was paralyzed. I couldn't make progress because I didn't have confidence in my code. The feedback loop was so slow that refactoring became impossible. That's when I learned that testing isn't about being a perfectionist; it's about having the confidence to make your code better.
Everything below follows from that premise. Before I walk through the concrete patterns, there is one modern friction point that can undo the whole point of fast feedback.
Using AI coding agents for tests
AI coding agents are remarkably good at writing tests, but they need
guardrails. In my AGENTS.md (see Building repository memory with AGENTS.md), I
always instruct agents never to modify a test to make it pass; they must
modify the implementation to satisfy the test. Enforcing this norm ensures the
agent builds a robust safety net rather than a suite of "cheating" tests.
What this chapter covers (and what it skips)
This chapter stays close to habits that keep code and data contracts honest from one day to the next: pytest, fixtures, schemas, and a test layout that survives refactoring. That is where most data scientists get the fastest return.
Production ML observability sits outside that frame. Drift monitors, shadow deployments, and online experiment design belong with serving and product loops, not with the baseline testing skills I wish every DS repository had. When the skills overview mentions validating models, it means targeted smoke checks and invariants on small data that you can run in CI; the dedicated section below spells out what that looks like. It does not mean a full MLOps monitoring program.
Later, in the workflow section, I tie the same local test commands to CI and to documentation that should move when behavior changes.
Why should I bother writing tests?
Stated plainly, the thesis I took from that pipeline is this: tests enable refactoring, and refactoring is how good code evolves. Without tests, your code freezes in its initial, imperfect shape because every edit feels like a blind bet.
Data science code has a sneaky way of growing more complex than you originally planned. You start with a simple analysis script, then you add a helper function, then another, and before you know it, you've got a mini-library that other parts of your project depend on.
As that happens, you stop writing code once and walking away. You swap in a cheaper algorithm, harden edge cases you ignored at first, or reshape an interface now that callers exist. Tests turn those edits into bounded experiments: a red bar localizes the mistake instead of leaving you to infer it from downstream weirdness.
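Here is a minimal sketch of what "bounded experiment" means in practice. The names rolling_mean_naive, rolling_mean_fast, and check_rolling_mean are hypothetical, made up for this illustration; the point is only that one behavioral check can guard both the first draft and the cheaper rewrite.

```python
from typing import Callable, List


def rolling_mean_naive(values: List[float], window: int) -> List[float]:
    """First draft: recompute every window sum from scratch (O(n * window))."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def rolling_mean_fast(values: List[float], window: int) -> List[float]:
    """Refactor: maintain a running sum so each step costs O(1)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running = sum(values[:window])
    means = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


def check_rolling_mean(impl: Callable[[List[float], int], List[float]]) -> None:
    """The behavior that any implementation has to satisfy."""
    assert impl([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
    assert impl([1.0, 2.0], 3) == []  # window longer than the data
    assert impl([], 1) == []


def test_rolling_mean_naive():
    check_rolling_mean(rolling_mean_naive)


def test_rolling_mean_fast():
    check_rolling_mean(rolling_mean_fast)
```

If the rewrite mishandles the short-input case, the second test goes red and points at exactly that function, which is a much shorter path to the bug than a strange plot three steps downstream.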
Beyond catching regressions, I've found tests serve as documentation that actually stays up-to-date. When I come back to code I wrote six months ago, the tests remind me exactly how functions are supposed to behave.
How do I actually write tests?
The general pattern is beautifully simple:
- Write a function that does something meaningful (not just wrapping another function, but actual work)
- Test that function by giving it examples and checking that it produces the expected results
My go-to test runner: pytest
I've tried various testing frameworks over the years, but I always come back to
pytest. It's incredibly easy to get started with, yet powerful enough to
handle complex testing scenarios as your projects grow.
Getting started is as simple as:
pixi add pytest # or pip install pytest if you're using pip
What do real tests look like?
With pytest wired in, the useful part is what you prove with it. These blocks mirror the uneven shape that typical data science code takes once production and experimentation meet.
Testing data science functions
# In src/myproject/feature_engineering.py
from typing import List

import numpy as np


def normalize_features(values: List[float], method: str = "standard") -> List[float]:
    """Normalize a list of numerical features.

    Args:
        values: List of numerical values to normalize.
        method: 'standard' for z-score, 'minmax' for min-max scaling.

    Returns:
        List of normalized values.
    """
    if not values:
        return []

    values_array = np.array(values)

    if method == "standard":
        if np.std(values_array) == 0:
            return [0.0] * len(values)
        return ((values_array - np.mean(values_array)) / np.std(values_array)).tolist()
    elif method == "minmax":
        min_val, max_val = np.min(values_array), np.max(values_array)
        if min_val == max_val:
            return [0.0] * len(values)
        return ((values_array - min_val) / (max_val - min_val)).tolist()

    raise ValueError(f"Unknown normalization method: {method}")


def extract_time_features(timestamp: str) -> dict:
    """Extract time-based features from a timestamp.

    Args:
        timestamp: ISO format timestamp string.

    Returns:
        Dictionary with extracted features.
    """
    from datetime import datetime

    dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),
        "is_weekend": dt.weekday() >= 5,
        "quarter": (dt.month - 1) // 3 + 1,
    }
These functions are more representative of what we actually do - they handle edge cases, have multiple code paths, and involve some complexity. Here's how I test them:
# In tests/test_feature_engineering.py
import numpy as np
import pytest

from src.myproject.feature_engineering import extract_time_features, normalize_features


def test_normalize_features_standard():
    """Test standard normalization."""
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    result = normalize_features(values, "standard")

    # Should have mean ~0 and std ~1
    assert abs(np.mean(result)) < 1e-10
    assert abs(np.std(result) - 1.0) < 1e-10


def test_normalize_features_minmax():
    """Test min-max normalization."""
    values = [10.0, 20.0, 30.0]
    result = normalize_features(values, "minmax")

    # Should be scaled to [0, 1]
    assert result == [0.0, 0.5, 1.0]


def test_normalize_features_edge_cases():
    """Test edge cases for normalization."""
    # Empty list
    assert normalize_features([]) == []

    # All same values
    assert normalize_features([5.0, 5.0, 5.0], "standard") == [0.0, 0.0, 0.0]
    assert normalize_features([5.0, 5.0, 5.0], "minmax") == [0.0, 0.0, 0.0]

    # Invalid method
    with pytest.raises(ValueError, match="Unknown normalization method"):
        normalize_features([1, 2, 3], "invalid")


def test_extract_time_features():
    """Test time feature extraction."""
    # Tuesday, January 3rd, 2023 at 2:30 PM
    timestamp = "2023-01-03T14:30:00Z"
    result = extract_time_features(timestamp)

    expected = {
        "hour": 14,
        "day_of_week": 1,  # Tuesday
        "is_weekend": False,
        "quarter": 1,
    }
    assert result == expected


def test_extract_time_features_weekend():
    """Test weekend detection."""
    # Saturday
    timestamp = "2023-01-07T10:00:00Z"
    result = extract_time_features(timestamp)

    assert result["is_weekend"] is True
    assert result["day_of_week"] == 5
Notice how these tests cover the kinds of issues that actually matter in data science: handling edge cases like empty inputs, all-identical values, and invalid parameters. This is the level of complexity where testing really pays off.
What about testing DataFrames?
You might notice I'm not showing examples of testing pandas DataFrames directly. That's intentional! For DataFrame validation - checking that your data has the right columns, types, and value ranges - I recommend using specialized tools like Pandera rather than writing custom tests.
Pandera lets you define schemas that validate your data automatically:
import pandera as pa
from pandera import Column, Check

# Define what your DataFrame should look like
schema = pa.DataFrameSchema({
    "age": Column(int, checks=Check.between(0, 120)),
    "salary": Column(float, checks=Check.greater_than(0)),
    "name": Column(str, checks=Check.str_length(1, 50)),
})

# Validate your DataFrame
validated_df = schema(df)  # Raises error if validation fails
This is much more powerful and maintainable than writing custom assertion-based tests for data validation.
Using fixtures to avoid repetitive setup
One thing I learned the hard way is that creating test data over and over gets tedious. Pytest fixtures solve this beautifully:
# In tests/conftest.py
import pandas as pd
import pytest


@pytest.fixture
def sample_dataframe():
    """Create a sample DataFrame for testing."""
    return pd.DataFrame({
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "salary": [50000, 60000, 70000],
    })


@pytest.fixture
def messy_dataframe():
    """Create a DataFrame with missing values."""
    return pd.DataFrame({
        "A": [1, 2, None, 4, 5],
        "B": [None, 2, 3, 4, None],
        "C": [1, 2, 3, 4, 5],
    })
Now I can use these fixtures in any test:
# In tests/test_with_fixtures.py
# clean_dataframe stands in for your project's own cleaning function
# (assumed here to drop every row that contains a missing value);
# import it from wherever it lives in your package.


def test_with_sample_data(sample_dataframe):
    """Test using the sample DataFrame fixture."""
    assert len(sample_dataframe) == 3
    assert "name" in sample_dataframe.columns
    assert sample_dataframe["age"].mean() == 30


def test_cleaning_with_messy_data(messy_dataframe):
    """Test cleaning with messy data fixture."""
    clean_df = clean_dataframe(messy_dataframe)
    assert len(clean_df) == 2  # Only 2 rows have no NaN values
When those table-shaped setups repeat across files, pytest fixtures prevent silent drift; once the setup is centralized, it's easier to articulate what you assume that data guarantees.
What about testing data assumptions?
This is something I wish I'd learned earlier: you should test not just your functions, but your assumptions about the data itself:
def test_data_validation(sample_dataframe):
    """Test that data meets expected criteria."""
    # Test data types
    assert sample_dataframe["age"].dtype == "int64"
    assert sample_dataframe["salary"].dtype == "int64"

    # Test value ranges
    assert (sample_dataframe["age"] >= 0).all()
    assert (sample_dataframe["salary"] >= 0).all()

    # Test for required columns
    required_columns = ["name", "age", "salary"]
    assert all(col in sample_dataframe.columns for col in required_columns)

    # Test for no missing values in critical columns
    assert sample_dataframe["name"].notna().all()
These kinds of tests have saved me countless times when data sources change unexpectedly.
Reproducibility habits that tests can enforce
Data validation catches bad inputs; randomness and floating-point arithmetic cause a different failure mode, the kind that shows up as "it passed on my laptop" and as red CI with no obvious culprit. A few patterns cover most of what I actually need in a data science repo:
- Controlled randomness in fixtures. If a test trains anything stochastic,
  construct a numpy.random.Generator (or equivalent) with a fixed seed inside
  the fixture and inject it into the code under test when the API allows it. If
  it does not, set the seed at the start of that test module and accept that
  you are testing one trajectory; document that choice.
- Float comparisons. Prefer numpy.testing.assert_allclose (or pytest approx
  helpers) instead of exact equality on floats.
- Same environment shape as CI. When tests fail only in CI, the usual
  suspects are dependency versions, missing env vars, and filesystem paths.
  Pinning environments (for example with Pixi or a lockfile) and running
  pixi run test locally before pushing closes most of that gap; the project
  chapter on repository structure ties those files to how the team works
  together.
None of this replaces a formal reproducibility study for a paper, but it does keep your automated checks from lying to you.
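A minimal sketch of the first two habits, assuming NumPy is installed. The names make_rng and bootstrap_mean are hypothetical stand-ins: in a real suite I would wrap make_rng in a @pytest.fixture, and bootstrap_mean would be whatever stochastic function your project actually exposes.

```python
import numpy as np


def make_rng(seed: int = 42) -> np.random.Generator:
    """Return a freshly seeded generator (wrap this in a pytest fixture)."""
    return np.random.default_rng(seed)


def bootstrap_mean(values: np.ndarray, rng: np.random.Generator, n_draws: int = 200) -> float:
    """Hypothetical code under test: average of bootstrap resample means."""
    draws = rng.choice(values, size=(n_draws, len(values)), replace=True)
    return float(draws.mean(axis=1).mean())


def test_bootstrap_mean_is_reproducible():
    values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    first = bootstrap_mean(values, make_rng())
    second = bootstrap_mean(values, make_rng())

    # Same seed, same trajectory: exact equality is intentional here.
    assert first == second
    # Against the analytic mean, compare with a loose tolerance instead.
    np.testing.assert_allclose(first, values.mean(), atol=0.5)


def test_float_sum_uses_tolerance():
    # 0.1 + 0.2 is not exactly 0.3 in binary floating point.
    assert 0.1 + 0.2 != 0.3
    np.testing.assert_allclose(0.1 + 0.2, 0.3, rtol=1e-12)
```

Injecting the generator, rather than calling a global seed function, keeps each test's randomness independent of the order in which tests run.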
Smoke checks for models and training code
Even with reproducibility handled, you still want proof that fit and predict run
end to end. Once you have a models/ package (or a training script), tests can
stay small and fast if you treat "fit + predict on toy data" as a contract,
not as a full evaluation of model quality.
Useful checks I have actually kept green in CI:
- Train on a tiny slice (dozens of rows, few features) and assert the training step completes without NaNs in loss or metrics you log.
- Thresholds on toy data that are loose enough to survive harmless refactors but tight enough to catch broken wiring; for example accuracy strictly above chance on a linearly separable micro-dataset, or RMSE below a generous ceiling on synthetic noise.
- Shape and type contracts on outputs from predict or transform: expected number of rows, column names for DataFrame outputs, dtypes for arrays.
- Deterministic inference path when the model is fixed: save a small on-disk fixture model or coefficients once, load in tests, and assert predictions match reference values within tolerance.
I still avoid checking in huge golden prediction files unless the team agrees to maintain them; a few numbers from a minimal example beat a brittle blob.
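As a rough illustration of the "fit + predict on toy data" contract, here is a sketch that uses a plain least-squares fit as a stand-in for real training code. fit_linear, predict_linear, and make_toy_data are hypothetical names, and the RMSE ceiling of 0.5 is an arbitrary generous value chosen for this synthetic data, not a recommendation.

```python
import numpy as np


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Hypothetical training step: least-squares coefficients with intercept."""
    X_design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef


def predict_linear(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Hypothetical inference step matching fit_linear."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return X_design @ coef


def make_toy_data(seed: int = 0, n_rows: int = 40):
    """Synthetic data: y = 1 + 2*x0 - 3*x1 plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 2))
    y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n_rows)
    return X, y


def test_fit_predict_smoke():
    X, y = make_toy_data()
    coef = fit_linear(X, y)
    preds = predict_linear(coef, X)

    # Shape and type contracts
    assert coef.shape == (3,)
    assert preds.shape == y.shape
    assert np.issubdtype(preds.dtype, np.floating)

    # No NaNs or infs leaked out of training or inference
    assert np.isfinite(coef).all()
    assert np.isfinite(preds).all()

    # Generous ceiling: the noise std is 0.1, so 0.5 only trips on broken wiring
    rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
    assert rmse < 0.5
```

The test says nothing about whether the model is good; it only proves the pieces still connect, which is the failure a refactor is most likely to introduce.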
Don't forget to test error conditions
Your functions should fail gracefully, and you should test that they do:
import pytest


def divide_numbers(a, b):
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


def test_divide_by_zero():
    """Test that division by zero raises appropriate error."""
    with pytest.raises(ValueError, match="Cannot divide by zero"):
        divide_numbers(10, 0)


def test_divide_normal():
    """Test normal division."""
    assert divide_numbers(10, 2) == 5.0
How do I actually run these tests?
The beauty of pytest is its simplicity:
# Run all tests
pytest
# Run tests with more details
pytest -v
# Run just one test file
pytest tests/test_math_utils.py
# Run just one specific test
pytest tests/test_math_utils.py::test_add_numbers
Configuring pytest for your project
I like to configure pytest once and forget about it. You can do this in your
pyproject.toml:
[tool.pytest.ini_options]
addopts = "-v --tb=short"
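This tells pytest how to behave on every run. If you use an import layout like
src/myproject, also make the package importable, either by setting pytest's
pythonpath option (available since pytest 7) or by installing the project in
editable mode, so imports resolve cleanly.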
What are some advanced patterns worth knowing?
Parametrized tests - testing multiple scenarios efficiently
Sometimes you want to test the same function with many different inputs. Parametrized tests are perfect for this:
import pytest

# add_numbers is the function under test; import it from your own module.


@pytest.mark.parametrize("a,b,expected", [
    (1, 2, 3),
    (0, 0, 0),
    (-1, 1, 0),
    (10, -5, 5),
])
def test_add_numbers_parametrized(a, b, expected):
    """Test add_numbers with multiple parameter sets."""
    assert add_numbers(a, b) == expected
This runs four separate tests with different inputs, but with much less code duplication.
Testing file operations
Sometimes you need to test functions that read or write files. Here's a safe way to do it:
import os
import tempfile


def test_file_processing():
    """Test function that processes files."""
    with tempfile.NamedTemporaryFile(mode="w", delete=False) as f:
        f.write("test,data\n1,2\n3,4\n")
        temp_file = f.name

    try:
        # Test your file processing function
        result = process_csv_file(temp_file)
        assert len(result) == 2
    finally:
        os.unlink(temp_file)
This writes a temporary file, runs your function against it, and deletes the file in the finally block, so cleanup happens even when the assertion fails.
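pytest also ships a built-in tmp_path fixture that hands each test its own temporary directory as a pathlib.Path, which removes the need for manual cleanup. Here is a sketch of the same idea using it; this process_csv_file is a hypothetical implementation written only so the example is self-contained, and yours will differ.

```python
import csv
from pathlib import Path
from typing import Dict, List


def process_csv_file(path: Path) -> List[Dict[str, str]]:
    """Hypothetical function under test: read a CSV into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def test_process_csv_file_reads_all_rows(tmp_path: Path):
    # pytest injects a fresh, per-test temporary directory as tmp_path
    csv_path = tmp_path / "input.csv"
    csv_path.write_text("test,data\n1,2\n3,4\n")

    result = process_csv_file(csv_path)

    assert len(result) == 2
    assert result[0] == {"test": "1", "data": "2"}
```

Because every test gets its own directory, two file-based tests cannot trip over each other's leftovers.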
How do I organize my tests?
I've found it's easiest to mirror your source code structure in your test directory. This makes it obvious where to find tests for any given module:
src/
└── myproject/
    ├── __init__.py
    ├── feature_engineering.py
    ├── models/
    │   ├── __init__.py
    │   ├── linear_model.py
    │   └── tree_model.py
    └── utils/
        ├── __init__.py
        ├── data_loader.py
        └── metrics.py
tests/
├── conftest.py                      # Shared fixtures
├── test_feature_engineering.py      # Tests for feature_engineering.py
├── models/
│   ├── test_linear_model.py         # Tests for models/linear_model.py
│   └── test_tree_model.py           # Tests for models/tree_model.py
├── utils/
│   ├── test_data_loader.py          # Tests for utils/data_loader.py
│   └── test_metrics.py              # Tests for utils/metrics.py
└── integration/
    └── test_full_pipeline.py        # End-to-end tests
This structure makes it immediately clear which test file corresponds to which
source module. When I'm working on src/myproject/models/linear_model.py, I
know exactly where to find its tests: tests/models/test_linear_model.py.
I put shared fixtures in conftest.py files at the appropriate level - the
root conftest.py for project-wide fixtures, and subdirectory conftest.py
files for module-specific shared fixtures.
After folders make ownership obvious, the remaining readability wins come from how you name behaviors and arrange each test body.
What about test naming?
I use descriptive names that tell me exactly what's being tested:
- test_calculate_mean_with_empty_list
- test_clean_dataframe_removes_nan_rows
- test_divide_by_zero_raises_error
The pattern I follow is:
test_function_name_when_condition_then_expected_result
How do I structure individual tests?
I follow the AAA pattern - Arrange, Act, Assert:
def test_function_name():
    """Test description."""
    # Arrange: Set up test data
    data = [1, 2, 3, 4, 5]

    # Act: Execute the function
    result = calculate_mean(data)

    # Assert: Check the result
    assert result == 3.0
This makes tests easy to read and understand.
How do I know if I'm testing enough?
I use pytest-cov to check test coverage:
# Install coverage
pixi add pytest-cov
# Run tests with coverage
pytest --cov=src
# Generate a nice HTML report
pytest --cov=src --cov-report=html
But here's the thing: 100% coverage doesn't mean your tests are perfect. It just means every line of code gets executed. Focus on testing the behavior that matters.
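A small sketch of why the coverage number alone can mislead. safe_ratio is a hypothetical helper invented for this example; the first test below executes every line of it, so a line-coverage report would show 100%, yet it checks nothing.

```python
import pytest


def safe_ratio(numerator: float, denominator: float) -> float:
    """Hypothetical helper with a failure mode that coverage will not reveal."""
    return numerator / denominator


def test_safe_ratio_runs():
    # Executes every line of safe_ratio but asserts nothing about the result.
    safe_ratio(4.0, 2.0)


def test_safe_ratio_returns_quotient():
    # Same coverage number as above, but now the behavior is pinned down.
    assert safe_ratio(4.0, 2.0) == 2.0


def test_safe_ratio_zero_denominator():
    # The case that line coverage never prompts you to think about.
    with pytest.raises(ZeroDivisionError):
        safe_ratio(1.0, 0.0)
```

All three tests leave the coverage figure unchanged; only the last two tell you anything about what the function promises.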
How does this fit into my development workflow?
I add tasks to pyproject.toml under Pixi's [tool.pixi.tasks] block so I can
run tests without remembering long commands:
[tool.pixi.tasks]
test = { cmd = "pytest" }
test-cov = { cmd = "pytest --cov=src --cov-report=term-missing" }
Then:
pixi run test
That same convention ports cleanly into CI on every pull request. This is the place where the scope note above becomes concrete: for how those workflows are structured and what belongs in YAML versus in narrative docs, see Use CI/CD to automate tasks and Store your project documentation in your project repository.
What mistakes should I avoid?
Here are the testing antipatterns I've learned to avoid:
- Testing implementation details - Test what the function does, not how it does it
- Overly complex tests - If your test is hard to understand, it's probably testing too much
- Testing external dependencies - Mock APIs and databases; test your code, not theirs
- Skipping edge cases - Empty inputs, zero values, negative numbers - these are where bugs hide
- Tests that depend on each other - Each test should be able to run independently
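To illustrate the points about external dependencies and independent tests, here is a sketch using unittest.mock from the standard library. fetch_latest_price and the client's get_quote method are hypothetical; substitute whatever API wrapper your project actually uses.

```python
from unittest.mock import Mock


def fetch_latest_price(client, ticker: str) -> float:
    """Hypothetical function under test: wraps a call to an external API client."""
    payload = client.get_quote(ticker)
    if "price" not in payload:
        raise KeyError(f"No price in response for {ticker}")
    return float(payload["price"])


def test_fetch_latest_price_parses_response():
    # Stand-in for the real client, so no network call happens.
    fake_client = Mock()
    fake_client.get_quote.return_value = {"price": "101.25"}

    assert fetch_latest_price(fake_client, "ACME") == 101.25
    fake_client.get_quote.assert_called_once_with("ACME")


def test_fetch_latest_price_accepts_numeric_price():
    # Builds its own fake instead of reusing state from the test above,
    # so the two tests can run in any order or on their own.
    fake_client = Mock()
    fake_client.get_quote.return_value = {"price": 5}

    assert fetch_latest_price(fake_client, "ACME") == 5.0
```

Passing the client in as an argument is a design choice that makes the fake trivial to inject; code that constructs its client internally would need patching instead.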
How do I get started?
Don't try to test everything at once. Start small:
- Pick one important function that you rely on
- Write a simple test that checks it works with normal inputs
- Add a test for an edge case
- Run the tests and watch them pass
- Make a small change to your function and watch the tests catch any problems
The goal isn't perfect test coverage from day one. It's building confidence in your code, one test at a time.
Remember: tests aren't about catching every possible bug. They're about catching the bugs that would otherwise ruin your day. Start with the functions you care most about, and grow your test suite as your code grows.
Want to dive deeper? I've written a longer essay on testing for data scientists that goes into more detail on testing philosophy and advanced patterns.