How to write software tests

I'll be honest with you: I used to think testing was a waste of time. "I'll just be careful," I told myself. "I know my code works."

Then came graduate school, and my thesis project changed everything. I had built this complex analysis pipeline that took a full week to run from start to finish. Every time I wanted to make even a small change - maybe fix a bug or try a different approach - I had to run the entire pipeline to make sure I hadn't broken anything. One week to see if my change worked. One week to discover I'd introduced a bug three functions away from where I made the change.

I was paralyzed. I couldn't make progress because I didn't have confidence in my code. The feedback loop was so slow that refactoring became impossible. That's when I learned that testing isn't about being a perfectionist; it's about having the confidence to make your code better.

Everything below follows from that premise. Before I walk through the concrete patterns, there is one modern friction point that can undo the whole point of fast feedback.

Using AI coding agents for tests

AI coding agents are remarkably good at writing tests, but they need guardrails. In my AGENTS.md file (see Building repository memory with AGENTS.md), I always instruct agents never to modify a test just to make it pass; they must change the implementation to satisfy the test instead. Enforcing this norm means the agent builds a robust safety net rather than a suite of "cheating" tests.

What this chapter covers (and what it skips)

This chapter stays close to habits that keep code and data contracts honest from one day to the next: pytest, fixtures, schemas, and a test layout that survives refactoring. That is where most data scientists get the fastest return.

Production ML observability sits outside that frame. Drift monitors, shadow deployments, and online experiment design belong with serving and product loops, not with the baseline testing skills I wish every DS repository had. When the skills overview mentions validating models, it means targeted smoke checks and invariants on small data that you can run in CI; the dedicated section below spells out what that looks like. It does not mean a full MLOps monitoring program.

Later, in the workflow section, I tie the same local test commands to CI and to documentation that should move when behavior changes.

Why should I bother writing tests?

Stated plainly, the thesis I took from that pipeline is this: tests enable refactoring, and refactoring is how good code evolves. Without tests, your code freezes in its initial, imperfect shape because every edit feels like a blind bet.

Data science code has a sneaky way of growing more complex than you originally planned. You start with a simple analysis script, then you add a helper function, then another, and before you know it, you've got a mini-library that other parts of your project depend on.

As that happens, you stop writing code once and walking away. You swap in a cheaper algorithm, harden edge cases you ignored at first, or reshape an interface now that callers exist. Tests turn those edits into bounded experiments: a red bar localizes the mistake instead of leaving you to infer it from downstream weirdness.

Beyond catching regressions, I've found tests serve as documentation that actually stays up-to-date. When I come back to code I wrote six months ago, the tests remind me exactly how functions are supposed to behave.

How do I actually write tests?

The general pattern is beautifully simple:

Write a function that does something meaningful (not just wrapping another function, but actual work)
Test that function by giving it examples and checking that it produces the expected results
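As a minimal sketch of that pattern, assume a small module (hypothetically src/myproject/math_utils.py) with two helpers, add_numbers and calculate_mean, and a matching tests/test_math_utils.py. The file paths and the function bodies are illustrative assumptions; the names match the ones used in the command-line and parametrize examples later in this chapter.

```python
# Hypothetical module contents: src/myproject/math_utils.py
from typing import Sequence


def add_numbers(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b


def calculate_mean(values: Sequence[float]) -> float:
    """Return the arithmetic mean, rejecting empty input explicitly."""
    if len(values) == 0:
        raise ValueError("Cannot compute the mean of an empty sequence")
    return sum(values) / len(values)


# Hypothetical test file contents: tests/test_math_utils.py
def test_add_numbers():
    """Give the function an example and check the expected result."""
    assert add_numbers(2, 3) == 5


def test_calculate_mean():
    """A normal input produces the expected mean."""
    assert calculate_mean([1, 2, 3, 4, 5]) == 3.0
```

In a real project the two halves live in separate files and the test file imports the functions; they are shown together here only so the sketch runs on its own.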

My go-to test runner: pytest

I've tried various testing frameworks over the years, but I always come back to pytest. It's incredibly easy to get started with, yet powerful enough to handle complex testing scenarios as your projects grow.

Getting started is as simple as:

pixi add pytest  # or pip install pytest if you're using pip

What do real tests look like?

With pytest wired in, the useful part is what you prove with it. These blocks mirror the uneven shape that typical data science code takes once production and experimentation meet.

Testing data science functions

# In src/myproject/feature_engineering.py
from typing import List

import numpy as np


def normalize_features(values: List[float], method: str = "standard") -> List[float]:
    """Normalize a list of numerical features.

    Args:
        values: List of numerical values to normalize.
        method: 'standard' for z-score, 'minmax' for min-max scaling.

    Returns:
        List of normalized values.
    """
    if not values:
        return []

    values_array = np.array(values)

    if method == "standard":
        if np.std(values_array) == 0:
            return [0.0] * len(values)
        return ((values_array - np.mean(values_array)) / np.std(values_array)).tolist()
    elif method == "minmax":
        min_val, max_val = np.min(values_array), np.max(values_array)
        if min_val == max_val:
            return [0.0] * len(values)
        return ((values_array - min_val) / (max_val - min_val)).tolist()
    raise ValueError(f"Unknown normalization method: {method}")


def extract_time_features(timestamp: str) -> dict:
    """Extract time-based features from a timestamp.

    Args:
        timestamp: ISO format timestamp string.

    Returns:
        Dictionary with extracted features.
    """
    from datetime import datetime

    dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),
        "is_weekend": dt.weekday() >= 5,
        "quarter": (dt.month - 1) // 3 + 1,
    }

These functions are more representative of what we actually do - they handle edge cases, have multiple code paths, and involve some complexity. Here's how I test them:

# In tests/test_feature_engineering.py
import numpy as np
import pytest

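# Assumes src/ is on the import path (see the pytest configuration section) or the package is installed
from myproject.feature_engineering import extract_time_features, normalize_features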


def test_normalize_features_standard():
    """Test standard normalization."""
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    result = normalize_features(values, "standard")

    # Should have mean ~0 and std ~1
    assert abs(np.mean(result)) < 1e-10
    assert abs(np.std(result) - 1.0) < 1e-10


def test_normalize_features_minmax():
    """Test min-max normalization."""
    values = [10.0, 20.0, 30.0]
    result = normalize_features(values, "minmax")

    # Should be scaled to [0, 1]; approx avoids brittle exact float equality
    assert result == pytest.approx([0.0, 0.5, 1.0])


def test_normalize_features_edge_cases():
    """Test edge cases for normalization."""
    # Empty list
    assert normalize_features([]) == []

    # All same values
    assert normalize_features([5.0, 5.0, 5.0], "standard") == [0.0, 0.0, 0.0]
    assert normalize_features([5.0, 5.0, 5.0], "minmax") == [0.0, 0.0, 0.0]

    # Invalid method
    with pytest.raises(ValueError, match="Unknown normalization method"):
        normalize_features([1, 2, 3], "invalid")


def test_extract_time_features():
    """Test time feature extraction."""
    # Tuesday, January 3rd, 2023 at 2:30 PM
    timestamp = "2023-01-03T14:30:00Z"
    result = extract_time_features(timestamp)

    expected = {
        "hour": 14,
        "day_of_week": 1,  # Tuesday
        "is_weekend": False,
        "quarter": 1,
    }
    assert result == expected


def test_extract_time_features_weekend():
    """Test weekend detection."""
    # Saturday
    timestamp = "2023-01-07T10:00:00Z"
    result = extract_time_features(timestamp)

    assert result["is_weekend"] is True
    assert result["day_of_week"] == 5

Notice how these tests cover the kinds of issues that actually matter in data science: handling edge cases like empty inputs, all-identical values, and invalid parameters. This is the level of complexity where testing really pays off.

What about testing DataFrames?

You might notice I'm not showing examples of testing pandas DataFrames directly. That's intentional! For DataFrame validation - checking that your data has the right columns, types, and value ranges - I recommend using specialized tools like Pandera rather than writing custom tests.

Pandera lets you define schemas that validate your data automatically:

import pandera as pa
from pandera import Column, Check

# Define what your DataFrame should look like
schema = pa.DataFrameSchema({
    "age": Column(int, checks=Check.between(0, 120)),
    "salary": Column(float, checks=Check.greater_than(0)),
    "name": Column(str, checks=Check.str_length(1, 50)),
})

# Validate your DataFrame (df is assumed to be a pandas DataFrame you have already loaded)
validated_df = schema.validate(df)  # Raises a SchemaError if validation fails

Because the rules live in one schema, this is easier to maintain than assertion-based checks scattered across tests, and the same schema can validate data at runtime as well as in your test suite.

Using fixtures to avoid repetitive setup

One thing I learned the hard way is that creating test data over and over gets tedious. Pytest fixtures solve this beautifully:

# In tests/conftest.py
import pandas as pd
import pytest


@pytest.fixture
def sample_dataframe():
    """Create a sample DataFrame for testing."""
    return pd.DataFrame({
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "salary": [50000, 60000, 70000],
    })


@pytest.fixture
def messy_dataframe():
    """Create a DataFrame with missing values."""
    return pd.DataFrame({
        "A": [1, 2, None, 4, 5],
        "B": [None, 2, 3, 4, None],
        "C": [1, 2, 3, 4, 5],
    })

Now I can use these fixtures in any test:

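# In tests/test_with_fixtures.py
# Assumes clean_dataframe (a function that drops rows containing NaN) is imported from your package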
def test_with_sample_data(sample_dataframe):
    """Test using the sample DataFrame fixture."""
    assert len(sample_dataframe) == 3
    assert "name" in sample_dataframe.columns
    assert sample_dataframe["age"].mean() == 30


def test_cleaning_with_messy_data(messy_dataframe):
    """Test cleaning with messy data fixture."""
    clean_df = clean_dataframe(messy_dataframe)
    assert len(clean_df) == 2  # Only 2 rows have no NaN values
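The test above relies on a clean_dataframe function that this chapter does not show. A minimal sketch, assuming it simply drops every row that contains a missing value, might look like the following; your real cleaning logic will likely do more.

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with any missing value and renumber the index.

    Assumed behavior, for illustration only.
    """
    return df.dropna().reset_index(drop=True)


# Same data as the messy_dataframe fixture, built inline so this runs on its own
messy = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [None, 2, 3, 4, None],
    "C": [1, 2, 3, 4, 5],
})
clean = clean_dataframe(messy)
assert len(clean) == 2  # the second and fourth rows are the only complete ones
assert clean["C"].tolist() == [2, 4]
```

Returning a new frame instead of mutating the input keeps the fixture data intact for any other assertion in the same test.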

When those table-shaped setups repeat across files, pytest fixtures prevent silent drift; once the setup is centralized, it's easier to articulate what you assume that data guarantees.

What about testing data assumptions?

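This is something I wish I'd learned earlier: you should test not just your functions, but your assumptions about the data itself. For a handful of quick checks inside the test suite, plain assertions are enough; once the rules grow, move them into a Pandera schema like the one shown earlier: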

def test_data_validation(sample_dataframe):
    """Test that data meets expected criteria."""
    # Test data types
    assert sample_dataframe["age"].dtype == "int64"
    assert sample_dataframe["salary"].dtype == "int64"

    # Test value ranges
    assert (sample_dataframe["age"] >= 0).all()
    assert (sample_dataframe["salary"] >= 0).all()

    # Test for required columns
    required_columns = ["name", "age", "salary"]
    assert all(col in sample_dataframe.columns for col in required_columns)

    # Test for no missing values in critical columns
    assert sample_dataframe["name"].notna().all()

These kinds of tests have saved me countless times when data sources change unexpectedly.

Reproducibility habits that tests can enforce

Data validation catches bad inputs; randomness and floating-point arithmetic cause a different failure mode, the kind that shows up as "it passed on my laptop" and as red CI with no obvious culprit. A few patterns cover most of what I actually need in a data science repo:

Controlled randomness in fixtures. If a test trains anything stochastic, construct a numpy.random.Generator (or equivalent) with a fixed seed inside the fixture and inject it into the code under test when the API allows it. If it does not, set the seed at the start of that test module and accept that you are testing one trajectory; document that choice.
Float comparisons. Prefer numpy.testing.assert_allclose (or pytest.approx) over exact equality on floats, and state the tolerance explicitly.
Same environment shape as CI. When tests fail only in CI, the usual suspects are dependency versions, missing env vars, and filesystem paths. Pinning environments (for example with Pixi or a lockfile) and running pixi run test locally before pushing closes most of that gap; the project chapter on repository structure ties those files to how the team works together.
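A minimal sketch of the first two habits, assuming a hypothetical bootstrap_mean function that accepts a numpy.random.Generator. The fixture name rng and the seed value 42 are illustrative choices, not a convention from this project.

```python
import numpy as np
import pytest


def bootstrap_mean(values: np.ndarray, rng: np.random.Generator, n_resamples: int = 200) -> float:
    """Hypothetical stochastic function: average of bootstrap resample means."""
    n = len(values)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample indices with replacement
    return float(values[idx].mean(axis=1).mean())


@pytest.fixture
def rng() -> np.random.Generator:
    """Fresh, seeded generator per test so results do not depend on test order."""
    return np.random.default_rng(seed=42)


def test_bootstrap_mean_is_reproducible(rng):
    values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    first = bootstrap_mean(values, rng)
    # Rebuilding a generator from the same seed replays the same trajectory
    second = bootstrap_mean(values, np.random.default_rng(seed=42))
    np.testing.assert_allclose(first, second, rtol=1e-12)


def test_float_comparison_uses_tolerance():
    # 0.1 + 0.2 is not exactly 0.3 in binary floating point
    assert 0.1 + 0.2 != 0.3
    assert 0.1 + 0.2 == pytest.approx(0.3)
    np.testing.assert_allclose(0.1 + 0.2, 0.3, rtol=1e-9)
```

Injecting the generator, rather than calling a global seed function, keeps each test independent of whatever ran before it.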

None of this replaces a formal reproducibility study for a paper, but it does keep your automated checks from lying to you.

Smoke checks for models and training code

Even with reproducibility handled, you still want proof that fit and predict run end to end. Once you have a models/ package (or a training script), tests can stay small and fast if you treat "fit + predict on toy data" as a contract, not as a full evaluation of model quality.

Useful checks I have actually kept green in CI:

Train on a tiny slice (dozens of rows, few features) and assert the training step completes without NaNs in loss or metrics you log.
Thresholds on toy data that are loose enough to survive harmless refactors but tight enough to catch broken wiring; for example accuracy strictly above chance on a linearly separable micro-dataset, or RMSE below a generous ceiling on synthetic noise.
Shape and type contracts on outputs from predict or transform: expected number of rows, column names for DataFrame outputs, dtypes for arrays.
Deterministic inference path when the model is fixed: save a small on-disk fixture model or coefficients once, load in tests, and assert predictions match reference values within tolerance.
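To make those checks concrete, here is a minimal sketch that uses a hypothetical TinyLinearModel, built on numpy's least-squares solver, as a stand-in for whatever lives in your models/ package. The class, the synthetic data, and the RMSE ceiling of 0.5 are all assumptions chosen for illustration.

```python
import numpy as np


class TinyLinearModel:
    """Hypothetical stand-in for a real model: ordinary least squares."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TinyLinearModel":
        X_b = np.column_stack([np.ones(len(X)), X])  # add intercept column
        self.coef_, *_ = np.linalg.lstsq(X_b, y, rcond=None)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        X_b = np.column_stack([np.ones(len(X)), X])
        return X_b @ self.coef_


def make_toy_data(seed: int = 0, n_rows: int = 40):
    """Synthetic regression data: known linear signal plus small noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 3))
    y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_rows)
    return X, y


def test_fit_predict_smoke():
    X, y = make_toy_data()
    preds = TinyLinearModel().fit(X, y).predict(X)

    # Shape and type contract on the output
    assert preds.shape == (len(X),)
    assert preds.dtype == np.float64
    # No NaNs or infinities leaked out of training
    assert np.isfinite(preds).all()
    # Generous ceiling: catches broken wiring, survives harmless refactors
    rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
    assert rmse < 0.5
```

The ceiling sits well above the injected noise level of 0.1, so the test fails when the wiring breaks (for example, a dropped intercept) and stays green through harmless numerical changes.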

I still avoid checking in huge golden prediction files unless the team agrees to maintain them; a few numbers from a minimal example beat a brittle blob.

Don't forget to test error conditions

Your functions should fail gracefully, and you should test that they do:

import pytest


def divide_numbers(a, b):
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


def test_divide_by_zero():
    """Test that division by zero raises appropriate error."""
    with pytest.raises(ValueError, match="Cannot divide by zero"):
        divide_numbers(10, 0)


def test_divide_normal():
    """Test normal division."""
    assert divide_numbers(10, 2) == 5.0

How do I actually run these tests?

The beauty of pytest is its simplicity:

# Run all tests
pytest

# Run tests with more details
pytest -v

# Run just one test file
pytest tests/test_math_utils.py

# Run just one specific test
pytest tests/test_math_utils.py::test_add_numbers

Configuring pytest for your project

I like to configure pytest once and forget about it. You can do this in your pyproject.toml:

[tool.pytest.ini_options]
addopts = "-v --tb=short"

This tells pytest how to behave on every run. If you use an import layout like src/myproject, also set pythonpath = ["src"] in the same block (available in pytest 7 and later), or install the package in editable mode, so imports resolve cleanly.

What are some advanced patterns worth knowing?

Parametrized tests - testing multiple scenarios efficiently

Sometimes you want to test the same function with many different inputs. Parametrized tests are perfect for this:

import pytest

# Assumes add_numbers is imported from the module under test


@pytest.mark.parametrize("a,b,expected", [
    (1, 2, 3),
    (0, 0, 0),
    (-1, 1, 0),
    (10, -5, 5),
])
def test_add_numbers_parametrized(a, b, expected):
    """Test add_numbers with multiple parameter sets."""
    assert add_numbers(a, b) == expected

This runs four separate tests with different inputs, but with much less code duplication.

Testing file operations

Sometimes you need to test functions that read or write files. Here's a safe way to do it:

import os
import tempfile


def test_file_processing():
    """Test function that processes files."""
    with tempfile.NamedTemporaryFile(mode="w", delete=False) as f:
        f.write("test,data\n1,2\n3,4\n")
        temp_file = f.name

    try:
        # Test your file processing function
        result = process_csv_file(temp_file)
        assert len(result) == 2
    finally:
        os.unlink(temp_file)

This creates a temporary file, tests your function with it, then removes the file in the finally block, so cleanup happens even when the assertion fails.
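If you would rather skip the manual try/finally, pytest's built-in tmp_path fixture hands each test a fresh pathlib.Path directory; pytest creates it and prunes old ones itself. The sketch below assumes a hypothetical process_csv_file that returns one dict per data row; substitute your own function.

```python
import csv
from pathlib import Path
from typing import Dict, List


def process_csv_file(path) -> List[Dict[str, str]]:
    """Hypothetical function under test: read a CSV into a list of row dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def test_file_processing_with_tmp_path(tmp_path: Path):
    """pytest injects tmp_path; no explicit cleanup code is needed."""
    csv_file = tmp_path / "input.csv"
    csv_file.write_text("test,data\n1,2\n3,4\n")

    result = process_csv_file(csv_file)

    assert len(result) == 2
    assert result[0] == {"test": "1", "data": "2"}
```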

How do I organize my tests?

I've found it's easiest to mirror your source code structure in your test directory. This makes it obvious where to find tests for any given module:

src/
└── myproject/
    ├── __init__.py
    ├── feature_engineering.py
    ├── models/
    │   ├── __init__.py
    │   ├── linear_model.py
    │   └── tree_model.py
    └── utils/
        ├── __init__.py
        ├── data_loader.py
        └── metrics.py

tests/
├── conftest.py                    # Shared fixtures
├── test_feature_engineering.py   # Tests for feature_engineering.py
├── models/
│   ├── test_linear_model.py       # Tests for models/linear_model.py
│   └── test_tree_model.py         # Tests for models/tree_model.py
├── utils/
│   ├── test_data_loader.py        # Tests for utils/data_loader.py
│   └── test_metrics.py            # Tests for utils/metrics.py
└── integration/
    └── test_full_pipeline.py       # End-to-end tests

This structure makes it immediately clear which test file corresponds to which source module. When I'm working on src/myproject/models/linear_model.py, I know exactly where to find its tests: tests/models/test_linear_model.py.

I put shared fixtures in conftest.py files at the appropriate level - the root conftest.py for project-wide fixtures, and subdirectory conftest.py files for module-specific shared fixtures.

After folders make ownership obvious, the remaining readability wins come from how you name behaviors and arrange each test body.

What about test naming?

I use descriptive names that tell me exactly what's being tested:

test_calculate_mean_with_empty_list
test_clean_dataframe_removes_nan_rows
test_divide_by_zero_raises_error

The pattern I follow is test_<function_name>_<condition>_<expected_result>, dropping whichever part is obvious from context, as the examples above do.

How do I structure individual tests?

I follow the AAA pattern - Arrange, Act, Assert:

def test_function_name():
    """Test description."""
    # Arrange: Set up test data
    data = [1, 2, 3, 4, 5]

    # Act: Execute the function
    result = calculate_mean(data)

    # Assert: Check the result
    assert result == 3.0

This makes tests easy to read and understand.

How do I know if I'm testing enough?

I use pytest-cov to check test coverage:

# Install coverage
pixi add pytest-cov

# Run tests with coverage
pytest --cov=src

# Generate a nice HTML report
pytest --cov=src --cov-report=html

But here's the thing: 100% coverage doesn't mean your tests are perfect. It just means every line of code gets executed. Focus on testing the behavior that matters.
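A small sketch of why: the hypothetical percent_change function below reaches 100% line coverage from a single happy-path test, yet still crashes on an input that test never tried.

```python
import pytest


def percent_change(old: float, new: float) -> float:
    """Hypothetical helper with an unhandled edge case (old == 0)."""
    return (new - old) / old * 100


def test_percent_change_happy_path():
    # This one test executes every line, so coverage reports 100%
    assert percent_change(50, 75) == 50.0


def test_percent_change_zero_baseline_is_unhandled():
    # Coverage did not reveal this: a zero baseline raises ZeroDivisionError
    with pytest.raises(ZeroDivisionError):
        percent_change(0, 10)
```

The second test documents the gap; whether the right fix is a clearer error or a sentinel value is a design decision that coverage numbers cannot make for you.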

How does this fit into my development workflow?

I add tasks to pyproject.toml under Pixi's [tool.pixi.tasks] block so I can run tests without remembering long commands:

[tool.pixi.tasks]
test = { cmd = "pytest" }
test-cov = { cmd = "pytest --cov=src --cov-report=term-missing" }

Then:

pixi run test

That same convention ports cleanly into CI on every pull request. This is the place where the scope note above becomes concrete: for how those workflows are structured and what belongs in YAML versus in narrative docs, see Use CI/CD to automate tasks and Store your project documentation in your project repository.

What mistakes should I avoid?

Here are the testing antipatterns I've learned to avoid:

Testing implementation details - Test what the function does, not how it does it
Overly complex tests - If your test is hard to understand, it's probably testing too much
Testing external dependencies - Mock APIs and databases; test your code, not theirs
Skipping edge cases - Empty inputs, zero values, negative numbers - these are where bugs hide
Tests that depend on each other - Each test should be able to run independently
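For the external-dependency item, here is a minimal sketch using the standard library's unittest.mock. The convert_to_usd function and the client's get_rate method are hypothetical; the point is that the test controls what the dependency returns and never touches the network.

```python
from unittest.mock import Mock


def convert_to_usd(amount: float, currency: str, client) -> float:
    """Hypothetical function under test: the rate lookup is delegated to client."""
    rate = client.get_rate(currency, "USD")
    return round(amount * rate, 2)


def test_convert_to_usd_uses_rate_from_client():
    # Arrange: a stand-in client with a canned response, no real API call
    fake_client = Mock()
    fake_client.get_rate.return_value = 1.25

    # Act
    result = convert_to_usd(100.0, "EUR", fake_client)

    # Assert: check our arithmetic and the request we sent to the dependency
    assert result == 125.0
    fake_client.get_rate.assert_called_once_with("EUR", "USD")
```

Passing the client in as an argument is what makes the swap possible; a function that constructs its own API client internally is much harder to test this way.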

How do I get started?

Don't try to test everything at once. Start small:

Pick one important function that you rely on
Write a simple test that checks it works with normal inputs
Add a test for an edge case
Run the tests and watch them pass
Make a small change to your function and watch the tests catch any problems

The goal isn't perfect test coverage from day one. It's building confidence in your code, one test at a time.

Remember: tests aren't about catching every possible bug. They're about catching the bugs that would otherwise ruin your day. Start with the functions you care most about, and grow your test suite as your code grows.

Want to dive deeper? I've written a longer essay on testing for data scientists that goes into more detail on testing philosophy and advanced patterns.