How to write software tests

I'll be honest with you: I used to think testing was a waste of time. "I'll just be careful," I told myself. "I know my code works."

Then came graduate school, and my thesis project changed everything. I had built this complex analysis pipeline that took a full week to run from start to finish. Every time I wanted to make even a small change - maybe fix a bug or try a different approach - I had to run the entire pipeline to make sure I hadn't broken anything. One week to see if my change worked. One week to discover I'd introduced a bug three functions away from where I made the change.

I was paralyzed. I couldn't make progress because I didn't have confidence in my code. The feedback loop was so slow that refactoring became impossible. That's when I learned that testing isn't about being a perfectionist - it's about having the confidence to make your code better.

Why should I bother writing tests?

Here's what I learned from that painful graduate school experience: tests enable refactoring, and refactoring is how good code evolves. Without tests, your code becomes frozen in its initial, imperfect state because changing anything feels too risky.

Data science code has a sneaky way of growing more complex than you originally planned. You start with a simple analysis script, then you add a helper function, then another, and before you know it, you've got a mini-library that other parts of your project depend on.

As you develop this codebase, you'll inevitably need to modify existing code. Maybe you find a more efficient algorithm, or you need to handle a new edge case, or you realize your initial approach was naive. Without tests, every change becomes a game of "did I break something else?" With tests, you can refactor fearlessly, knowing that if you break something, you'll know immediately - not after waiting a week.

Beyond catching bugs, I've found tests serve as documentation that actually stays up-to-date. When I come back to code I wrote six months ago, the tests remind me exactly how functions are supposed to behave.

How do I actually write tests?

The general pattern is beautifully simple:

Write a function that does something meaningful (not just wrapping another function, but actual work)
Test that function by giving it examples and checking that it produces the expected results

My go-to test runner: pytest

I've tried various testing frameworks over the years, but I always come back to pytest. It's incredibly easy to get started with, yet powerful enough to handle complex testing scenarios as your projects grow.

Getting started is as simple as:

pixi add pytest  # or pip install pytest if you're using pip

What do real tests look like?

Let me show you some actual examples from the kinds of functions I write in data science projects.

Testing data science functions

Let me show you some realistic examples - the kind of functions you'd actually write in a data science project:

# In src/myproject/feature_engineering.py
import numpy as np from typing import List, Optional

def normalize_features(values: List[float], method: str = "standard") -> List[float]:
    """Normalize a list of numerical features.

    Args:
        values: List of numerical values to normalize
        method: 'standard' for z-score, 'minmax' for min-max scaling

    Returns:
        List of normalized values
    """
    if not values:
        return []

    values_array = np.array(values)

    if method == "standard":
        if np.std(values_array) == 0:
            return [0.0] * len(values)
        return ((values_array - np.mean(values_array)) / np.std(values_array)).tolist()
    elif method == "minmax":
        min_val, max_val = np.min(values_array), np.max(values_array)
        if min_val == max_val:
            return [0.0] * len(values)
        return ((values_array - min_val) / (max_val - min_val)).tolist()
    else:
        raise ValueError(f"Unknown normalization method: {method}")

def extract_time_features(timestamp: str) -> dict:
    """Extract time-based features from a timestamp.

    Args:
        timestamp: ISO format timestamp string

    Returns:
        Dictionary with extracted features
    """
    from datetime import datetime

    dt = datetime.fromisoformat(timestamp.replace('Z', '+00:00'))

    return {
        'hour': dt.hour,
        'day_of_week': dt.weekday(),
        'is_weekend': dt.weekday() >= 5,
        'quarter': (dt.month - 1) // 3 + 1
    }

These functions are more representative of what we actually do - they handle edge cases, have multiple code paths, and involve some complexity. Here's how I test them:

# In tests/test_feature_engineering.py
import pytest import numpy as np from src.myproject.feature_engineering import normalize_features, extract_time_features

def test_normalize_features_standard():
    """Test standard normalization."""
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    result = normalize_features(values, "standard")

    # Should have mean ~0 and std ~1
    assert abs(np.mean(result)) < 1e-10
    assert abs(np.std(result) - 1.0) < 1e-10

def test_normalize_features_minmax():
    """Test min-max normalization."""
    values = [10.0, 20.0, 30.0]
    result = normalize_features(values, "minmax")

    # Should be scaled to [0, 1]
    assert result == [0.0, 0.5, 1.0]

def test_normalize_features_edge_cases():
    """Test edge cases for normalization."""
    # Empty list
    assert normalize_features([]) == []

    # All same values
    assert normalize_features([5.0, 5.0, 5.0], "standard") == [0.0, 0.0, 0.0]
    assert normalize_features([5.0, 5.0, 5.0], "minmax") == [0.0, 0.0, 0.0]

    # Invalid method
    with pytest.raises(ValueError, match="Unknown normalization method"):
        normalize_features([1, 2, 3], "invalid")

def test_extract_time_features():
    """Test time feature extraction."""
    # Tuesday, January 3rd, 2023 at 2:30 PM
    timestamp = "2023-01-03T14:30:00Z"
    result = extract_time_features(timestamp)

    expected = {
        'hour': 14,
        'day_of_week': 1,  # Tuesday
        'is_weekend': False,
        'quarter': 1
    }
    assert result == expected

def test_extract_time_features_weekend():
    """Test weekend detection."""
    # Saturday
    timestamp = "2023-01-07T10:00:00Z"
    result = extract_time_features(timestamp)

    assert result['is_weekend'] is True
    assert result['day_of_week'] == 5

Notice how these tests cover the kinds of issues that actually matter in data science: handling edge cases like empty inputs, all-identical values, and invalid parameters. This is the level of complexity where testing really pays off.

What about testing DataFrames?

You might notice I'm not showing examples of testing pandas DataFrames directly. That's intentional! For DataFrame validation - checking that your data has the right columns, types, and value ranges - I recommend using specialized tools like Pandera rather than writing custom tests.

Pandera lets you define schemas that validate your data automatically:

import pandera as pa from pandera import Column, Check

# Define what your DataFrame should look like
schema = pa.DataFrameSchema({
    "age": Column(int, checks=Check.between(0, 120)),
    "salary": Column(float, checks=Check.greater_than(0)),
    "name": Column(str, checks=Check.str_length(1, 50))
})

# Validate your DataFrame
validated_df = schema(df)  # Raises error if validation fails

This is much more powerful and maintainable than writing custom assertion-based tests for data validation.

Using fixtures to avoid repetitive setup

One thing I learned the hard way is that creating test data over and over gets tedious. Pytest fixtures solve this beautifully:

# In tests/conftest.py
import pytest import pandas as pd

@pytest.fixture def sample_dataframe():
    """Create a sample DataFrame for testing."""
    return pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'salary': [50000, 60000, 70000]
    })

@pytest.fixture def messy_dataframe():
    """Create a DataFrame with missing values."""
    return pd.DataFrame({
        'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, 4, None],
        'C': [1, 2, 3, 4, 5]
    })

Now I can use these fixtures in any test:

# In tests/test_with_fixtures.py
def test_with_sample_data(sample_dataframe):
    """Test using the sample DataFrame fixture."""
    assert len(sample_dataframe) == 3
    assert 'name' in sample_dataframe.columns
    assert sample_dataframe['age'].mean() == 30

def test_cleaning_with_messy_data(messy_dataframe):
    """Test cleaning with messy data fixture."""
    clean_df = clean_dataframe(messy_dataframe)
    assert len(clean_df) == 2  # Only 2 rows have no NaN values

What about testing data assumptions?

This is something I wish I'd learned earlier: you should test not just your functions, but your assumptions about the data itself:

def test_data_validation(sample_dataframe):
    """Test that data meets expected criteria."""
    # Test data types
    assert sample_dataframe['age'].dtype == 'int64'
    assert sample_dataframe['salary'].dtype == 'int64'

    # Test value ranges
    assert (sample_dataframe['age'] >= 0).all()
    assert (sample_dataframe['salary'] >= 0).all()

    # Test for required columns
    required_columns = ['name', 'age', 'salary']
    assert all(col in sample_dataframe.columns for col in required_columns)

    # Test for no missing values in critical columns
    assert sample_dataframe['name'].notna().all()

These kinds of tests have saved me countless times when data sources change unexpectedly.

Don't forget to test error conditions

Your functions should fail gracefully, and you should test that they do:

def divide_numbers(a, b):
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def test_divide_by_zero():
    """Test that division by zero raises appropriate error."""
    with pytest.raises(ValueError, match="Cannot divide by zero"):
        divide_numbers(10, 0)

def test_divide_normal():
    """Test normal division."""
    assert divide_numbers(10, 2) == 5.0

How do I actually run these tests?

The beauty of pytest is its simplicity:

# Run all tests
pytest

# Run tests with more details
pytest -v

# Run just one test file
pytest tests/test_math_utils.py

# Run just one specific test
pytest tests/test_math_utils.py::test_add_numbers

Configuring pytest for your project

I like to configure pytest once and forget about it. You can do this in your pyproject.toml:

[tool.pytest.ini_options] addopts = "-v --tb=short"

This tells pytest where to find tests and how to run them.

What are some advanced patterns worth knowing?

Parametrized tests - testing multiple scenarios efficiently

Sometimes you want to test the same function with many different inputs. Parametrized tests are perfect for this:

@pytest.mark.parametrize("a,b,expected", [
    (1, 2, 3),
    (0, 0, 0),
    (-1, 1, 0),
    (10, -5, 5)
]) def test_add_numbers_parametrized(a, b, expected):
    """Test add_numbers with multiple parameter sets."""
    assert add_numbers(a, b) == expected

This runs four separate tests with different inputs, but with much less code duplication.

Testing file operations

Sometimes you need to test functions that read or write files. Here's a safe way to do it:

import tempfile import os

def test_file_processing():
    """Test function that processes files."""
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as f:
        f.write("test,data\n1,2\n3,4\n")
        temp_file = f.name

    try:
        # Test your file processing function
        result = process_csv_file(temp_file)
        assert len(result) == 2
    finally:
        os.unlink(temp_file)

This creates a temporary file, tests your function with it, then cleans up automatically.

How do I organize my tests?

I've found it's easiest to mirror your source code structure in your test directory. This makes it obvious where to find tests for any given module:

src/ └── myproject/
    ├── __init__.py
    ├── feature_engineering.py
    ├── models/
    │   ├── __init__.py
    │   ├── linear_model.py
    │   └── tree_model.py
    └── utils/
        ├── __init__.py
        ├── data_loader.py
        └── metrics.py

tests/ ├── conftest.py                    # Shared fixtures ├── test_feature_engineering.py   # Tests for feature_engineering.py ├── models/ │   ├── test_linear_model.py       # Tests for models/linear_model.py │   └── test_tree_model.py         # Tests for models/tree_model.py ├── utils/ │   ├── test_data_loader.py        # Tests for utils/data_loader.py │   └── test_metrics.py            # Tests for utils/metrics.py └── integration/
    └── test_full_pipeline.py       # End-to-end tests

This structure makes it immediately clear which test file corresponds to which source module. When I'm working on src/myproject/models/linear_model.py, I know exactly where to find its tests: tests/models/test_linear_model.py.

I put shared fixtures in conftest.py files at the appropriate level - the root conftest.py for project-wide fixtures, and subdirectory conftest.py files for module-specific shared fixtures.

What about test naming?

I use descriptive names that tell me exactly what's being tested:

test_calculate_mean_with_empty_list
test_clean_dataframe_removes_nan_rows
test_divide_by_zero_raises_error

The pattern I follow is: test_function_name_when_condition_then_expected_result

How do I structure individual tests?

I follow the AAA pattern - Arrange, Act, Assert:

def test_function_name():
    """Test description."""
    # Arrange: Set up test data
    data = [1, 2, 3, 4, 5]

    # Act: Execute the function
    result = calculate_mean(data)

    # Assert: Check the result
    assert result == 3.0

This makes tests easy to read and understand.

How do I know if I'm testing enough?

I use pytest-cov to check test coverage:

# Install coverage
pixi add pytest-cov

# Run tests with coverage
pytest --cov=src

# Generate a nice HTML report
pytest --cov=src --cov-report=html

But here's the thing: 100% coverage doesn't mean your tests are perfect. It just means every line of code gets executed. Focus on testing the behavior that matters.

How does this fit into my development workflow?

I add testing to my pixi.toml file so I can run tests easily:

[tasks] test = "pytest" test-cov = "pytest --cov=src --cov-report=term-missing"

Then I can just run:

pixi run test

And in my CI pipeline, this same command runs automatically on every commit.

What mistakes should I avoid?

Here are the testing antipatterns I've learned to avoid:

Testing implementation details - Test what the function does, not how it does it
Overly complex tests - If your test is hard to understand, it's probably testing too much
Testing external dependencies - Mock APIs and databases; test your code, not theirs
Skipping edge cases - Empty inputs, zero values, negative numbers - these are where bugs hide
Tests that depend on each other - Each test should be able to run independently

How do I get started?

Don't try to test everything at once. Start small:

Pick one important function that you rely on
Write a simple test that checks it works with normal inputs
Add a test for an edge case
Run the tests and watch them pass
Make a small change to your function and watch the tests catch any problems

The goal isn't perfect test coverage from day one. It's building confidence in your code, one test at a time.

Remember: tests aren't about catching every possible bug. They're about catching the bugs that would otherwise ruin your day. Start with the functions you care most about, and grow your test suite as your code grows.

Want to dive deeper? I've written a longer essay on testing for data scientists that goes into more detail on testing philosophy and advanced patterns.