Use data catalogs to manage data
Here's what I've found works best for managing data in data science projects: treat your data access like a well-organized library. Instead of scattered scripts that load data in different ways, I create data catalogs - Python modules with clean, importable functions that handle all the messy details of data access.
What are data catalogs?
I think of data catalogs as a single source of truth for all your data assets. They're Python modules with functions like load_customer_data() or load_2024_sales_data() that encapsulate everything needed to access your data. The beauty of this approach is that you never have to remember connection strings, file paths, or data preprocessing steps - it's all handled in one place.
Behind these functions are the configurations needed to access the data, following 12-factor app principles. The docstring provides the description of the data, giving you both the data and its metadata in a single place.
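To make this concrete, here's a minimal sketch of what consuming a catalog looks like from the analysis side. The module name data_catalog is just an assumption for illustration; the two functions it imports are defined in the examples below.

# analysis.py - no connection strings, file paths, or cleanup logic here
from data_catalog import load_customer_data, load_2024_sales_data

customers = load_customer_data()
sales = load_2024_sales_data()

# Analysis code stays focused on analysis; the plumbing lives in the catalog
top_customers = customers.sort_values('total_orders', ascending=False).head(10)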
Traditional data catalog examples
Loading from a database
Here's how I handle database connections. No more scattered connection strings or hardcoded credentials:
import os
import psycopg2
import pandas as pd
def load_customer_data():
    """
    Loads customer data from the production database.

    Returns:
        DataFrame: Customer data with columns:
            - customer_id (int): Unique customer identifier
            - name (str): Customer full name
            - email (str): Customer email address
            - signup_date (datetime): When customer signed up
            - total_orders (int): Total number of orders placed
    """
    connection = psycopg2.connect(
        dbname=os.getenv('DB_NAME'),
        user=os.getenv('DB_USER'),
        password=os.getenv('DB_PASSWORD'),
        host=os.getenv('DB_HOST'),
        port=os.getenv('DB_PORT')
    )
    query = """
        SELECT
            customer_id,
            name,
            email,
            signup_date,
            total_orders
        FROM customers
        WHERE active = true
    """
    df = pd.read_sql_query(query, connection)
    connection.close()
    return df
Processing files into a unified dataset
When I need to combine multiple data sources, I handle all the complexity in the catalog function:
import pandas as pd
from pathlib import Path
def load_2024_sales_data():
    """
    Loads and combines Q1 sales data from multiple sources.

    Returns:
        DataFrame: Combined sales data with columns:
            - date (datetime): Sale date
            - product_id (str): Product identifier
            - quantity (int): Units sold
            - revenue (float): Total revenue in USD
            - channel (str): Sales channel (online/retail)
    """
    data_dir = Path("data/sales/2024")

    # Load from different sources
    online_df = pd.read_csv(data_dir / "online_sales_q1.csv")
    retail_df = pd.read_csv(data_dir / "retail_sales_q1.csv")

    # Label each source with its sales channel
    online_df['channel'] = 'online'
    retail_df['channel'] = 'retail'

    # Combine and clean
    combined_df = pd.concat([online_df, retail_df], ignore_index=True)
    combined_df['date'] = pd.to_datetime(combined_df['date'])

    return combined_df
Modern ML data storage with xarray and zarr
The future of ML data splits
Most data scientists still use pickle files for storing train/test/validation splits. But here's what I've discovered: xarray datasets are the natural data structure for ML data splits.
Here's why this approach is revolutionary for ML workflows:
Why xarray beats pickle for ML data
Think about what you're actually storing when you create data splits:
- Images or features (the X data)
- Labels (the y data)
- The split assignment (train/test/val)
Traditional approaches scatter this information across multiple files or lose the split information entirely. With xarray, you store everything together:
import numpy as np
import pandas as pd
import xarray as xr

def create_ml_dataset_splits(images, labels, split_assignments):
    """
    Creates a unified xarray dataset containing images, labels, and split assignments.

    Args:
        images (np.array): Image data (n_samples, height, width, channels)
        labels (np.array): Label data (n_samples,)
        split_assignments (np.array): Split labels ['train', 'test', 'val']

    Returns:
        xarray.Dataset: Unified dataset with all ML data and metadata
    """
    ds = xr.Dataset({
        'images': (['sample', 'height', 'width', 'channel'], images),
        'labels': (['sample'], labels),
        'split': (['sample'], split_assignments),
    })

    # Add metadata
    ds.attrs['created_date'] = pd.Timestamp.now().isoformat()
    ds.attrs['n_classes'] = len(np.unique(labels))
    ds.attrs['image_shape'] = images.shape[1:]

    return ds
def load_training_data():
    """
    Loads training split from the ML dataset.

    Returns:
        tuple: (X_train, y_train) ready for model training
    """
    # Load from S3/cloud storage if using zarr format
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Filter to training data only
    train_data = ds.where(ds.split == 'train', drop=True)

    return train_data.images.values, train_data.labels.values
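One thing the catalog function above glosses over is how the dataset got to S3 in the first place. Here's a rough sketch of that step, assuming the same bucket path as above; writing to an s3:// URL requires the s3fs package, and a local directory path works just as well while developing.

def save_ml_dataset(ds, path='s3://my-bucket/ml-dataset.zarr'):
    """
    Persists the unified xarray dataset to zarr storage.

    Chunking along the sample dimension first (via ds.chunk, which requires
    dask) is optional but makes later per-split and per-batch reads cheaper.
    """
    ds.to_zarr(path, mode='w')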
The zarr advantage for cloud storage
When you store xarray datasets in zarr format, you get cloud-native data access. This means you can query your data splits directly from S3 without downloading entire datasets:
import numpy as np
import xarray as xr

def load_validation_batch(batch_size=32):
    """
    Loads a random batch of validation data directly from S3.

    No need to download the entire dataset - zarr handles the streaming.
    """
    # Open dataset directly from S3
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Get validation data
    val_data = ds.where(ds.split == 'val', drop=True)

    # Random sampling without replacement along the sample dimension
    indices = np.random.choice(val_data.sizes['sample'], batch_size, replace=False)
    batch = val_data.isel(sample=indices)

    return batch.images.values, batch.labels.values
What are the advantages of this approach?
- Single source of truth: Data, labels, and split assignments live together
- Cloud-native: Query directly from S3 without downloading
- Metadata included: Dataset shape, creation date, class information all stored together
- Scalable: Works with datasets too large to fit in memory
- Reproducible: Split assignments are permanent and queryable
I've lost count of how many times I've seen teams lose track of which samples were in which split. This approach eliminates that problem entirely.
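Because the split assignments live in the dataset itself, auditing them is a one-liner. A quick sketch, reusing the same bucket path as above:

import numpy as np
import xarray as xr

ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

# The split assignment is stored alongside the data, so anyone on the team
# can check exactly how many samples landed in each split
splits, counts = np.unique(ds.split.values, return_counts=True)
print(dict(zip(splits, counts)))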
What are the advantages of data catalogs?
The wins
- Single source of truth: One place to find all your data access patterns
- Clean abstraction: Your analysis code stays focused on analysis, not data plumbing
- Scalability: Adding new data sources doesn't break existing code
- Documentation: Docstrings provide data dictionaries alongside the data
- Configuration management: Environment variables handle different deployment contexts
- Reproducibility: Same function call works across different environments
The challenges
- Initial setup cost: Takes time to build the catalog functions upfront
- Dependency management: Need to handle database drivers, cloud credentials, etc.
- Version management: Data schemas can change over time
- Team coordination: Everyone needs to use the catalog instead of ad-hoc data loading
When should you use data catalogs?
Just-in-time adoption of a practice is preferable to over-engineering. Start with data catalogs when:
- You're loading the same data in multiple places
- Your team is growing and needs consistent data access
- You're dealing with complex data preprocessing
- You want to separate data access from analysis logic
Remember: the goal isn't to replace simple pd.read_csv() calls, but to amplify your team's ability to work with complex, multi-source data consistently.
Time will distill the best practices for your specific context, but this approach gives you a solid foundation for scalable data management.