Use data catalogs to manage data
Here's what I've found works best for managing data in data science projects: treat your data access like a well-organized library. Instead of scattered scripts that load data in different ways, I create data catalogs - Python modules with clean, importable functions that handle all the messy details of data access.
What are data catalogs?
I think about data catalogs as a single source of truth for all your data assets. They're Python modules with functions like load_customer_data() or load_2024_sales_data() that encapsulate everything needed to access your data. The beauty of this approach is that you never have to remember connection strings, file paths, or data preprocessing steps - it's all handled in one place.
Behind these functions are the configurations needed to access the data, following 12-factor app principles. The docstring provides a description of the data, giving you both the data and its metadata in a single place.
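For illustration, here's roughly what an analysis script looks like once a catalog is in place. The module name data_catalog is an assumption for this sketch; use whatever your project calls it.

# analysis.py - hypothetical script importing from the catalog module
from data_catalog import load_customer_data, load_2024_sales_data

# Connection details, file paths, and cleaning all live inside the catalog;
# the analysis code only ever sees tidy DataFrames.
customers = load_customer_data()
sales = load_2024_sales_data()

top_customers = customers.sort_values("total_orders", ascending=False).head(10)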
Traditional data catalog examples
Loading from a database
Here's how I handle database connections. No more scattered connection strings or hardcoded credentials:
import os
import psycopg2
import pandas as pd
def load_customer_data():
    """
    Load active customer data from the production database.

    :returns: Customer data with columns: customer_id, name, email, signup_date, total_orders
    :rtype: pandas.DataFrame
    """
    connection = psycopg2.connect(
        dbname=os.getenv('DB_NAME'),
        user=os.getenv('DB_USER'),
        password=os.getenv('DB_PASSWORD'),
        host=os.getenv('DB_HOST'),
        port=os.getenv('DB_PORT')
    )
    query = """
        SELECT customer_id, name, email, signup_date, total_orders
        FROM customers
        WHERE active = true
    """
    df = pd.read_sql_query(query, connection)
    connection.close()
    return df
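Once the five DB_* environment variables are set (via your shell, a .env file, or your deployment platform), the call site stays trivial. A quick usage sketch:

# same call works in development, CI, and production - only the
# environment variables change
customers = load_customer_data()
print(customers[['customer_id', 'total_orders']].head())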
Processing files into a unified dataset
When I need to combine multiple data sources, I handle all the complexity in the catalog function:
import pandas as pd
from pathlib import Path
def load_2024_sales_data():
    """
    Load and combine Q1 sales data from multiple sources.

    :returns: Combined sales data with columns: date, product_id, quantity, revenue, channel
    :rtype: pandas.DataFrame
    """
    data_dir = Path("data/sales/2024")

    # Load from different sources
    online_df = pd.read_csv(data_dir / "online_sales_q1.csv")
    retail_df = pd.read_csv(data_dir / "retail_sales_q1.csv")

    # Tag each source with its sales channel
    online_df['channel'] = 'online'
    retail_df['channel'] = 'retail'

    # Combine and clean
    combined_df = pd.concat([online_df, retail_df], ignore_index=True)
    combined_df['date'] = pd.to_datetime(combined_df['date'])

    return combined_df
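Downstream code can then lean on the guarantees the docstring makes. A small sanity-check sketch (the checks are illustrative assumptions, not part of the catalog):

sales = load_2024_sales_data()

# every row comes from one of the two known sources
assert set(sales['channel'].unique()) <= {'online', 'retail'}

# revenue broken down by channel, straight from the unified dataset
print(sales.groupby('channel')['revenue'].sum())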
Modern ML data storage with xarray and zarr
The future of ML data splits
Many data scientists still use pickle files for storing train/test/validation splits. But here's what I've discovered: xarray datasets are the natural data structure for ML data splits.
Here's why this approach is such a step up for ML workflows:
Why xarray beats pickle for ML data
Think about what you're actually storing when you create data splits:
- Images or features (the X data)
- Labels (the y data)
- The split assignment (train/test/val)
Traditional approaches scatter this information across multiple files or lose the split information entirely. With xarray, you store everything together:
import xarray as xr
import numpy as np
import pandas as pd

def create_ml_dataset_splits(images, labels, split_assignments):
    """
    Create a unified xarray dataset containing images, labels, and split assignments.

    :param images: Image data (n_samples, height, width, channels)
    :type images: numpy.ndarray
    :param labels: Label data (n_samples,)
    :type labels: numpy.ndarray
    :param split_assignments: Split labels ['train', 'test', 'val']
    :type split_assignments: numpy.ndarray
    :returns: Unified dataset with all ML data and metadata
    :rtype: xarray.Dataset
    """
    ds = xr.Dataset({
        'images': (['sample', 'height', 'width', 'channel'], images),
        'labels': (['sample'], labels),
        'split': (['sample'], split_assignments),
    })

    # Add metadata
    ds.attrs['created_date'] = pd.Timestamp.now().isoformat()
    ds.attrs['n_classes'] = len(np.unique(labels))
    ds.attrs['image_shape'] = images.shape[1:]

    return ds
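The load functions below read this dataset from a zarr store, so here's a minimal sketch of writing one. The toy data, local path, and chunking suggestion are assumptions for illustration only:

import numpy as np

# toy data just to exercise the catalog function above
images = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype='uint8')
labels = np.random.randint(0, 10, size=100)
splits = np.random.choice(['train', 'test', 'val'], size=100)

ds = create_ml_dataset_splits(images, labels, splits)

# Write to a zarr store. With dask installed you can chunk along the sample
# dimension first (e.g. ds.chunk({'sample': 32})) so batches can later be
# read independently; an s3:// path also works if s3fs is installed.
ds.to_zarr('ml-dataset.zarr', mode='w')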
Now you can load specific splits from this unified dataset:
def load_training_data():
    """
    Load training split from the ML dataset.

    :returns: (X_train, y_train) ready for model training
    :rtype: tuple
    """
    # Load from S3/cloud storage if using zarr format
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Filter to training data only
    train_data = ds.where(ds.split == 'train', drop=True)

    return train_data.images.values, train_data.labels.values
The zarr advantage for cloud storage
When you store xarray datasets in zarr format, you get cloud-native data access. This means you can query your data splits directly from S3 without downloading entire datasets:
import numpy as np
import xarray as xr

def load_validation_batch(batch_size=32):
    """
    Load a random batch of validation data directly from S3.

    No need to download the entire dataset - zarr handles the streaming.

    :param batch_size: Number of samples to load
    :type batch_size: int
    :returns: (X_val, y_val) validation batch
    :rtype: tuple
    """
    # Open dataset directly from S3
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Get validation data
    val_data = ds.where(ds.split == 'val', drop=True)

    # Random sampling ('sample' is a dimension, so use sizes rather than len())
    indices = np.random.choice(val_data.sizes['sample'], batch_size, replace=False)
    batch = val_data.isel(sample=indices)

    return batch.images.values, batch.labels.values
What are the advantages of this approach?
- Single source of truth: Data, labels, and split assignments live together
- Cloud-native: Query directly from S3 without downloading
- Metadata included: Dataset shape, creation date, class information all stored together
- Scalable: Works with datasets too large to fit in memory
- Reproducible: Split assignments are permanent and queryable
I've lost count of how many times I've seen teams lose track of which samples were in which split. This approach eliminates that problem entirely.
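For example, because the split assignments live inside the dataset itself, you can audit them at any time straight from the store. A sketch, reusing the bucket path from the examples above:

import numpy as np
import xarray as xr

ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

# only the small 'split' variable is read - no images are downloaded
values, counts = np.unique(ds['split'].values, return_counts=True)
print(dict(zip(values, counts)))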
What are the advantages of data catalogs?
A well-designed data catalog is more than just a convenience; it's a strategic asset for any data-driven project. By centralizing your data access logic, you not only make your codebase easier to maintain, but you also lay the groundwork for reproducibility, scalability, and robust documentation. Of course, as with any abstraction, there are trade-offs to consider. The following table summarizes the key benefits and challenges you can expect when adopting a data catalog approach:
| 🎯 The Wins | ⚠️ The Challenges |
|---|---|
| Single source of truth: One place to find all your data access patterns | Initial setup cost: Takes time to build the catalog functions upfront |
| Clean abstraction: Your analysis code stays focused on analysis, not data plumbing | Dependency management: Need to handle database drivers, cloud credentials, etc. |
| Scalability: Adding new data sources doesn't break existing code | Version management: Data schemas can change over time |
| Documentation: Docstrings provide data dictionaries alongside the data | Team coordination: Everyone needs to use the catalog instead of ad-hoc data loading |
| Configuration management: Environment variables handle different deployment contexts | |
| Reproducibility: Same function call works across different environments | |
When should you use data catalogs?
Just-in-time adoption of a practice is preferable to over-engineering. Start with data catalogs when:
- You're loading the same data in multiple places
- Your team is growing and needs consistent data access
- You're dealing with complex data preprocessing
- You want to separate data access from analysis logic
Remember: the goal isn't to replace simple pd.read_csv() calls, but to amplify your team's ability to work with complex, multi-source data consistently.
Time will distill the best practices for your specific context, but this approach gives you a solid foundation for scalable data management.