Use data catalogs to manage data
Here's what I've found works best for managing data in data science projects: treat your data access like a well-organized library. Instead of scattered scripts that load data in different ways, I create data catalogs - Python modules with clean, importable functions that handle all the messy details of data access.
What are data catalogs?
I think of data catalogs as a single source of truth for all your data assets. They're Python modules with functions like load_customer_data() or load_2024_sales_data() that encapsulate everything needed to access your data. The beauty of this approach is that you never have to remember connection strings, file paths, or data preprocessing steps - it's all handled in one place.
Behind these functions are the configurations needed to access the data, following 12-factor app principles. The docstring provides the description of the data, giving you both the data and its metadata in a single place.
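To make this concrete, here's a minimal sketch of what consuming a catalog looks like from the analysis side. The module name data_catalog is just an assumption for illustration; the two functions it imports are defined in the examples below.

# analysis.py - no connection strings, file paths, or cleanup logic here
from data_catalog import load_customer_data, load_2024_sales_data

customers = load_customer_data()
sales = load_2024_sales_data()

# Analysis code stays focused on analysis; the plumbing lives in the catalog
top_customers = customers.sort_values('total_orders', ascending=False).head(10)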
Traditional data catalog examples
Loading from a database
Here's how I handle database connections. No more scattered connection strings or hardcoded credentials:
import os
import psycopg2
import pandas as pd
def load_customer_data():
    """
    Loads customer data from the production database.

    Returns:
        DataFrame: Customer data with columns:
            - customer_id (int): Unique customer identifier
            - name (str): Customer full name
            - email (str): Customer email address
            - signup_date (datetime): When customer signed up
            - total_orders (int): Total number of orders placed
    """
    connection = psycopg2.connect(
        dbname=os.getenv('DB_NAME'),
        user=os.getenv('DB_USER'),
        password=os.getenv('DB_PASSWORD'),
        host=os.getenv('DB_HOST'),
        port=os.getenv('DB_PORT')
    )
    query = """
        SELECT
            customer_id,
            name,
            email,
            signup_date,
            total_orders
        FROM customers
        WHERE active = true
    """
    df = pd.read_sql_query(query, connection)
    connection.close()
    return df
Processing files into a unified dataset
When I need to combine multiple data sources, I handle all the complexity in the catalog function:
import pandas as pd
from pathlib import Path
def load_2024_sales_data():
    """
    Loads and combines Q1 sales data from multiple sources.

    Returns:
        DataFrame: Combined sales data with columns:
            - date (datetime): Sale date
            - product_id (str): Product identifier
            - quantity (int): Units sold
            - revenue (float): Total revenue in USD
            - channel (str): Sales channel (online/retail)
    """
    data_dir = Path("data/sales/2024")

    # Load from different sources
    online_df = pd.read_csv(data_dir / "online_sales_q1.csv")
    retail_df = pd.read_csv(data_dir / "retail_sales_q1.csv")

    # Label each source with its sales channel
    online_df['channel'] = 'online'
    retail_df['channel'] = 'retail'

    # Combine and clean
    combined_df = pd.concat([online_df, retail_df], ignore_index=True)
    combined_df['date'] = pd.to_datetime(combined_df['date'])

    return combined_df
Modern ML data storage with xarray and zarr
The future of ML data splits
Most data scientists still use pickle files for storing train/test/validation splits. But here's what I've discovered: xarray datasets are the natural data structure for ML data splits.
Here's why this approach is revolutionary for ML workflows:
Why xarray beats pickle for ML data
Think about what you're actually storing when you create data splits:
- Images or features (the X data)
- Labels (the y data)
- The split assignment (train/test/val)
Traditional approaches scatter this information across multiple files or lose the split information entirely. With xarray, you store everything together:
import numpy as np
import pandas as pd
import xarray as xr

def create_ml_dataset_splits(images, labels, split_assignments):
    """
    Creates a unified xarray dataset containing images, labels, and split assignments.

    Args:
        images (np.array): Image data (n_samples, height, width, channels)
        labels (np.array): Label data (n_samples,)
        split_assignments (np.array): Split labels ['train', 'test', 'val']

    Returns:
        xarray.Dataset: Unified dataset with all ML data and metadata
    """
    ds = xr.Dataset({
        'images': (['sample', 'height', 'width', 'channel'], images),
        'labels': (['sample'], labels),
        'split': (['sample'], split_assignments),
    })

    # Add metadata
    ds.attrs['created_date'] = pd.Timestamp.now().isoformat()
    ds.attrs['n_classes'] = len(np.unique(labels))
    ds.attrs['image_shape'] = images.shape[1:]

    return ds
def load_training_data():
    """
    Loads training split from the ML dataset.

    Returns:
        tuple: (X_train, y_train) ready for model training
    """
    # Load from S3/cloud storage if using zarr format
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Filter to training data only
    train_data = ds.where(ds.split == 'train', drop=True)

    return train_data.images.values, train_data.labels.values
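One thing the catalog function above glosses over is how the dataset got to S3 in the first place. Here's a rough sketch of that step, assuming the same bucket path as above; writing to an s3:// URL requires the s3fs package, and a local directory path works just as well while developing.

def save_ml_dataset(ds, path='s3://my-bucket/ml-dataset.zarr'):
    """
    Persists the unified xarray dataset to zarr storage.

    Chunking along the sample dimension first (via ds.chunk, which requires
    dask) is optional but makes later per-split and per-batch reads cheaper.
    """
    ds.to_zarr(path, mode='w')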
The zarr advantage for cloud storage
When you store xarray datasets in zarr format, you get cloud-native data access. This means you can query your data splits directly from S3 without downloading entire datasets:
import numpy as np
import xarray as xr

def load_validation_batch(batch_size=32):
    """
    Loads a random batch of validation data directly from S3.

    No need to download the entire dataset - zarr handles the streaming.
    """
    # Open dataset directly from S3
    ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

    # Get validation data
    val_data = ds.where(ds.split == 'val', drop=True)

    # Random sampling without replacement along the sample dimension
    indices = np.random.choice(val_data.sizes['sample'], batch_size, replace=False)
    batch = val_data.isel(sample=indices)

    return batch.images.values, batch.labels.values
What are the advantages of this approach?
- Single source of truth: Data, labels, and split assignments live together
- Cloud-native: Query directly from S3 without downloading
- Metadata included: Dataset shape, creation date, class information all stored together
- Scalable: Works with datasets too large to fit in memory
- Reproducible: Split assignments are permanent and queryable
I've lost count of how many times I've seen teams lose track of which samples were in which split. This approach eliminates that problem entirely.
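Because the split assignments live in the dataset itself, auditing them is a one-liner. A quick sketch, reusing the same bucket path as above:

import numpy as np
import xarray as xr

ds = xr.open_zarr('s3://my-bucket/ml-dataset.zarr')

# The split assignment is stored alongside the data, so anyone on the team
# can check exactly how many samples landed in each split
splits, counts = np.unique(ds.split.values, return_counts=True)
print(dict(zip(splits, counts)))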
What are the advantages of data catalogs?
The wins
- Single source of truth: One place to find all your data access patterns
- Clean abstraction: Your analysis code stays focused on analysis, not data plumbing
- Scalability: Adding new data sources doesn't break existing code
- Documentation: Docstrings provide data dictionaries alongside the data
- Configuration management: Environment variables handle different deployment contexts
- Reproducibility: Same function call works across different environments
The challenges
- Initial setup cost: Takes time to build the catalog functions upfront
- Dependency management: Need to handle database drivers, cloud credentials, etc.
- Version management: Data schemas can change over time
- Team coordination: Everyone needs to use the catalog instead of ad-hoc data loading
When should you use data catalogs?
Just-in-time adoption of a practice is preferable to over-engineering. Start with data catalogs when:
- You're loading the same data in multiple places
- Your team is growing and needs consistent data access
- You're dealing with complex data preprocessing
- You want to separate data access from analysis logic
Remember: the goal isn't to replace simple pd.read_csv() calls, but to amplify your team's ability to work with complex, multi-source data consistently.
Time will distill the best practices for your specific context, but this approach gives you a solid foundation for scalable data management.