Choose your data formats wisely
I've lost count of how many times I've seen analysis pipelines break because of poor data format choices. A column of zero-padded integers gets interpreted as regular numbers, floating-point precision gets mangled, or metadata gets lost in translation. These aren't just annoyances - they're the kind of silent failures that can invalidate entire analyses.
The right format choices can make your data self-documenting and your workflows bulletproof. The wrong choices create coordination nightmares that compound over time.
The binary vs text format decision
When choosing data formats, you're fundamentally deciding between binary formats (Parquet, Feather, HDF5, Zarr) and text formats (CSV, JSON, YAML). I've found that this choice often comes down to whether you're a programmer or not.
Why binary formats win for source of truth
I'm a programmer, so I naturally lean toward binary formats. But this isn't just preference - it's about solving real problems I've encountered repeatedly.
Metadata preservation: Binary formats preserve column types, missing value indicators, and other metadata that text formats lose. Consider a column of zero-padded integers stored as strings (e.g., "00015"). If you save that to CSV, some readers will interpret it as 15, losing the leading zeros that might be crucial for your analysis. I've seen this exact issue break downstream analysis pipelines.
Precision and fidelity: Text formats can introduce rounding errors and precision issues, especially with floating-point numbers. Binary formats maintain exact numerical precision.
Performance: Binary formats are significantly faster to read and write, especially for large datasets. They also compress better, reducing storage costs.
Type safety: Binary formats enforce data types, preventing the silent type conversions that can break analysis pipelines.
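As a concrete illustration of the zero-padding failure above, here's a minimal sketch - it assumes pandas with a Parquet engine such as pyarrow installed, and the file names are purely illustrative:

```python
import pandas as pd

# A column of zero-padded identifiers stored as strings
df = pd.DataFrame({"sample_id": ["00015", "00230"], "value": [1.5, 2.0]})

# Round-trip through CSV: the default reader infers integers and drops the zeros
df.to_csv("samples.csv", index=False)
from_csv = pd.read_csv("samples.csv")
print(from_csv["sample_id"].tolist())  # [15, 230] - leading zeros are gone

# Round-trip through Parquet: the string dtype survives intact
df.to_parquet("samples.parquet")
from_parquet = pd.read_parquet("samples.parquet")
print(from_parquet["sample_id"].tolist())  # ['00015', '00230']
```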
When text formats serve a purpose
But I also recognize that non-programmers often need plain-text or Excel versions of the data you're sharing with them. Text formats have their place - as derivative outputs rather than the source of truth:
Human readability: Non-programmers can easily inspect and understand CSV files. Business stakeholders often need Excel-compatible formats.
Version control: Text formats work well with git for small datasets, allowing you to track changes over time.
Interoperability: Many tools and platforms expect CSV or JSON inputs.
Debugging: Text formats make it easy to quickly inspect data during development.
The hybrid approach: Binary source, text derivatives
My recommendation is to declare the binary format the source of truth and treat your CSV files as derivative dumps generated from it. That's totally fine. What you do need to do is make sure everyone shares the business understanding that the CSV is no longer the source of truth - that understanding has to be explicit.
In some ways, this approach overrides the simple binary vs text choice. It's about establishing a clear hierarchy:
- Binary formats (Parquet/Feather) as source of truth - preserving all metadata and precision
- Text formats (CSV/Excel) as derivative outputs - for stakeholder consumption
- Clear business understanding - everyone knows CSV is not the authoritative source
Implementation workflow
```python
import pandas as pd

# df is the DataFrame produced by your processing pipeline

# Your source of truth - binary format with full metadata
df.to_parquet("data/processed/experiment_results.parquet")

# Derivative outputs for stakeholders
df.to_csv("data/exports/experiment_results.csv", index=False)

# For Excel users (requires an engine such as openpyxl)
df.to_excel("data/exports/experiment_results.xlsx", index=False)
```
Managing expectations
The key is establishing clear communication:
- CSV files are for consumption, not editing - any changes should go back to the source data
- Binary files contain the authoritative data - use these for analysis and processing
- Automated generation - set up pipelines to automatically regenerate CSV exports when source data changes
This approach respects both the technical needs of data scientists and the practical needs of business stakeholders. It's not about choosing one over the other - it's about making each format serve its purpose in your workflow.
High-dimensional data: The xarray advantage
For very high-dimensional or multi-dimensional data, xarray is another very good tool to lean into: it pairs a labeled, coordinate-aware data model with on-disk formats like Zarr and NetCDF. I wrote a blog post as an example of what is possible with xarray.
The coordinate-based paradigm
Xarray makes coordinates the foundation of your data structure. Instead of managing separate files with separate indexing schemes, you create one unified dataset where every piece of data knows exactly which experimental unit it belongs to.
```python
import xarray as xr

# Unified dataset with meaningful coordinates.
# expression_data, feature_matrix, model_predictions, and train_masks are
# arrays produced earlier in the pipeline.
unified_dataset = xr.Dataset({
    'expression_level': (['mirna', 'treatment', 'time_point', 'replicate'], expression_data),
    'ml_features': (['mirna', 'feature'], feature_matrix),
    'model_results': (['mirna'], model_predictions),
    'train_mask': (['mirna', 'split_type'], train_masks),
})
```
Benefits of unified storage
The beauty of this approach is that every piece of data is automatically aligned by shared coordinates. No more manual bookkeeping or index juggling across multiple files. When you slice your data, everything stays aligned automatically - you don't have to worry about applying the same filtering to all your files.
Store your unified dataset in Zarr format and it becomes cloud-ready, supporting chunking, compression, and parallel access. You can start with core experimental data, then progressively add statistical results, ML features, and data splits. Each stage builds on the previous coordinate system, so everything stays connected.
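Here's a minimal sketch of what that looks like in practice. The dataset, coordinate labels, and store path are illustrative, and writing Zarr assumes the `zarr` package is installed:

```python
import numpy as np
import xarray as xr

# A tiny illustrative dataset: two variables sharing the "mirna" coordinate
ds = xr.Dataset(
    {
        "expression_level": (["mirna", "replicate"], np.random.rand(3, 4)),
        "model_results": (["mirna"], np.random.rand(3)),
    },
    coords={"mirna": ["mir-21", "mir-155", "mir-34a"], "replicate": [1, 2, 3, 4]},
)

# Slicing by label keeps every variable aligned - no manual index juggling
subset = ds.sel(mirna=["mir-21", "mir-34a"])
print(subset["expression_level"].shape)  # (2, 4)
print(subset["model_results"].shape)     # (2,)

# Persist the whole thing as a compressed, cloud-ready Zarr store
ds.to_zarr("unified_dataset.zarr", mode="w")
```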
When to use xarray
Xarray shines for:
- Laboratory data with multiple experimental factors
- Time series with multiple dimensions
- Machine learning workflows with features, targets, and splits
- Geospatial data with coordinates and time
- Any data where you need to maintain relationships across multiple files
Format-specific recommendations
| Data Type | Source of Truth | Why | Derivatives | When to Use |
|---|---|---|---|---|
| Tabular data | Parquet or Feather | Excellent compression, preserves types, language-agnostic | CSV for stakeholders, JSON for APIs | Most data science workflows |
| High-dimensional data | Zarr (via xarray) | Cloud-native, preserves coordinates, scales to terabytes | NetCDF for scientific workflows | Laboratory data, time series, ML workflows |
| Configuration | YAML or TOML | Human-readable, supports comments, version control friendly | JSON for programmatic access | Project configs, metadata |
| Small datasets | CSV or JSON | Version control friendly, human readable | None needed | Prototyping, small analyses |
| Large time series | Zarr | Efficient compression, supports chunking | CSV for specific time ranges | Sensor data, financial data |
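For the configuration row, here's a small sketch of a YAML config and how to load it; the keys, paths, and the `pyyaml` dependency are assumptions for illustration, not a prescribed schema:

```python
import yaml  # the pyyaml package

# Hypothetical contents of config/data.yaml, inlined so the sketch is self-contained
config_text = """
data:
  source_of_truth: s3://your-bucket/project-name/processed/
  exports: s3://your-bucket/project-name/exports/
  local_samples: data/samples/
"""

config = yaml.safe_load(config_text)
print(config["data"]["source_of_truth"])  # s3://your-bucket/project-name/processed/
```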
How to implement this in practice
Now that you understand the format choices, here's how to actually implement this in your projects. I've found that the key is to start simple and build up complexity as you need it.
Start with your project structure
First, organize your data directories to reflect the source-of-truth principle. Important: Data should not live in your repository. Use external storage (S3, cloud storage, etc.) for your actual data, and only keep small subsamples in the repo for testing.
```text
project/
├── data/          # Small test datasets only (gitignored)
│   ├── samples/   # Subsamples for unit testing
│   └── temp/      # Temporary files (gitignored)
├── scripts/       # Data processing scripts
└── config/        # Data access configuration
```
For your actual data, use external storage with clear paths:
```text
s3://your-bucket/project-name/
├── raw/           # Original data as received
├── processed/     # Your binary source of truth
└── exports/       # Text derivatives for stakeholders
```
This keeps your repo clean while maintaining the same organizational principles.
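To populate `data/samples/`, one option is to cut a small, reproducible subsample from the external source of truth. A hedged sketch - reading directly from S3 assumes `s3fs` is installed, and the paths and sample size are illustrative:

```python
import pandas as pd

# Pull the full dataset from the external source of truth
df = pd.read_parquet("s3://your-bucket/project-name/processed/experiment_results.parquet")

# Keep only a small, reproducible subsample in the repo for testing
df.sample(n=500, random_state=42).to_parquet("data/samples/experiment_results_sample.parquet")
```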
Set up automated conversions
Don't manually convert formats every time. Set up simple scripts that regenerate your derivatives automatically:
```python
# scripts/export_data.py
import boto3
import pandas as pd


def export_derivatives():
    """Export binary source-of-truth data to CSV derivatives for stakeholders."""
    s3 = boto3.client("s3")
    bucket = "your-bucket"

    # List processed (source-of-truth) files in S3
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix="project-name/processed/",
    )

    for obj in response.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            # Download the Parquet source and read it with full type fidelity
            s3.download_file(bucket, obj["Key"], "/tmp/temp.parquet")
            df = pd.read_parquet("/tmp/temp.parquet")

            # Write and upload the CSV derivative alongside the source
            csv_key = obj["Key"].replace("/processed/", "/exports/").replace(".parquet", ".csv")
            df.to_csv("/tmp/temp.csv", index=False)
            s3.upload_file("/tmp/temp.csv", bucket, csv_key)


if __name__ == "__main__":
    export_derivatives()
```
Run this script whenever your source data changes, or better yet, hook it into your data processing pipeline.
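One way to hook it in, assuming your pipeline has a single Python entry point - the `run_pipeline.py` name and the `process_experiment` function are hypothetical stand-ins for whatever your project already has:

```python
# scripts/run_pipeline.py - illustrative wiring only
from export_data import export_derivatives


def process_experiment() -> None:
    """Placeholder for the processing step that writes Parquet to the processed/ prefix."""
    ...


if __name__ == "__main__":
    process_experiment()
    # Regenerate the CSV derivatives as the final pipeline step
    export_derivatives()
```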
Document your choices
Put this in your project README so everyone knows the rules:
```markdown
## Data Formats

- **Source of truth**: Binary formats (Parquet/Feather) in S3 `processed/` folder
- **Derivatives**: Text formats (CSV/Excel) in S3 `exports/` folder
- **Rule**: Never edit CSV files directly - changes go back to source data
- **Local data**: Only small test samples in `data/samples/` (gitignored)
```
This eliminates confusion and sets clear expectations for your team.
Handle the performance trade-offs
For large datasets, you'll hit performance bottlenecks. Here's what I've learned:
**Profile first**: Use `%timeit` in Jupyter or Python's `cProfile` to see where your I/O is slow. You might be surprised - sometimes the bottleneck isn't where you think.
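If you want a concrete starting point, here's a minimal profiling sketch using the standard library's `cProfile` and `pstats`; the file path and the `load_and_summarize` helper are illustrative:

```python
import cProfile
import pstats

import pandas as pd


def load_and_summarize(path):
    """Stand-in for the I/O-heavy step you want to profile."""
    df = pd.read_parquet(path)
    return df.describe()


profiler = cProfile.Profile()
profiler.enable()
load_and_summarize("data/processed/experiment_results.parquet")
profiler.disable()

# Show the ten slowest calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```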
**Chunk your data**: For datasets that don't fit in memory, use chunking:

```python
import pandas as pd

# Read the large CSV in chunks and write each one to the binary format
for i, chunk in enumerate(pd.read_csv("huge_file.csv", chunksize=10000)):
    # Process each chunk, then persist it as Parquet
    chunk.to_parquet(f"processed/chunk_{i}.parquet")
```
**Use memory mapping**: For very large files, consider memory mapping with Python's `mmap`, or reach for `dask` to process data out of core.
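For the out-of-core route, a minimal `dask` sketch - it assumes dask is installed and that your chunked Parquet files live under the illustrative `processed/` path:

```python
import dask.dataframe as dd

# Lazily point at a directory of Parquet chunks; nothing is loaded yet
ddf = dd.read_parquet("processed/*.parquet")

# The computation streams through the chunks instead of loading everything at once
summary = ddf.describe().compute()
print(summary)
```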
The bottom line
The goal isn't to pick one format over another, but to establish a clear hierarchy where each format serves its purpose in your data pipeline. Your source of truth should be robust and precise, while your derivatives should be accessible and useful for the people who need them.
Choose formats that preserve your data integrity, serve your stakeholders, scale with your needs, and support your workflow. The right choices compound over time, making your projects more maintainable and your collaborations smoother.