Choose your data formats wisely
I've lost count of how many times I've seen analysis pipelines break because of poor data format choices. A column of zero-padded integers gets interpreted as regular numbers, floating-point precision gets mangled, or metadata gets lost in translation. These aren't just annoyances - they're the kind of silent failures that can invalidate entire analyses.
The right format choices can make your data self-documenting and your workflows bulletproof. The wrong choices create coordination nightmares that compound over time.
The binary vs text format decision
When choosing data formats, you're fundamentally deciding between binary formats (Parquet, Feather, HDF5, Zarr) and text formats (CSV, JSON, YAML). I've found that this choice often comes down to whether you're a programmer or not.
Why binary formats win for source of truth
I'm a programmer, so I naturally lean toward binary formats. But this isn't just preference - it's about solving real problems I've encountered repeatedly.
Metadata preservation: Binary formats preserve column types, missing value indicators, and other metadata that text formats lose. Consider a column of zero-padded integers stored as strings (e.g., "00015"). If you save that to CSV, some readers will interpret it as 15, losing the leading zeros that might be crucial for your analysis. I've seen this exact issue break downstream analysis pipelines.
Precision and fidelity: Text formats can introduce rounding errors and precision issues, especially with floating-point numbers. Binary formats maintain exact numerical precision.
Performance: Binary formats are significantly faster to read and write, especially for large datasets. They also compress better, reducing storage costs.
Type safety: Binary formats enforce data types, preventing the silent type conversions that can break analysis pipelines.
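As a concrete illustration of the zero-padding failure above, here's a minimal sketch - it assumes pandas with a Parquet engine such as pyarrow installed, and the file names are purely illustrative:

```python
import pandas as pd

# A column of zero-padded identifiers stored as strings
df = pd.DataFrame({"sample_id": ["00015", "00230"], "value": [1.5, 2.0]})

# Round-trip through CSV: the default reader infers integers and drops the zeros
df.to_csv("samples.csv", index=False)
from_csv = pd.read_csv("samples.csv")
print(from_csv["sample_id"].tolist())  # [15, 230] - leading zeros are gone

# Round-trip through Parquet: the string dtype survives intact
df.to_parquet("samples.parquet")
from_parquet = pd.read_parquet("samples.parquet")
print(from_parquet["sample_id"].tolist())  # ['00015', '00230']
```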
When text formats serve a purpose
But I also recognize that non-programmers often need plain-text or Excel versions of the data you're sharing with them. Text formats have their place - as derivative outputs rather than the source of truth:
Human readability: Non-programmers can easily inspect and understand CSV files. Business stakeholders often need Excel-compatible formats.
Version control: Text formats work well with git for small datasets, allowing you to track changes over time.
Interoperability: Many tools and platforms expect CSV or JSON inputs.
Debugging: Text formats make it easy to quickly inspect data during development.
The hybrid approach: Binary source, text derivatives
My recommendation is to declare the binary format the source of truth and treat your CSV files as derivative dumps generated from it. That's totally fine. What you do need to do is make sure everyone shares the business understanding that the CSV is no longer the source of truth - that understanding has to be explicit.
In some ways, this approach overrides the simple binary vs text choice. It's about establishing a clear hierarchy:
- Binary formats (Parquet/Feather) as source of truth - preserving all metadata and precision
- Text formats (CSV/Excel) as derivative outputs - for stakeholder consumption
- Clear business understanding - everyone knows CSV is not the authoritative source
Implementation workflow
```python
import pandas as pd

# df is the DataFrame produced by your processing pipeline

# Your source of truth - binary format with full metadata
df.to_parquet("data/processed/experiment_results.parquet")

# Derivative outputs for stakeholders
df.to_csv("data/exports/experiment_results.csv", index=False)

# For Excel users (requires an engine such as openpyxl)
df.to_excel("data/exports/experiment_results.xlsx", index=False)
```
Managing expectations
The key is establishing clear communication:
- CSV files are for consumption, not editing - any changes should go back to the source data
- Binary files contain the authoritative data - use these for analysis and processing
- Automated generation - set up pipelines to automatically regenerate CSV exports when source data changes
This approach respects both the technical needs of data scientists and the practical needs of business stakeholders. It's not about choosing one over the other - it's about making each format serve its purpose in your workflow.
High-dimensional data: The xarray advantage
For very high-dimensional or multi-dimensional data, xarray is another very good tool to lean into: it pairs a labeled, coordinate-aware data model with on-disk formats like Zarr and NetCDF. I wrote a blog post as an example of what is possible with xarray.
The coordinate-based paradigm
Xarray makes coordinates the foundation of your data structure. Instead of managing separate files with separate indexing schemes, you create one unified dataset where every piece of data knows exactly which experimental unit it belongs to.
```python
import xarray as xr

# Unified dataset with meaningful coordinates.
# expression_data, feature_matrix, model_predictions, and train_masks are
# arrays produced earlier in the pipeline.
unified_dataset = xr.Dataset({
    'expression_level': (['mirna', 'treatment', 'time_point', 'replicate'], expression_data),
    'ml_features': (['mirna', 'feature'], feature_matrix),
    'model_results': (['mirna'], model_predictions),
    'train_mask': (['mirna', 'split_type'], train_masks),
})
```
Benefits of unified storage
The beauty of this approach is that every piece of data is automatically aligned by shared coordinates. No more manual bookkeeping or index juggling across multiple files. When you slice your data, everything stays aligned automatically - you don't have to worry about applying the same filtering to all your files.
Store your unified dataset in Zarr format and it becomes cloud-ready, supporting chunking, compression, and parallel access. You can start with core experimental data, then progressively add statistical results, ML features, and data splits. Each stage builds on the previous coordinate system, so everything stays connected.
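Here's a minimal sketch of what that looks like in practice. The dataset, coordinate labels, and store path are illustrative, and writing Zarr assumes the `zarr` package is installed:

```python
import numpy as np
import xarray as xr

# A tiny illustrative dataset: two variables sharing the "mirna" coordinate
ds = xr.Dataset(
    {
        "expression_level": (["mirna", "replicate"], np.random.rand(3, 4)),
        "model_results": (["mirna"], np.random.rand(3)),
    },
    coords={"mirna": ["mir-21", "mir-155", "mir-34a"], "replicate": [1, 2, 3, 4]},
)

# Slicing by label keeps every variable aligned - no manual index juggling
subset = ds.sel(mirna=["mir-21", "mir-34a"])
print(subset["expression_level"].shape)  # (2, 4)
print(subset["model_results"].shape)     # (2,)

# Persist the whole thing as a compressed, cloud-ready Zarr store
ds.to_zarr("unified_dataset.zarr", mode="w")
```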
When to use xarray
Xarray shines for:
- Laboratory data with multiple experimental factors
- Time series with multiple dimensions
- Machine learning workflows with features, targets, and splits
- Geospatial data with coordinates and time
- Any data where you need to maintain relationships across multiple files
Format-specific recommendations
| Data Type | Source of Truth | Why | Derivatives | When to Use |
|---|---|---|---|---|
| Tabular data | Parquet or Feather | Excellent compression, preserves types, language-agnostic | CSV for stakeholders, JSON for APIs | Most data science workflows |
| High-dimensional data | Zarr (via xarray) | Cloud-native, preserves coordinates, scales to terabytes | NetCDF for scientific workflows | Laboratory data, time series, ML workflows |
| Configuration | YAML or TOML | Human-readable, supports comments, version control friendly | JSON for programmatic access | Project configs, metadata |
| Small datasets | CSV or JSON | Version control friendly, human readable | None needed | Prototyping, small analyses |
| Large time series | Zarr | Efficient compression, supports chunking | CSV for specific time ranges | Sensor data, financial data |
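For the configuration row, here's a small sketch of a YAML config and how to load it; the keys, paths, and the `pyyaml` dependency are assumptions for illustration, not a prescribed schema:

```python
import yaml  # the pyyaml package

# Hypothetical contents of config/data.yaml, inlined so the sketch is self-contained
config_text = """
data:
  source_of_truth: s3://your-bucket/project-name/processed/
  exports: s3://your-bucket/project-name/exports/
  local_samples: data/samples/
"""

config = yaml.safe_load(config_text)
print(config["data"]["source_of_truth"])  # s3://your-bucket/project-name/processed/
```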
How to implement this in practice
Now that you understand the format choices, here's how to actually implement this in your projects. I've found that the key is to start simple and build up complexity as you need it.
Start with your project structure
First, organize your data directories to reflect the source-of-truth principle. Important: Data should not live in your repository. Use external storage (S3, cloud storage, etc.) for your actual data, and only keep small subsamples in the repo for testing.
```text
project/
├── data/          # Small test datasets only (gitignored)
│   ├── samples/   # Subsamples for unit testing
│   └── temp/      # Temporary files (gitignored)
├── scripts/       # Data processing scripts
└── config/        # Data access configuration
```
For your actual data, use external storage with clear paths:
```text
s3://your-bucket/project-name/
├── raw/           # Original data as received
├── processed/     # Your binary source of truth
└── exports/       # Text derivatives for stakeholders
```
This keeps your repo clean while maintaining the same organizational principles.
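To populate `data/samples/`, one option is to cut a small, reproducible subsample from the external source of truth. A hedged sketch - reading directly from S3 assumes `s3fs` is installed, and the paths and sample size are illustrative:

```python
import pandas as pd

# Pull the full dataset from the external source of truth
df = pd.read_parquet("s3://your-bucket/project-name/processed/experiment_results.parquet")

# Keep only a small, reproducible subsample in the repo for testing
df.sample(n=500, random_state=42).to_parquet("data/samples/experiment_results_sample.parquet")
```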
Set up automated conversions
Don't manually convert formats every time. Set up simple scripts that regenerate your derivatives automatically:
```python
# scripts/export_data.py
import boto3
import pandas as pd


def export_derivatives():
    """Export binary source-of-truth data to CSV derivatives for stakeholders."""
    s3 = boto3.client("s3")
    bucket = "your-bucket"

    # List processed (source-of-truth) files in S3
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix="project-name/processed/",
    )

    for obj in response.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            # Download the Parquet source and read it with full type fidelity
            s3.download_file(bucket, obj["Key"], "/tmp/temp.parquet")
            df = pd.read_parquet("/tmp/temp.parquet")

            # Write and upload the CSV derivative alongside the source
            csv_key = obj["Key"].replace("/processed/", "/exports/").replace(".parquet", ".csv")
            df.to_csv("/tmp/temp.csv", index=False)
            s3.upload_file("/tmp/temp.csv", bucket, csv_key)


if __name__ == "__main__":
    export_derivatives()
```
Run this script whenever your source data changes, or better yet, hook it into your data processing pipeline.
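One way to hook it in, assuming your pipeline has a single Python entry point - the `run_pipeline.py` name and the `process_experiment` function are hypothetical stand-ins for whatever your project already has:

```python
# scripts/run_pipeline.py - illustrative wiring only
from export_data import export_derivatives


def process_experiment() -> None:
    """Placeholder for the processing step that writes Parquet to the processed/ prefix."""
    ...


if __name__ == "__main__":
    process_experiment()
    # Regenerate the CSV derivatives as the final pipeline step
    export_derivatives()
```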
Document your choices
Put this in your project README so everyone knows the rules:
```markdown
## Data Formats

- **Source of truth**: Binary formats (Parquet/Feather) in S3 `processed/` folder
- **Derivatives**: Text formats (CSV/Excel) in S3 `exports/` folder
- **Rule**: Never edit CSV files directly - changes go back to source data
- **Local data**: Only small test samples in `data/samples/` (gitignored)
```
This eliminates confusion and sets clear expectations for your team.
Handle the performance trade-offs
For large datasets, you'll hit performance bottlenecks. Here's what I've learned:
**Profile first**: Use `%timeit` in Jupyter or Python's `cProfile` to see where your I/O is slow. You might be surprised - sometimes the bottleneck isn't where you think.
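If you want a concrete starting point, here's a minimal profiling sketch using the standard library's `cProfile` and `pstats`; the file path and the `load_and_summarize` helper are illustrative:

```python
import cProfile
import pstats

import pandas as pd


def load_and_summarize(path):
    """Stand-in for the I/O-heavy step you want to profile."""
    df = pd.read_parquet(path)
    return df.describe()


profiler = cProfile.Profile()
profiler.enable()
load_and_summarize("data/processed/experiment_results.parquet")
profiler.disable()

# Show the ten slowest calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```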
**Chunk your data**: For datasets that don't fit in memory, use chunking:

```python
import pandas as pd

# Read the large CSV in chunks and write each one to the binary format
for i, chunk in enumerate(pd.read_csv("huge_file.csv", chunksize=10000)):
    # Process each chunk, then persist it as Parquet
    chunk.to_parquet(f"processed/chunk_{i}.parquet")
```
**Use memory mapping**: For very large files, consider memory mapping with Python's `mmap`, or reach for `dask` to process data out of core.
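For the out-of-core route, a minimal `dask` sketch - it assumes dask is installed and that your chunked Parquet files live under the illustrative `processed/` path:

```python
import dask.dataframe as dd

# Lazily point at a directory of Parquet chunks; nothing is loaded yet
ddf = dd.read_parquet("processed/*.parquet")

# The computation streams through the chunks instead of loading everything at once
summary = ddf.describe().compute()
print(summary)
```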
The bottom line
The goal isn't to pick one format over another, but to establish a clear hierarchy where each format serves its purpose in your data pipeline. Your source of truth should be robust and precise, while your derivatives should be accessible and useful for the people who need them.
Choose formats that preserve your data integrity, serve your stakeholders, scale with your needs, and support your workflow. The right choices compound over time, making your projects more maintainable and your collaborations smoother.