How I structure a PhD-scale data pipeline
Three years into my PhD, I've finally landed on a data setup that survives both reviewer #2 and the 2 a.m. cluster crash. Like most things in research, it's the third or fourth iteration, and the first two were embarrassingly bad. I'm writing this down partly so I don't forget, and partly so the next first-year in my group has something better to inherit than a Slack DM that says "oh, just look at how I did it for the last paper."
The whole thing is built around one principle: a result you can't re-derive in a single command is a result you can't trust. Everything below is in service of that.
1. Raw data is sacred and read-only
The single biggest mistake I see new students make is editing raw data files in place — cropping, renaming, "fixing" header fields. By the time you publish, you can't tell what came off the microscope and what you smoothed away on a Tuesday afternoon eight months ago.
My rule: raw/ is mounted read-only. Always. If a transformation is needed, it lives in code that reads from raw/ and writes to derived/. The transform is committed to git; its output goes to a content-addressed cache, keyed by a hash of the bytes.
project/
├── raw/ # read-only. don't touch.
├── derived/ # generated. throwaway.
├── analysis/ # notebooks & scripts
├── figures/ # paper-ready plots
└── pipelines/ # the actual transforms
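A minimal sketch of the read-from-raw, write-to-cache pattern, assuming the cache lives under derived/ and uses SHA-256 digests as file names. The function names and the uppercase "transform" are hypothetical stand-ins, not the code from my actual repo.

```python
import hashlib
from pathlib import Path


def store_content_addressed(data: bytes, cache_dir: Path) -> Path:
    """Write `data` under a file name derived from its SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / digest
    if not target.exists():
        # Write to a temporary name first, then rename, so a crash
        # never leaves a half-written entry under the final name.
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(target)
    return target


def run_transform(raw_file: Path, cache_dir: Path) -> Path:
    """Hypothetical transform: reads from raw/, never writes back to it."""
    raw_bytes = raw_file.read_bytes()
    derived = raw_bytes.upper()  # stand-in for a real transform
    return store_content_addressed(derived, cache_dir)
```

Because the name is the hash, running the same transform on the same input twice lands on the same path, and the raw file is only ever opened for reading.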
2. One pipeline, many entry points
I used to have a separate Snakefile, a Makefile, and a folder of bash scripts that all did adjacent things. Now I have one DAG, written with Hamilton for the in-Python parts and Snakemake for the cluster-y parts.
The win is that any cached intermediate is queryable as a node. When a reviewer asks "what if you re-ran this with σ=0.5 instead of 1.0?", the answer is one CLI flag, not a half-day of detective work.
The graph is the documentation. If you can't draw your data flow as a DAG, you don't actually understand it yet.
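To make the "one CLI flag" idea concrete, here is a toy sketch in plain Python. It deliberately does not use the Hamilton or Snakemake APIs; the node name, the `--sigma` flag, the in-memory cache, and the 3-point smoothing are all assumptions for illustration. The point is only that a node's cache key includes its parameters, so changing σ produces a new node output instead of overwriting the old one.

```python
import argparse
import hashlib
import json
import math


def cache_key(node: str, params: dict) -> str:
    """Key a cached node output by its name and every parameter that shaped it."""
    payload = json.dumps({"node": node, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def smooth(signal: list, sigma: float) -> list:
    """Stand-in node: 3-point Gaussian-weighted average with clamped edges."""
    w = math.exp(-1.0 / (2.0 * sigma**2))  # neighbour weight; centre weight is 1
    last = len(signal) - 1
    return [
        (w * signal[max(i - 1, 0)] + x + w * signal[min(i + 1, last)]) / (1.0 + 2.0 * w)
        for i, x in enumerate(signal)
    ]


def run_smooth(signal: list, sigma: float, cache: dict) -> list:
    """Return the cached output if this exact (input, sigma) pair was seen before."""
    key = cache_key("smooth", {"sigma": sigma, "signal": signal})
    if key not in cache:
        cache[key] = smooth(signal, sigma)
    return cache[key]


def main(argv=None) -> list:
    parser = argparse.ArgumentParser(description="Re-run the toy pipeline.")
    parser.add_argument("--sigma", type=float, default=1.0)
    args = parser.parse_args(argv)
    cache = {}  # a real pipeline would persist this on disk
    result = run_smooth([0.0, 1.0, 0.0], args.sigma, cache)
    print(f"sigma={args.sigma}: {result}")
    return result
```

Calling `main(["--sigma", "0.5"])` and `main(["--sigma", "1.0"])` exercises the same graph with different parameters, which is roughly what the reviewer question boils down to.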
3. Everything goes through MLflow, even the boring stuff
Reconstruction runs, ablations, even the toy figures for talks — all of it logs to a single MLflow tracking server. Tags include the git SHA, the input dataset hash, and a free-text note. The note is the thing future-me thanks me for the most.
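A sketch of how those three tags could be assembled before a run starts. The tag keys (`git_sha`, `dataset_sha256`, `note`) and the helper names are assumptions chosen for this example, and the block sticks to the standard library so it runs without a tracking server.

```python
import hashlib
import subprocess
from pathlib import Path


def current_git_sha(repo_dir: Path) -> str:
    """Ask git for the commit the working tree is on."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_tags(git_sha: str, dataset_path: Path, note: str) -> dict:
    """Collect the tags for one run; refuse to proceed without a note."""
    if not note.strip():
        raise ValueError("write a note; future-you will want it")
    return {
        "git_sha": git_sha,
        "dataset_sha256": file_sha256(dataset_path),
        "note": note.strip(),
    }
```

The resulting dict can be handed to `mlflow.set_tags(...)` inside an active run. Making the note mandatory is a design choice of this sketch, not something MLflow enforces.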
4. Tests where it actually matters
I don't unit-test every helper function. I do test:
- Loaders: round-tripping a known file should yield byte-identical output.
- Geometry: coordinate transforms have analytic answers; assert against them.
- Conservation laws: the things physics says shouldn't change, shouldn't.
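The three kinds of test above can be sketched with plain asserts. Everything here is a toy under stated assumptions: `save_and_load` stands in for a real loader, `rotate` for a real coordinate transform, and "rotation preserves length" for whatever invariant your physics actually gives you.

```python
import math
import tempfile
from pathlib import Path


def rotate(point: tuple, angle_rad: float) -> tuple:
    """Rotate a 2D point counter-clockwise about the origin."""
    x, y = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y)


def save_and_load(data: bytes, path: Path) -> bytes:
    """Stand-in for a writer/loader pair."""
    path.write_bytes(data)
    return path.read_bytes()


def test_loader_round_trip():
    data = bytes(range(256))  # every possible byte value once
    with tempfile.TemporaryDirectory() as tmp:
        assert save_and_load(data, Path(tmp) / "known.bin") == data


def test_rotation_matches_analytic_answer():
    # Rotating (1, 0) by 90 degrees must give (0, 1).
    x, y = rotate((1.0, 0.0), math.pi / 2)
    assert math.isclose(x, 0.0, abs_tol=1e-12)
    assert math.isclose(y, 1.0, abs_tol=1e-12)


def test_rotation_conserves_length():
    # A 3-4-5 triangle: the length must stay 5 at every angle.
    for k in range(8):
        rx, ry = rotate((3.0, 4.0), k * math.pi / 4)
        assert math.isclose(math.hypot(rx, ry), 5.0, rel_tol=1e-12)
```

The functions follow the `test_` naming convention, so a runner such as pytest would pick them up, and they also work when called directly.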
5. The cluster is not your laptop
Anything longer than 30 seconds runs on Slurm. Anything that runs on Slurm runs in a Docker image with a pinned digest. Anything in that image is built from a Dockerfile in the same repo as the code. There is no "I'll just install this one thing on the login node real quick."
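One cheap way to enforce the pinned-digest rule is to reject any image reference that is not of the form `name@sha256:<64 hex characters>` before a job is submitted. The sketch below assumes that check happens in a small Python helper; the function name and the example registry path are hypothetical.

```python
import re

# An image reference pinned by digest ends in "@sha256:" plus 64 lowercase hex characters.
_PINNED = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True only for references that name an exact image digest."""
    return _PINNED.fullmatch(image_ref) is not None


def require_pinned(image_ref: str) -> str:
    """Fail fast on tags like ':latest', which can silently change underneath you."""
    if not is_pinned_by_digest(image_ref):
        raise ValueError(f"image is not pinned by digest: {image_ref!r}")
    return image_ref
```

For example, `require_pinned("registry.example.org/lab/recon:latest")` raises, while the same name followed by `@sha256:` and a full digest passes through unchanged.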
What I'd do differently
I started caring about reproducibility too late — somewhere around month 14 — and I paid for it during my qualifying-exam prep, when I had to rebuild a six-month-old result from memory. If you're starting a PhD next fall, do the boring infrastructure work in your first month. You will never regret it.
— Cedric