How I structure a PhD-scale data pipeline

Three years into my PhD, I've finally landed on a data setup that survives both reviewer #2 and the 2 a.m. cluster crash. Like most things in research, it's the third or fourth iteration, and the first two were embarrassingly bad. I'm writing this down partly so I don't forget, and partly so the next first-year in my group has something better to inherit than a Slack DM that says "oh, just look at how I did it for the last paper."

The whole thing is built around one principle: a result you can't re-derive in a single command is a result you can't trust. Everything below is in service of that.

1. Raw data is sacred and read-only

The single biggest mistake I see new students make is editing raw data files in place — cropping, renaming, "fixing" header fields. By the time you publish, you can't tell what came off the microscope and what you smoothed away on a Tuesday afternoon eight months ago.

My rule: raw/ is mounted read-only. Always. If a transformation is needed, it lives in code that reads from raw/ and writes to derived/. The transform is committed to git. The output goes to a content-addressed cache.

project/
├── raw/         # read-only. don't touch.
├── derived/     # generated. throwaway.
├── analysis/    # notebooks & scripts
├── figures/     # paper-ready plots
└── pipelines/   # the actual transforms
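To make "content-addressed" concrete, here is a minimal sketch using only the standard library. The names `file_digest` and `cached_transform`, the `version` string, and the two-character fan-out directory are illustrative assumptions, not the layout of any particular tool. The idea is that the output path is derived from a hash of the raw input's bytes plus the transform version, so the same input and the same code always land on the same file.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in chunks so large raw files are fine."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_transform(raw_path: Path, derived_dir: Path, transform, version: str) -> Path:
    """Run transform(bytes) -> bytes once per (input content, transform version).

    `version` is a hypothetical label (for example a git SHA) that changes
    whenever the transform code changes.
    """
    key = hashlib.sha256(f"{file_digest(raw_path)}:{version}".encode()).hexdigest()
    out_path = derived_dir / key[:2] / key
    if out_path.exists():
        return out_path  # cache hit: nothing is recomputed
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = out_path.with_suffix(".tmp")
    tmp.write_bytes(transform(raw_path.read_bytes()))
    tmp.replace(out_path)  # rename into place so a crash never leaves a partial file
    return out_path
```

Because raw/ is only ever opened for reading here, the sketch works unchanged on a read-only mount. Deleting derived/ costs compute time and nothing else.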

2. One pipeline, many entry points

I used to have a Snakefile, a Makefile, and a folder of bash scripts that all did adjacent things. Now I have one DAG, written with Hamilton for the in-Python parts and Snakemake for the cluster-y parts.

The win is that any cached intermediate is queryable as a node. When a reviewer asks "what if you re-ran this with σ=0.5 instead of 1.0?", the answer is one CLI flag, not a half-day of detective work.

The graph is the documentation. If you can't draw your data flow as a DAG, you don't actually understand it yet.

3. Everything goes through MLflow, even the boring stuff

Reconstruction runs, ablations, even the toy figures for talks — all of it logs to a single MLflow tracking server. Tags include the git SHA, the input dataset hash, and a free-text note. The note is the thing future-me thanks me for the most.
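The three tags can be assembled with the standard library alone. The sketch below is one possible shape, with assumed names throughout: the tag keys (`git_sha`, `dataset_sha256`, `note`) and the helper functions are hypothetical, and the fallback to "unknown" outside a git checkout is a choice made for the example.

```python
import hashlib
import subprocess
from pathlib import Path


def git_sha() -> str:
    """Current commit SHA, or 'unknown' when not inside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def dataset_hash(path: Path) -> str:
    """SHA-256 of the input dataset file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_tags(dataset: Path, note: str) -> dict[str, str]:
    """Build the tag dictionary for one run; an empty note is rejected."""
    if not note.strip():
        raise ValueError("a free-text note is required")
    return {
        "git_sha": git_sha(),
        "dataset_sha256": dataset_hash(dataset),
        "note": note.strip(),
    }
```

The resulting dictionary can be handed to `mlflow.set_tags(...)` inside an active run. Refusing an empty note is deliberate: the note is the only tag that can't be reconstructed later.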

4. Tests where it actually matters

I don't unit-test every helper function. I do test:

5. The cluster is not your laptop

Anything longer than 30 seconds runs on Slurm. Anything that runs on Slurm runs in a Docker image with a pinned digest. Anything in that image is built from a Dockerfile in the same repo as the code. There is no "I'll just install this one thing on the login node real quick."
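One cheap way to enforce the pinned-digest rule is a check that runs before any job is submitted. This is a minimal sketch, and the function name `require_pinned` and the example registry path are made up for illustration. It accepts only references of the form `name@sha256:` followed by 64 lowercase hex characters, and rejects bare names and mutable tags such as `:latest`.

```python
import re

# name (optionally with a tag), then a full sha256 digest
_DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def require_pinned(image: str) -> str:
    """Return the image reference unchanged if it is pinned by digest, else raise."""
    if not _DIGEST_RE.match(image):
        raise ValueError(
            f"image {image!r} is not pinned; expected name@sha256:<64 hex chars>"
        )
    return image
```

For example, `require_pinned("registry.example.org/lab/recon@sha256:" + "a" * 64)` passes, while `require_pinned("lab/recon:latest")` raises. Calling it from whatever script writes the Slurm submission means an unpinned image fails in seconds on your side instead of producing a result nobody can rebuild.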

What I'd do differently

I started caring about reproducibility too late — somewhere around month 14 — and I paid for it during my qualifying-exam prep, when I had to rebuild a six-month-old result from memory. If you're starting a PhD next fall, do the boring infrastructure work in your first month. You will never regret it.

— Cedric