There's a failure mode every bioinformatician has hit at least once.
You're running a WGS cohort through GATK HaplotypeCaller. Halfway through, the cluster runs out of
memory, a node dies, or someone accidentally kills the job. You fix the issue and rerun. The pipeline
"completes." Variant calls look off. You spend two days debugging before realizing the BAM you were
calling variants on was a concatenation of a partial previous run and the new one.
That's an idempotency failure. And it's almost never discussed in genomics pipeline engineering conversations,
even though it's responsible for a significant fraction of "mysterious" pipeline bugs.
What Idempotency Actually Means in a Genomics Context
In mathematics, a function f is idempotent if applying it twice gives the same result as applying it once:
f(f(x)) = f(x)
In genomics pipeline engineering, the practical definition is simpler: running a step twice with the same
inputs produces the same output, and doesn't corrupt anything.
This sounds obvious. It isn't. Most genomics pipelines we build are
idempotent in the happy path, when everything runs clean from scratch. They break idempotency exactly
when you need it most: during reruns after partial failures.
The three canonical failure modes are:
- Append-on-rerun. A step writes to an output file that already exists from a partial run. Instead of overwriting, it appends. Your variants.vcf now has duplicate header lines. VCF parsers downstream either crash or silently misparse records, and you have no idea which one is happening.
- Stale-output skip. Your pipeline or workflow manager detects an output file exists and skips the step entirely. The output file is empty, truncated, or from a different input version. The pipeline proceeds on corrupted data with no warning.
- Partial-write corruption. A step writes to a temp file, intending to move it to the final location once it is complete. The process dies mid-write, so the temp file is left behind, partially written. On rerun, the step finds the leftover file, assumes it is a valid output, and skips the work.
Each of these is insidious for the same reason: the pipeline doesn't error. It continues. You find out
later, sometimes much later, sometimes after clinical reporting.
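The append-on-rerun case is the easiest of the three to reproduce. The sketch below is a toy illustration, not taken from any real pipeline: the two writer functions and the single made-up VCF record are assumptions for the example. It runs each writer twice, as a rerun would, and counts the header lines that end up in the file.

```python
import tempfile
from pathlib import Path
from typing import Tuple

HEADER = "##fileformat=VCFv4.2\n"
RECORD = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"  # made-up record, for illustration only


def write_vcf_append(path: Path) -> None:
    """Non-idempotent: every rerun adds another header and another record."""
    with open(path, "a") as fh:
        fh.write(HEADER + RECORD)


def write_vcf_overwrite(path: Path) -> None:
    """Idempotent: a rerun replaces whatever a previous attempt left behind."""
    with open(path, "w") as fh:
        fh.write(HEADER + RECORD)


def header_count(path: Path) -> int:
    """Count '##fileformat' lines; a well-formed VCF has exactly one."""
    return sum(
        line.startswith("##fileformat") for line in path.read_text().splitlines()
    )


def demo() -> Tuple[int, int]:
    """Run each writer twice and return (appended headers, overwritten headers)."""
    with tempfile.TemporaryDirectory() as tmp:
        appended = Path(tmp) / "appended.vcf"
        overwritten = Path(tmp) / "overwritten.vcf"
        for _ in range(2):  # first run, then a rerun with the same inputs
            write_vcf_append(appended)
            write_vcf_overwrite(overwritten)
        return header_count(appended), header_count(overwritten)
```

Neither writer raises an error on the second run, which is the point: only the file contents reveal that the append version is no longer a valid VCF.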
Why GATK Steps Are Especially Vulnerable
GATK tools have their own opinions about output handling that interact with naive pipeline logic.
- HaplotypeCaller and index desynchronization. If a run dies mid-write to a GVCF, the index file (.tbi) can become out of sync with the data file. On rerun, if your pipeline checks for file existence but not integrity, you'll feed a corrupt .g.vcf.gz into joint genotyping and get wrong results with no error message.
- GenomicsDBImport is particularly unforgiving. It writes to a workspace directory rather than a single file. If that workspace exists from a partial run, GenomicsDBImport will either fail with a cryptic error or, worse, attempt to merge into the partial state. Most pipeline implementations don't handle this automatically. They leave cleanup as a manual step, which gets skipped under pressure.
- The BQSR two-step trap. BaseRecalibrator generates a recalibration table; ApplyBQSR consumes it. If BaseRecalibrator completes but ApplyBQSR fails, you have a valid recal table from run 1. On rerun, if your logic finds the recal table and skips BaseRecalibrator but reruns ApplyBQSR against an updated input BAM, you silently apply recalibration computed from the wrong input. The results aren't obviously wrong. They're subtly wrong.
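One way to close the BQSR trap is to make the recalibration table carry a record of the BAM it was computed from, and to reuse it only when that record still matches. The sketch below is a minimal illustration under assumptions: the `.provenance.json` sidecar name and both function names are hypothetical, not part of GATK, and hashing a whole-genome BAM takes real time, so a production version might prefer a cheaper fingerprint.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large BAMs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _sidecar_for(recal_table: Path) -> Path:
    # Hypothetical naming convention: sample.recal.table.provenance.json
    return recal_table.with_name(recal_table.name + ".provenance.json")


def record_provenance(recal_table: Path, input_bam: Path) -> None:
    """Call after BaseRecalibrator succeeds: remember which BAM produced the table."""
    payload = {"input_bam_sha256": sha256_of(input_bam)}
    _sidecar_for(recal_table).write_text(json.dumps(payload))


def recal_table_is_reusable(recal_table: Path, input_bam: Path) -> bool:
    """True only if the table exists and was computed from this exact BAM content."""
    sidecar = _sidecar_for(recal_table)
    if not (recal_table.exists() and sidecar.exists()):
        return False
    recorded = json.loads(sidecar.read_text()).get("input_bam_sha256")
    return recorded == sha256_of(input_bam)
```

With this in place, the rerun logic asks `recal_table_is_reusable(...)` instead of checking whether the table file exists, and reruns BaseRecalibrator whenever the answer is False.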
Every Step Should Be a Clean Room
The most useful way to think about idempotency in practice is to treat each pipeline step as a clean
room operation; it should behave identically whether the output directory is empty or contains files
from a previous run. This means three things:
- Remove before you write, not after. The instinct is to clean up after a step completes. But cleanup after a successful run doesn't help you when the failure happens mid-run. The cleanup needs to happen at the beginning of the step, before any work starts. If an output file exists from a previous attempt, delete it before writing the new one, not as a workaround, but as the designed behavior.
- Separate completion signals from output files. A file existing is not proof that a step completed successfully. An empty file, a truncated file, and a fully written file look identical to a script checking for file existence. The only reliable signal of clean completion is something you write explicitly at the end of a successful run, separate from the output itself. Most mature pipelines implement some form of this, whether it's a sentinel file, a checksum file, or a database entry marking the step done.
- Validate outputs before moving on. GATK provides ValidateVariants. Samtools provides quickcheck for BAMs. Most tools have some equivalent. The question of whether to use them isn't really about compute overhead. It's about whether you want to find corruption at the step that caused it, or three steps downstream when the error message is no longer traceable.
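The three rules above can be combined into one small wrapper. This is a sketch under assumptions, not a reference implementation: `run_clean_room_step`, the `.done` marker, and the `.tmp` suffix are hypothetical names, and `produce` and `validate` stand in for whatever actually runs the tool and checks its output (for example, a call out to ValidateVariants).

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def run_clean_room_step(
    final_output: Path,
    produce: Callable[[Path], None],
    validate: Callable[[Path], bool],
) -> bool:
    """Run one step so a rerun behaves like a first run.

    Returns True if the step ran, False if a previously completed run was reused.
    """
    done_marker = final_output.with_name(final_output.name + ".done")
    tmp_output = final_output.with_name(final_output.name + ".tmp")

    # Only the explicit marker counts as completion; a bare output file does not.
    if done_marker.exists() and final_output.exists():
        return False

    # Remove before write: clear everything a dead attempt may have left behind.
    # Directories are handled too, so a workspace-style output is also disposable.
    for leftover in (done_marker, tmp_output, final_output):
        if leftover.is_dir():
            shutil.rmtree(leftover)
        elif leftover.exists():
            leftover.unlink()

    produce(tmp_output)  # the tool writes to the temp path only
    if not validate(tmp_output):
        raise RuntimeError(f"validation failed for {tmp_output}")

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_output, final_output)
    done_marker.write_text("ok\n")  # written last, only after everything succeeded
    return True
```

The ordering is the design choice that matters: the marker is removed first and written last, so at no point can a partial output and a completion marker exist together.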
The Nextflow-Specific Caveat Most Teams Miss
Nextflow's -resume flag is powerful and it's one of the reasons Nextflow is so widely adopted in
genomics. But it's frequently misunderstood as an idempotency guarantee. It isn't.
-resume reuses cached task outputs based on a hash of the task's inputs and script. What it does not account for
is whether the output files were modified after caching, whether reference files were updated in place
under the same path, or whether the work directory was partially cleaned by a storage quota enforcement
job.
-resume is a performance optimization. It lets you skip work you've already done.
Idempotency is a correctness guarantee. It ensures that work you redo produces the same result. These
are different things, and conflating them is how teams end up trusting cached outputs that are quietly
wrong.
The practical implication: -resume is safe when you can verify your work directories are intact and your
inputs haven't changed. When you can't verify that (after a mid-cohort failure following a storage incident,
for example), rerunning from scratch with clean output directories is often a more defensible choice than
resuming.
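The "can you verify it" question can be answered outside Nextflow with a plain checksum manifest. The sketch below is one possible approach, not a Nextflow feature: `write_manifest` and `changed_since_manifest` are hypothetical helpers that record content hashes for reference files and inputs at the start of a run, then report anything missing or altered before you decide whether to resume.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths: Iterable[Path], manifest: Path) -> None:
    """Record the content hash of every input at the start of a run."""
    entries = {str(p): sha256_of(p) for p in paths}
    manifest.write_text(json.dumps(entries, indent=2))


def changed_since_manifest(manifest: Path) -> List[str]:
    """Return the paths that are missing or whose content no longer matches."""
    recorded = json.loads(manifest.read_text())
    changed = []
    for name, digest in recorded.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != digest:
            changed.append(name)
    return changed
```

An empty list from `changed_since_manifest` supports resuming; a non-empty one is the signal to rerun the affected steps from clean directories.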
Where Idempotency Matters Most: The Clinical Context
In a research setting, idempotency failures are annoying and waste time. In a clinical setting, they
carry a different kind of risk.
Consider a rare disease diagnostic pipeline running trio
analysis. A partial failure during joint genotyping creates a corrupted joint VCF. The pipeline resumes,
ACMG classification runs on the corrupt output, and a variant that should have been flagged as Likely
Pathogenic is either missing or miscategorized. The report goes out. Nothing in the pipeline log
indicates a problem.
This is not a hypothetical failure mode. It's a realistic consequence of
non-idempotent pipelines operating under time pressure. The fix isn't to run more slowly or more
carefully. It's to build pipelines where this class of failure is structurally impossible: where every
step either completes cleanly or fails loudly, and where partial outputs never propagate downstream.
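"Fails loudly" can start with a very cheap check. BAM files and bgzip-compressed VCFs end with a fixed 28-byte empty BGZF block defined in the SAM specification, so a file that stopped mid-write usually lacks it. The sketch below is a truncation gate only, and `require_bgzf_eof` is a hypothetical helper name. It does not replace ValidateVariants or samtools quickcheck, and it cannot detect a file that is complete but wrong.

```python
from pathlib import Path

# The 28-byte empty BGZF block that terminates BAM and bgzip-compressed files,
# as given in the SAM/BAM format specification.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 "
    "02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def require_bgzf_eof(path: Path) -> None:
    """Raise if the file does not end with the BGZF end-of-file marker."""
    if path.stat().st_size < len(BGZF_EOF):
        raise RuntimeError(f"{path}: too small to be a complete BGZF file")
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), 2)  # whence=2: position relative to end of file
        tail = fh.read()
    if tail != BGZF_EOF:
        raise RuntimeError(f"{path}: missing BGZF EOF marker, likely truncated")
```

Calling this on the joint VCF before classification starts turns the silent scenario above into a hard stop at the step boundary.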
The Operational Payoff
Idempotent pipelines change the failure recovery calculus entirely. When a step fails mid-cohort, the
correct action becomes simple: fix the issue, rerun from that step. With genuine idempotency, you're
confident the outputs are clean without auditing individual files. Without it, every rerun requires a
manual review of what completed, what's partial, and what's safe to carry forward.
The gap between a 2-hour fix and a 2-day investigation isn't hypothetical. It compounds across a cohort:
50 samples, 8 pipeline steps each, a few partial failures per month. That's hundreds of hours of debugging
time per year that largely disappear when your pipeline is built with idempotency as a first-class design
requirement.
The patterns that accomplish this (atomic writes, explicit completion signals, output validation at step
boundaries, treating GenomicsDBImport workspaces as disposable) aren't complex to implement. They're
just rarely prioritized until after the first major incident. Build them in before that incident
happens.