
Input Pipeline and Sampling

Input pipelines are where many “working” training systems become incorrect or slow.

flowchart LR
  A[Source files / table] --> B[Dataset abstraction]
  B --> C[Rank-aware sampler]
  C --> D[DataLoader workers]
  D --> E[CPU decode / transform]
  E --> F[Pinned host buffers]
  F --> G[GPU transfer]
  G --> H[Trainer step]
Throughput problems can happen at any stage; debugging starts by isolating where the queue empties.
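
One cheap way to isolate it is to time how long each step waits on the loader versus how long the step itself takes. A minimal sketch, assuming a train_step callable (both names here are illustrative, not from any framework):

import time

def timed_epoch(loader, train_step):
    # Split wall-clock time into "waiting on the input pipeline" vs. "computing".
    wait_time = 0.0
    compute_time = 0.0
    t = time.perf_counter()
    for batch in loader:
        fetched = time.perf_counter()
        wait_time += fetched - t            # blocked on the loader queue
        train_step(batch)                   # forward / backward / optimizer step
        t = time.perf_counter()
        compute_time += t - fetched         # time spent inside the step itself
    return wait_time, compute_time

On a GPU, add torch.cuda.synchronize() before each timestamp, otherwise asynchronous kernel launches make compute_time look smaller than it really is. If wait_time dominates, the problem is upstream of the trainer.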

The minimum correctness bar:

  • each rank sees a disjoint shard for a given epoch
  • the union of shards approximates the intended epoch definition
  • reshuffling is deterministic by (seed, epoch)
  • resume does not silently repeat or skip an unknown number of samples

from torch.utils.data.distributed import DistributedSampler


class ResumeAwareDistributedSampler(DistributedSampler):
    def __init__(self, dataset, num_replicas, rank, seed=0, consumed=0, **kwargs):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, seed=seed, **kwargs)
        # Number of samples this rank has already drawn in the current epoch.
        self.consumed = consumed

    def state_dict(self) -> dict[str, int]:
        return {"epoch": self.epoch, "consumed": self.consumed}

    def load_state_dict(self, state: dict[str, int]) -> None:
        self.epoch = state["epoch"]
        self.consumed = state["consumed"]

    def __iter__(self):
        indices = list(super().__iter__())
        # Skip whatever this rank already processed before the restart.
        start = min(self.consumed, len(indices))
        for index in indices[start:]:
            yield index
            self.consumed += 1
        # NOTE: the caller should reset `consumed` to 0 when a new epoch begins.

That code is not the only implementation. It is enough to show that sampler progress is part of recoverable state.
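
One way to wire that into a checkpoint, assuming the sampler above plus ordinary model and optimizer objects (the file name is a placeholder):

import torch

# Save: sampler progress rides along with model and optimizer state.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "sampler": sampler.state_dict(),   # {"epoch": ..., "consumed": ...}
    },
    "ckpt.pt",
)

# Resume: restore the sampler before iterating the DataLoader again.
state = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
sampler.load_state_dict(state["sampler"])

In this sketch the caller is still responsible for resetting consumed to zero when a fresh epoch starts and for calling set_epoch before each pass.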

The current official torch.utils.data docs are especially useful for interview-level nuance:

  • DataLoader still exposes prefetch_factor and persistent_workers, both of which matter when you discuss loader throughput.
  • persistent_workers=True keeps worker processes alive after a dataset pass, which is often useful when startup cost is non-trivial.
  • DistributedSampler.set_epoch() is still required at the beginning of each epoch if you want shuffling to change across epochs; otherwise the ordering repeats.
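
A sketch of how those knobs fit together in an epoch loop; the batch size, worker count, and prefetch depth are placeholders, not recommendations:

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,           # the sampler defines correctness (sharding, shuffle)
    num_workers=8,             # the loader settings define throughput
    prefetch_factor=4,         # batches each worker keeps queued ahead
    persistent_workers=True,   # keep workers alive across dataset passes
    pin_memory=True,           # page-locked buffers for faster host-to-device copies
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # without this, the shuffle order repeats every epoch
    for batch in loader:
        ...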

That gives you an easy staff-level line:

“The sampler contract and the loader contract are different. The sampler defines correctness; the loader settings define how efficiently I realize that plan.”

There are two honest designs for resuming after a failure:

| Strategy | Pros | Cons |
| --- | --- | --- |
| restart at epoch boundary | simple, reproducible | wastes work after failures mid-epoch |
| resume at sample offset | lower redo cost | harder to make correct with transforms, streaming, and worker prefetch |

In a short interview, say which one you are choosing and why. Do not imply they are equivalent.
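
The difference is visible in what the checkpoint has to carry. A rough sketch of the two payloads (field names are illustrative):

# Restart at epoch boundary: the epoch counter is the only pipeline state.
epoch_boundary_state = {"epoch": epoch}

# Resume at sample offset: the checkpoint must pin down enough state to
# reproduce the exact ordering and skip what each rank already consumed.
sample_offset_state = {
    "epoch": epoch,
    "seed": seed,          # fixes the shuffle for (seed, epoch)
    "consumed": consumed,  # samples already drawn on this rank
}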

flowchart TD
  A[GPU idle time rises] --> B{Why?}
  B --> C[Loader workers too slow]
  B --> D[Remote storage latency]
  B --> E[Expensive CPU transforms]
  B --> F[Small batch / excessive sync]
  C --> G[Increase workers, profile decode]
  D --> H[Cache, prefetch, local stage]
  E --> I[Vectorize or move work]
  F --> J[Adjust batch and overlap]
An interview answer improves dramatically once you can distinguish input starvation from distributed synchronization cost. The signals worth exposing:

  • samples/sec per rank
  • batch wait time before forward pass
  • queue depth at loader and prefetch boundaries
  • host-to-device copy latency
  • per-worker exception rate
  • cache hit rate if local staging exists

If you only expose loss and gpu_utilization, you cannot diagnose the pipeline.
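
Host-to-device copy latency in particular is cheap to measure with CUDA events once the loader uses pinned buffers. A sketch, assuming the loader yields tensors:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for batch in loader:
    start.record()
    # non_blocking=True only overlaps the copy when the source tensor is pinned
    batch = batch.to("cuda", non_blocking=True)
    end.record()
    end.synchronize()
    h2d_ms = start.elapsed_time(end)   # host-to-device copy latency in milliseconds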

For streaming or unbounded datasets, the semantics change:

  • epoch boundaries may be logical rather than physical
  • exactly-once sample guarantees are usually unrealistic
  • backpressure matters more than perfect shuffle quality
  • checkpointing often tracks cursor position and shard lease metadata instead of plain indices
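
For the last point, a minimal sketch of an iterable dataset that checkpoints a cursor rather than indices; read_shard and the shard list are placeholders:

from torch.utils.data import IterableDataset

class CursorStream(IterableDataset):
    """Streams records from a list of shards and exposes a resumable cursor."""

    def __init__(self, shards, cursor=0):
        self.shards = shards    # e.g. shard URIs leased to this worker
        self.cursor = cursor    # records already emitted across the shard list

    def state_dict(self):
        # At-least-once semantics: anything emitted since the last checkpointed
        # cursor may be replayed after a crash (bounded duplication).
        return {"cursor": self.cursor}

    def load_state_dict(self, state):
        self.cursor = state["cursor"]

    def __iter__(self):
        position = 0
        for shard in self.shards:
            for record in read_shard(shard):   # read_shard is a placeholder
                if position >= self.cursor:
                    yield record
                    self.cursor = position + 1
                position += 1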

This is a good place to sound senior:

“I would optimize for at-least-once ingestion with bounded duplication and strong observability instead of pretending a streaming pipeline can cheaply behave like a finite random-access dataset.”