
Input Pipeline and Sampling

Input pipelines are where many “working” training systems become incorrect or slow.

flowchart LR
  A[Source files / table] --> B[Dataset abstraction]
  B --> C[Rank-aware sampler]
  C --> D[DataLoader workers]
  D --> E[CPU decode / transform]
  E --> F[Pinned host buffers]
  F --> G[GPU transfer]
  G --> H[Trainer step]
Throughput problems can happen at any stage; debugging starts by isolating where the queue empties.
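
One cheap way to isolate it is to time how long each step waits on the loader versus how long the step itself takes. A minimal sketch, assuming a train_step callable (both names here are illustrative, not from any framework):

import time

def timed_epoch(loader, train_step):
    # Split wall-clock time into "waiting on the input pipeline" vs. "computing".
    wait_time = 0.0
    compute_time = 0.0
    t = time.perf_counter()
    for batch in loader:
        fetched = time.perf_counter()
        wait_time += fetched - t            # blocked on the loader queue
        train_step(batch)                   # forward / backward / optimizer step
        t = time.perf_counter()
        compute_time += t - fetched         # time spent inside the step itself
    return wait_time, compute_time

On a GPU, add torch.cuda.synchronize() before each timestamp, otherwise asynchronous kernel launches make compute_time look smaller than it really is. If wait_time dominates, the problem is upstream of the trainer.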

The minimum correctness bar:

  • each rank sees a disjoint shard for a given epoch
  • the union of shards approximates the intended epoch definition
  • reshuffling is deterministic by (seed, epoch)
  • resume does not silently repeat or skip an unknown number of samples

from torch.utils.data.distributed import DistributedSampler


class ResumeAwareDistributedSampler(DistributedSampler):
    def __init__(self, dataset, num_replicas, rank, seed=0, consumed=0, **kwargs):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, seed=seed, **kwargs)
        # Number of samples this rank has already drawn in the current epoch.
        self.consumed = consumed

    def state_dict(self) -> dict[str, int]:
        return {"epoch": self.epoch, "consumed": self.consumed}

    def load_state_dict(self, state: dict[str, int]) -> None:
        self.epoch = state["epoch"]
        self.consumed = state["consumed"]

    def __iter__(self):
        indices = list(super().__iter__())
        # Skip whatever this rank already processed before the restart.
        start = min(self.consumed, len(indices))
        for index in indices[start:]:
            yield index
            self.consumed += 1
        # NOTE: the caller should reset `consumed` to 0 when a new epoch begins.

That code is not the only implementation. It is enough to show that sampler progress is part of recoverable state.
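
One way to wire that into a checkpoint, assuming the sampler above plus ordinary model and optimizer objects (the file name is a placeholder):

import torch

# Save: sampler progress rides along with model and optimizer state.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "sampler": sampler.state_dict(),   # {"epoch": ..., "consumed": ...}
    },
    "ckpt.pt",
)

# Resume: restore the sampler before iterating the DataLoader again.
state = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
sampler.load_state_dict(state["sampler"])

In this sketch the caller is still responsible for resetting consumed to zero when a fresh epoch starts and for calling set_epoch before each pass.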

The current official torch.utils.data docs are especially useful for interview-level nuance:

  • DataLoader still exposes prefetch_factor and persistent_workers, both of which matter when you discuss loader throughput.
  • persistent_workers=True keeps worker processes alive after a dataset pass, which is often useful when startup cost is non-trivial.
  • DistributedSampler.set_epoch() is still required at the beginning of each epoch if you want shuffling to change across epochs; otherwise the ordering repeats.
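
A sketch of how those knobs fit together in an epoch loop; the batch size, worker count, and prefetch depth are placeholders, not recommendations:

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,           # the sampler defines correctness (sharding, shuffle)
    num_workers=8,             # the loader settings define throughput
    prefetch_factor=4,         # batches each worker keeps queued ahead
    persistent_workers=True,   # keep workers alive across dataset passes
    pin_memory=True,           # page-locked buffers for faster host-to-device copies
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # without this, the shuffle order repeats every epoch
    for batch in loader:
        ...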

That gives you an easy staff-level line:

“The sampler contract and the loader contract are different. The sampler defines correctness; the loader settings define how efficiently I realize that plan.”

There are two honest designs for resuming after a failure:

| Strategy | Pros | Cons |
| --- | --- | --- |
| restart at epoch boundary | simple, reproducible | wastes work after failures mid-epoch |
| resume at sample offset | lower redo cost | harder to make correct with transforms, streaming, and worker prefetch |

In a short interview, say which one you are choosing and why. Do not imply they are equivalent.
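
The difference is visible in what the checkpoint has to carry. A rough sketch of the two payloads (field names are illustrative):

# Restart at epoch boundary: the epoch counter is the only pipeline state.
epoch_boundary_state = {"epoch": epoch}

# Resume at sample offset: the checkpoint must pin down enough state to
# reproduce the exact ordering and skip what each rank already consumed.
sample_offset_state = {
    "epoch": epoch,
    "seed": seed,          # fixes the shuffle for (seed, epoch)
    "consumed": consumed,  # samples already drawn on this rank
}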

flowchart TD
  A[GPU idle time rises] --> B{Why?}
  B --> C[Loader workers too slow]
  B --> D[Remote storage latency]
  B --> E[Expensive CPU transforms]
  B --> F[Small batch / excessive sync]
  C --> G[Increase workers, profile decode]
  D --> H[Cache, prefetch, local stage]
  E --> I[Vectorize or move work]
  F --> J[Adjust batch and overlap]
An interview answer improves dramatically once you can distinguish input starvation from distributed synchronization cost. The signals worth exposing:

  • samples/sec per rank
  • batch wait time before forward pass
  • queue depth at loader and prefetch boundaries
  • host-to-device copy latency
  • per-worker exception rate
  • cache hit rate if local staging exists

If you only expose loss and gpu_utilization, you cannot diagnose the pipeline.
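
Host-to-device copy latency in particular is cheap to measure with CUDA events once the loader uses pinned buffers. A sketch, assuming the loader yields tensors:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

for batch in loader:
    start.record()
    # non_blocking=True only overlaps the copy when the source tensor is pinned
    batch = batch.to("cuda", non_blocking=True)
    end.record()
    end.synchronize()
    h2d_ms = start.elapsed_time(end)   # host-to-device copy latency in milliseconds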

For streaming or unbounded datasets, the semantics change:

  • epoch boundaries may be logical rather than physical
  • exactly-once sample guarantees are usually unrealistic
  • backpressure matters more than perfect shuffle quality
  • checkpointing often tracks cursor position and shard lease metadata instead of plain indices
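
For the last point, a minimal sketch of an iterable dataset that checkpoints a cursor rather than indices; read_shard and the shard list are placeholders:

from torch.utils.data import IterableDataset

class CursorStream(IterableDataset):
    """Streams records from a list of shards and exposes a resumable cursor."""

    def __init__(self, shards, cursor=0):
        self.shards = shards    # e.g. shard URIs leased to this worker
        self.cursor = cursor    # records already emitted across the shard list

    def state_dict(self):
        # At-least-once semantics: anything emitted since the last checkpointed
        # cursor may be replayed after a crash (bounded duplication).
        return {"cursor": self.cursor}

    def load_state_dict(self, state):
        self.cursor = state["cursor"]

    def __iter__(self):
        position = 0
        for shard in self.shards:
            for record in read_shard(shard):   # read_shard is a placeholder
                if position >= self.cursor:
                    yield record
                    self.cursor = position + 1
                position += 1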

This is a good place to sound senior:

“I would optimize for at-least-once ingestion with bounded duplication and strong observability instead of pretending a streaming pipeline can cheaply behave like a finite random-access dataset.”