Input Pipeline and Sampling
Input pipelines are where many “working” training systems become incorrect or slow.
The Data Path
flowchart LR
A[Source files / table] --> B[Dataset abstraction]
B --> C[Rank-aware sampler]
C --> D[DataLoader workers]
D --> E[CPU decode / transform]
E --> F[Pinned host buffers]
F --> G[GPU transfer]
G --> H[Trainer step]
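Each box in the diagram maps to a concrete PyTorch object. A minimal wiring sketch, assuming a map-style dataset over file paths; the `transform` callable is a stand-in for whatever decode work runs on CPU:

```python
import torch
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class FileBackedDataset(Dataset):
    """Dataset abstraction over source files."""

    def __init__(self, paths, transform):
        self.paths = paths
        self.transform = transform  # CPU decode / transform, runs inside workers

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return self.transform(self.paths[index])

def build_loader(dataset, rank, world_size):
    # Rank-aware sampler: each rank draws a disjoint shard of indices.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(
        dataset,
        batch_size=32,
        sampler=sampler,
        num_workers=4,     # decode/transform happens in these worker processes
        pin_memory=True,   # batches land in pinned host buffers
    )

# In the trainer step, pinned memory makes the host-to-device copy
# eligible for overlap with compute:
#   batch = batch.to("cuda", non_blocking=True)
```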
Deterministic Distributed Sampling
The minimum correctness bar:
- each rank sees a disjoint shard for a given epoch
- the union of shards approximates the intended epoch definition
- reshuffling is deterministic given (seed, epoch)
- resume does not silently repeat or skip an unknown number of samples
```python
from torch.utils.data import DistributedSampler

class ResumeAwareDistributedSampler(DistributedSampler):
    """DistributedSampler that can resume mid-epoch after a restart."""

    def __init__(self, dataset, num_replicas, rank, seed=0, consumed=0, **kwargs):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, seed=seed, **kwargs)
        self.consumed = consumed  # samples already yielded in the current epoch

    def state_dict(self) -> dict[str, int]:
        return {"epoch": self.epoch, "consumed": self.consumed}

    def load_state_dict(self, state: dict[str, int]) -> None:
        self.epoch = state["epoch"]
        self.consumed = state["consumed"]

    def __iter__(self):
        indices = list(super().__iter__())
        start = min(self.consumed, len(indices))  # skip what this rank already saw
        for index in indices[start:]:
            yield index
            self.consumed += 1
```

That code is not the only implementation. It is enough to show that sampler progress is part of recoverable state.
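One way to wire it into checkpointing, sketched under the assumption that you save model and optimizer state to the same file:

```python
import torch

def save_checkpoint(path, model, optimizer, sampler):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "sampler": sampler.state_dict(),  # (epoch, consumed) travel with the weights
    }, path)

def load_checkpoint(path, model, optimizer, sampler):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    sampler.load_state_dict(state["sampler"])  # resume skips already-consumed indices
```

One detail worth saying out loud: `consumed` must be reset to zero at each epoch boundary, alongside the `set_epoch` call, or the next epoch starts with a silent skip.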
Current PyTorch Notes
The current official torch.utils.data docs are especially useful for interview-level nuance:
- `DataLoader` still exposes `prefetch_factor` and `persistent_workers`, both of which matter when you discuss loader throughput.
- `persistent_workers=True` keeps worker processes alive after a dataset pass, which is often useful when startup cost is non-trivial.
- `DistributedSampler.set_epoch()` is still required at the beginning of each epoch if you want shuffling to change across epochs; otherwise the ordering repeats.
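Those notes combine into a loop shaped like the following sketch, reusing the `dataset` and `sampler` from earlier; `train_step` and `num_epochs` are placeholders, and the batch size and worker counts are illustrative rather than recommendations:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,
    num_workers=8,
    prefetch_factor=4,        # batches each worker keeps queued ahead
    persistent_workers=True,  # workers survive across epochs; startup cost paid once
    pin_memory=True,
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # without this, every epoch replays the same order
    for batch in loader:
        train_step(batch)
```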
That gives you an easy staff-level line:
“The sampler contract and the loader contract are different. The sampler defines correctness; the loader settings define how efficiently I realize that plan.”
A Staff-Level Resume Discussion
There are two honest designs:
| Strategy | Pros | Cons |
|---|---|---|
| restart at epoch boundary | simple, reproducible | wastes work after failures mid-epoch |
| resume at sample offset | lower redo cost | harder to make correct with transforms, streaming, and worker prefetch |
In a short interview, say which one you are choosing and why. Do not imply they are equivalent.
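For contrast with the offset-aware sampler above, the epoch-boundary strategy is almost trivially short, which is exactly its appeal; `last_completed_epoch` is assumed to come from the checkpoint, and `loader`, `sampler`, and `train_step` are the placeholders used earlier:

```python
# Epoch-boundary restart: discard any mid-epoch progress and
# replay the first unfinished epoch from the top.
start_epoch = last_completed_epoch + 1
for epoch in range(start_epoch, num_epochs):
    sampler.set_epoch(epoch)  # deterministic reshuffle per (seed, epoch)
    for batch in loader:
        train_step(batch)
```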
Throughput Pathologies
flowchart TD
A[GPU idle time rises] --> B{Why?}
B --> C[Loader workers too slow]
B --> D[Remote storage latency]
B --> E[Expensive CPU transforms]
B --> F[Small batch / excessive sync]
C --> G[Increase workers, profile decode]
D --> H[Cache, prefetch, local stage]
E --> I[Vectorize or move work]
F --> J[Adjust batch and overlap]
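A crude but effective way to decide which branch of that tree you are in is to time how long the loop waits on the loader versus how long the step computes. A sketch, assuming a CUDA device and a loader that yields at least `steps` batches:

```python
import time
import torch

def profile_loop(loader, train_step, steps=100):
    wait_time = compute_time = 0.0
    batches = iter(loader)
    for _ in range(steps):
        t0 = time.perf_counter()
        batch = next(batches)           # blocks here if workers are behind
        wait_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        train_step(batch)
        torch.cuda.synchronize()        # flush async kernels before reading the clock
        compute_time += time.perf_counter() - t0

    # A high wait share points at the loader/storage/transform branches;
    # a high compute share points at batch size and synchronization.
    print(f"wait {wait_time:.2f}s, compute {compute_time:.2f}s over {steps} steps")
```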
Metrics That Actually Help
- samples/sec per rank
- batch wait time before forward pass
- queue depth at loader and prefetch boundaries
- host-to-device copy latency
- per-worker exception rate
- cache hit rate if local staging exists
If you only expose `loss` and `gpu_utilization`, you cannot diagnose the pipeline.
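Host-to-device copy latency in particular is better measured with CUDA events than wall clocks, since the copy is asynchronous. A sketch for a single pinned tensor:

```python
import torch

def h2d_latency_ms(batch: torch.Tensor) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    batch.to("cuda", non_blocking=True)  # pinned source => truly async copy
    end.record()
    end.synchronize()                    # wait until the copy has finished
    return start.elapsed_time(end)       # elapsed milliseconds on the device
```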
Streaming Datasets
For streaming or unbounded datasets, the semantics change:
- epoch boundaries may be logical rather than physical
- exactly-once sample guarantees are usually unrealistic
- backpressure matters more than perfect shuffle quality
- checkpointing often tracks cursor position and shard lease metadata instead of plain indices
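A minimal sketch of what cursor-style checkpointing can look like for an `IterableDataset`; the shard list and the `read_records` reader are hypothetical stand-ins for your storage layer:

```python
from torch.utils.data import IterableDataset

class CursorStreamDataset(IterableDataset):
    """Streams records shard by shard while tracking a resumable cursor."""

    def __init__(self, shards, read_records, cursor=(0, 0)):
        self.shards = shards              # ordered shard identifiers
        self.read_records = read_records  # hypothetical: shard id -> iterable of records
        self.shard_idx, self.offset = cursor

    def state_dict(self):
        return {"shard_idx": self.shard_idx, "offset": self.offset}

    def __iter__(self):
        for shard_idx in range(self.shard_idx, len(self.shards)):
            offset = self.offset if shard_idx == self.shard_idx else 0
            for i, record in enumerate(self.read_records(self.shards[shard_idx])):
                if i < offset:
                    continue  # skip records consumed before the restart
                yield record
                # At-least-once: a crash between this yield and the next
                # checkpoint re-emits the record on resume.
                self.shard_idx, self.offset = shard_idx, i + 1
```

Note that with `num_workers > 0` the cursor lives in the worker's copy of the dataset, which is exactly why streaming checkpointing usually needs a side channel from workers back to the trainer.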
This is a good place to sound senior:
“I would optimize for at-least-once ingestion with bounded duplication and strong observability instead of pretending a streaming pipeline can cheaply behave like a finite random-access dataset.”