Parallelism Playbook
Candidates often list parallelism strategies. Senior candidates explain when each strategy becomes the least bad option.
Start With DDP
Section titled “Start With DDP”flowchart LR A[Full model on rank 0] --> B[Forward] C[Full model on rank 1] --> D[Forward] B --> E[Backward] D --> F[Backward] E --> G[All-reduce gradients] F --> G G --> H[Optimizer step on every rank]
DDP is usually the right interview default because:
- it matches common production practice
- it keeps failure discussion legible
- it isolates the first-order network cost to gradient synchronization
- it gives you a clean path to sampler design and global batch math
Where NCCL Fits
Section titled “Where NCCL Fits”For CUDA training, NCCL is usually the communication layer sitting underneath PyTorch process groups and collectives.
- DDP uses NCCL-style collectives to keep replicated model gradients in sync.
- FSDP uses a different collective mix because parameters, gradients, and optimizer state are sharded instead of fully replicated.
- When a training step goes communication-bound, you are usually reasoning about NCCL collective cost, overlap, topology, and payload size rather than just “PyTorch being slow.”
The shortest accurate mental model is:
NCCL is the GPU transport and collective engine; DDP and FSDP are the training semantics built on top of those collectives.
NCCL Collectives You Should Be Able To Narrate
Section titled “NCCL Collectives You Should Be Able To Narrate”All-reduce
Section titled “All-reduce”
flowchart LR
subgraph Inputs[Before]
A0[GPU0: 1]
A1[GPU1: 2]
A2[GPU2: 3]
A3[GPU3: 4]
end
X((sum = 10))
subgraph Outputs[After]
B0[GPU0: 10]
B1[GPU1: 10]
B2[GPU2: 10]
B3[GPU3: 10]
end
A0 --> X
A1 --> X
A2 --> X
A3 --> X
X --> B0
X --> B1
X --> B2
X --> B3
This is the collective most people should associate with classic DDP.
- each rank contributes its local gradient bucket
- NCCL reduces the bucket across ranks
- every rank receives the same reduced value
- each rank can then apply the same optimizer step locally
Reduce
Section titled “Reduce”
flowchart LR
subgraph Inputs[Before]
A0[GPU0: 1]
A1[GPU1: 2]
A2[GPU2: 3]
A3[GPU3: 4]
end
X((sum = 10))
B0[Root GPU0: 10]
A0 --> X
A1 --> X
A2 --> X
A3 --> X
X --> B0
Use this when one rank needs the combined answer and the others do not.
Reduce-scatter
Section titled “Reduce-scatter”
flowchart LR
subgraph Inputs[Before]
A0[GPU0: [1, 0, 0, 0]]
A1[GPU1: [0, 2, 0, 0]]
A2[GPU2: [0, 0, 3, 0]]
A3[GPU3: [0, 0, 0, 4]]
end
X((reduced vector = [1, 2, 3, 4]))
subgraph Outputs[After]
B0[GPU0 slice: [1]]
B1[GPU1 slice: [2]]
B2[GPU2 slice: [3]]
B3[GPU3 slice: [4]]
end
A0 --> X
A1 --> X
A2 --> X
A3 --> X
X --> B0
X --> B1
X --> B2
X --> B3
This is the collective to associate with sharded training systems.
- all ranks contribute partial values
- the reduction happens globally
- each rank keeps only its shard of the reduced tensor
- that reduces memory traffic relative to broadcasting the full result everywhere
All-gather
Section titled “All-gather”
flowchart LR
subgraph Inputs[Before]
A0[GPU0 slice: [1]]
A1[GPU1 slice: [2]]
A2[GPU2 slice: [3]]
A3[GPU3 slice: [4]]
end
X((gather slices))
subgraph Outputs[After]
B0[GPU0: [1, 2, 3, 4]]
B1[GPU1: [1, 2, 3, 4]]
B2[GPU2: [1, 2, 3, 4]]
B3[GPU3: [1, 2, 3, 4]]
end
A0 --> X
A1 --> X
A2 --> X
A3 --> X
X --> B0
X --> B1
X --> B2
X --> B3
This is the inverse mental pattern of reduce-scatter: no arithmetic, just reassembly.
How DDP Uses NCCL
Section titled “How DDP Uses NCCL”flowchart LR A[Rank-local forward] --> B[Rank-local backward] B --> C[NCCL all-reduce on gradient buckets] C --> D[Every rank has the same averaged gradients] D --> E[Optimizer step on every rank]
The practical DDP story is:
- parameters are replicated on every rank
- activations and optimizer step logic are local to each rank
- gradient buckets are synchronized with all-reduce
- if all-reduce dominates the backward tail, you are now communication-bound
That is why DDP is so teachable: one process per GPU, one full model replica per process, and one dominant collective pattern.
When DDP Stops Being Enough
Section titled “When DDP Stops Being Enough”| Symptom | Interpretation | Next move |
|---|---|---|
| model does not fit in device memory | parameter + optimizer state footprint dominates | consider FSDP or ZeRO-style sharding |
| all-reduce dominates step time | communication is the bottleneck | tune bucket sizing, overlap, topology awareness, or reduce model/data split pressure |
| activation memory spikes | forward graph is too large for local device budget | activation checkpointing, sequence parallelism, or pipeline partitioning |
| one stage idles while another computes | work is unevenly partitioned | rebalance stages or simplify topology |
DDP vs FSDP
Section titled “DDP vs FSDP”def wrap_model(model: nn.Module, cfg: TrainConfig) -> nn.Module: if cfg.parallelism == "ddp": return torch.nn.parallel.DistributedDataParallel( model, device_ids=[cfg.local_rank], output_device=cfg.local_rank, gradient_as_bucket_view=True, ) if cfg.parallelism == "fsdp": return FSDP( model, auto_wrap_policy=size_based_auto_wrap_policy, mixed_precision=cfg.mixed_precision_policy, sharding_strategy=ShardingStrategy.FULL_SHARD, ) raise ValueError(f"Unsupported mode: {cfg.parallelism}")What to say out loud
Section titled “What to say out loud”- DDP replicates parameters on every rank and is easier to debug.
- FSDP shards parameters across data-parallel workers; the current docs describe it exactly that way.
- FSDP reduces memory pressure but shifts complexity into wrap policy, state-dict handling, checkpoint formats, and performance tuning.
- In a live interview, choosing DDP first is usually more correct than prematurely optimizing into a harder failure model.
How FSDP Changes The Communication Pattern
Section titled “How FSDP Changes The Communication Pattern”flowchart LR A[Parameter shards live on different ranks] --> B[All-gather full params for active module] B --> C[Forward and backward for that module] C --> D[Reduce-scatter gradients back into shards] D --> E[Sharded optimizer state update]
That leads to the clean interview contrast:
- DDP mostly feels like replicated compute plus gradient all-reduce
- FSDP feels like sharded state plus all-gather and reduce-scatter around compute boundaries
- both still depend on the same underlying collective primitives, but the traffic pattern is very different
Hybrid Parallelism
Section titled “Hybrid Parallelism”flowchart TD A[Node 0] --> B[Tensor Parallel Group 0] A --> C[Tensor Parallel Group 1] D[Node 1] --> E[Tensor Parallel Group 0] D --> F[Tensor Parallel Group 1] B --> G[Pipeline Stage 0] C --> G E --> H[Pipeline Stage 1] F --> H G --> I[Data Parallel Replica 0] H --> J[Data Parallel Replica 1]
This is where staff-level language matters:
- Tensor parallelism trades communication for larger layer capacity.
- Pipeline parallelism trades bubble overhead and scheduling complexity for model fit.
- Data parallelism trades replicated state for implementation simplicity.
The wrong answer is “use all of them for large models.” The right answer is “introduce only the extra axis needed to eliminate the current bottleneck.”
Current PyTorch Notes
Section titled “Current PyTorch Notes”The current FSDP docs still position FSDP as a sharding wrapper for data-parallel workers, while current DDP docs still emphasize that DDP itself does not partition input data. Together, that leads to a clean interview distinction:
- DDP: replication + gradient sync
- FSDP: sharding + more state-management complexity
- sampler / loader: still your responsibility either way
Choosing A Strategy
Section titled “Choosing A Strategy”| Scenario | Best first choice | Why |
|---|---|---|
| medium model, commodity cluster | DDP | minimal operational complexity |
| model barely exceeds device memory | DDP + activation checkpointing | cheapest complexity increase |
| model substantially exceeds device memory | FSDP | memory savings without full hybrid topology |
| enormous model, dedicated infra | FSDP + tensor/pipeline parallelism | necessary but operationally heavier |
Trick Question: “Why Not Just Increase Batch Size?”
Section titled “Trick Question: “Why Not Just Increase Batch Size?””Because increasing batch size is not a generic scaling fix.
- It may change optimization behavior.
- It may increase activation memory.
- It may mask data-loader starvation without fixing it.
- It may raise communication payloads if gradient accumulation is not used carefully.