Skip to content

Parallelism Playbook

Candidates often list parallelism strategies. Senior candidates explain when each strategy becomes the least bad option.

flowchart LR
  A[Full model on rank 0] --> B[Forward]
  C[Full model on rank 1] --> D[Forward]
  B --> E[Backward]
  D --> F[Backward]
  E --> G[All-reduce gradients]
  F --> G
  G --> H[Optimizer step on every rank]
Data parallelism is the most teachable baseline because compute is local and synchronization happens at gradient boundaries.

DDP is usually the right interview default because:

  • it matches common production practice
  • it keeps failure discussion legible
  • it isolates the first-order network cost to gradient synchronization
  • it gives you a clean path to sampler design and global batch math

For CUDA training, NCCL is usually the communication layer sitting underneath PyTorch process groups and collectives.

  • DDP uses NCCL-style collectives to keep replicated model gradients in sync.
  • FSDP uses a different collective mix because parameters, gradients, and optimizer state are sharded instead of fully replicated.
  • When a training step goes communication-bound, you are usually reasoning about NCCL collective cost, overlap, topology, and payload size rather than just “PyTorch being slow.”

The shortest accurate mental model is:

NCCL is the GPU transport and collective engine; DDP and FSDP are the training semantics built on top of those collectives.

NCCL Collectives You Should Be Able To Narrate

Section titled “NCCL Collectives You Should Be Able To Narrate”
flowchart LR
  subgraph Inputs[Before]
      A0[GPU0: 1]
      A1[GPU1: 2]
      A2[GPU2: 3]
      A3[GPU3: 4]
  end
  X((sum = 10))
  subgraph Outputs[After]
      B0[GPU0: 10]
      B1[GPU1: 10]
      B2[GPU2: 10]
      B3[GPU3: 10]
  end
  A0 --> X
  A1 --> X
  A2 --> X
  A3 --> X
  X --> B0
  X --> B1
  X --> B2
  X --> B3
All-reduce combines values from every GPU and returns the reduced result to every GPU.

This is the collective most people should associate with classic DDP.

  • each rank contributes its local gradient bucket
  • NCCL reduces the bucket across ranks
  • every rank receives the same reduced value
  • each rank can then apply the same optimizer step locally
flowchart LR
  subgraph Inputs[Before]
      A0[GPU0: 1]
      A1[GPU1: 2]
      A2[GPU2: 3]
      A3[GPU3: 4]
  end
  X((sum = 10))
  B0[Root GPU0: 10]
  A0 --> X
  A1 --> X
  A2 --> X
  A3 --> X
  X --> B0
Reduce combines values across all GPUs but only the designated root GPU receives the reduced result.

Use this when one rank needs the combined answer and the others do not.

flowchart LR
  subgraph Inputs[Before]
      A0[GPU0: [1, 0, 0, 0]]
      A1[GPU1: [0, 2, 0, 0]]
      A2[GPU2: [0, 0, 3, 0]]
      A3[GPU3: [0, 0, 0, 4]]
  end
  X((reduced vector = [1, 2, 3, 4]))
  subgraph Outputs[After]
      B0[GPU0 slice: [1]]
      B1[GPU1 slice: [2]]
      B2[GPU2 slice: [3]]
      B3[GPU3 slice: [4]]
  end
  A0 --> X
  A1 --> X
  A2 --> X
  A3 --> X
  X --> B0
  X --> B1
  X --> B2
  X --> B3
Reduce-scatter first reduces across GPUs, then scatters one slice of the reduced result to each GPU.

This is the collective to associate with sharded training systems.

  • all ranks contribute partial values
  • the reduction happens globally
  • each rank keeps only its shard of the reduced tensor
  • that reduces memory traffic relative to broadcasting the full result everywhere
flowchart LR
  subgraph Inputs[Before]
      A0[GPU0 slice: [1]]
      A1[GPU1 slice: [2]]
      A2[GPU2 slice: [3]]
      A3[GPU3 slice: [4]]
  end
  X((gather slices))
  subgraph Outputs[After]
      B0[GPU0: [1, 2, 3, 4]]
      B1[GPU1: [1, 2, 3, 4]]
      B2[GPU2: [1, 2, 3, 4]]
      B3[GPU3: [1, 2, 3, 4]]
  end
  A0 --> X
  A1 --> X
  A2 --> X
  A3 --> X
  X --> B0
  X --> B1
  X --> B2
  X --> B3
All-gather collects a shard from every GPU and reconstructs the full tensor on every GPU without reducing values.

This is the inverse mental pattern of reduce-scatter: no arithmetic, just reassembly.

flowchart LR
  A[Rank-local forward] --> B[Rank-local backward]
  B --> C[NCCL all-reduce on gradient buckets]
  C --> D[Every rank has the same averaged gradients]
  D --> E[Optimizer step on every rank]
Classic DDP keeps a full model replica per rank and pays the main communication cost at gradient synchronization time.

The practical DDP story is:

  • parameters are replicated on every rank
  • activations and optimizer step logic are local to each rank
  • gradient buckets are synchronized with all-reduce
  • if all-reduce dominates the backward tail, you are now communication-bound

That is why DDP is so teachable: one process per GPU, one full model replica per process, and one dominant collective pattern.

SymptomInterpretationNext move
model does not fit in device memoryparameter + optimizer state footprint dominatesconsider FSDP or ZeRO-style sharding
all-reduce dominates step timecommunication is the bottlenecktune bucket sizing, overlap, topology awareness, or reduce model/data split pressure
activation memory spikesforward graph is too large for local device budgetactivation checkpointing, sequence parallelism, or pipeline partitioning
one stage idles while another computeswork is unevenly partitionedrebalance stages or simplify topology
def wrap_model(model: nn.Module, cfg: TrainConfig) -> nn.Module:
if cfg.parallelism == "ddp":
return torch.nn.parallel.DistributedDataParallel(
model,
device_ids=[cfg.local_rank],
output_device=cfg.local_rank,
gradient_as_bucket_view=True,
)
if cfg.parallelism == "fsdp":
return FSDP(
model,
auto_wrap_policy=size_based_auto_wrap_policy,
mixed_precision=cfg.mixed_precision_policy,
sharding_strategy=ShardingStrategy.FULL_SHARD,
)
raise ValueError(f"Unsupported mode: {cfg.parallelism}")
  • DDP replicates parameters on every rank and is easier to debug.
  • FSDP shards parameters across data-parallel workers; the current docs describe it exactly that way.
  • FSDP reduces memory pressure but shifts complexity into wrap policy, state-dict handling, checkpoint formats, and performance tuning.
  • In a live interview, choosing DDP first is usually more correct than prematurely optimizing into a harder failure model.

How FSDP Changes The Communication Pattern

Section titled “How FSDP Changes The Communication Pattern”
flowchart LR
  A[Parameter shards live on different ranks] --> B[All-gather full params for active module]
  B --> C[Forward and backward for that module]
  C --> D[Reduce-scatter gradients back into shards]
  D --> E[Sharded optimizer state update]
FSDP trades DDP's always-replicated model state for a repeated all-gather and reduce-scatter pattern around wrapped modules.

That leads to the clean interview contrast:

  • DDP mostly feels like replicated compute plus gradient all-reduce
  • FSDP feels like sharded state plus all-gather and reduce-scatter around compute boundaries
  • both still depend on the same underlying collective primitives, but the traffic pattern is very different
flowchart TD
  A[Node 0] --> B[Tensor Parallel Group 0]
  A --> C[Tensor Parallel Group 1]
  D[Node 1] --> E[Tensor Parallel Group 0]
  D --> F[Tensor Parallel Group 1]
  B --> G[Pipeline Stage 0]
  C --> G
  E --> H[Pipeline Stage 1]
  F --> H
  G --> I[Data Parallel Replica 0]
  H --> J[Data Parallel Replica 1]
Once tensor, pipeline, and data parallelism combine, the real work becomes communicator design and failure containment.

This is where staff-level language matters:

  • Tensor parallelism trades communication for larger layer capacity.
  • Pipeline parallelism trades bubble overhead and scheduling complexity for model fit.
  • Data parallelism trades replicated state for implementation simplicity.

The wrong answer is “use all of them for large models.” The right answer is “introduce only the extra axis needed to eliminate the current bottleneck.”

The current FSDP docs still position FSDP as a sharding wrapper for data-parallel workers, while current DDP docs still emphasize that DDP itself does not partition input data. Together, that leads to a clean interview distinction:

  • DDP: replication + gradient sync
  • FSDP: sharding + more state-management complexity
  • sampler / loader: still your responsibility either way
ScenarioBest first choiceWhy
medium model, commodity clusterDDPminimal operational complexity
model barely exceeds device memoryDDP + activation checkpointingcheapest complexity increase
model substantially exceeds device memoryFSDPmemory savings without full hybrid topology
enormous model, dedicated infraFSDP + tensor/pipeline parallelismnecessary but operationally heavier

Trick Question: “Why Not Just Increase Batch Size?”

Section titled “Trick Question: “Why Not Just Increase Batch Size?””

Because increasing batch size is not a generic scaling fix.

  • It may change optimization behavior.
  • It may increase activation memory.
  • It may mask data-loader starvation without fixing it.
  • It may raise communication payloads if gradient accumulation is not used carefully.