Parallelism Playbook

Candidates often list parallelism strategies. Senior candidates explain when each strategy becomes the least bad option.

flowchart LR
  A[Full model on rank 0] --> B[Forward]
  C[Full model on rank 1] --> D[Forward]
  B --> E[Backward]
  D --> F[Backward]
  E --> G[All-reduce gradients]
  F --> G
  G --> H[Optimizer step on every rank]
Data parallelism is the most teachable baseline because compute is local and synchronization happens at gradient boundaries.
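To make that concrete, here is a minimal sketch of one data-parallel training step under DDP, assuming a `torchrun` launch; the model is a stand-in `nn.Linear` and `loader` is a placeholder for your DataLoader. Forward and optimizer work stay local; the only collective is the gradient all-reduce DDP triggers during `backward()`.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; `loader` is a placeholder for your DataLoader.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)     # stand-in model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for inputs, targets in loader:                         # each rank sees its own shard of the data
    inputs, targets = inputs.to(local_rank), targets.to(local_rank)
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)  # forward: purely local compute
    loss.backward()              # backward: DDP all-reduces gradient buckets, overlapped with compute
    optimizer.step()             # identical parameter update on every rank
    optimizer.zero_grad(set_to_none=True)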

DDP is usually the right interview default because:

  • it matches common production practice
  • it keeps failure discussion legible
  • it isolates the first-order network cost to gradient synchronization
  • it gives you a clean path to sampler design and global batch math (see the sketch after this list)
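The global batch math is worth stating out loud in the interview. A quick sketch with illustrative numbers (none of these values come from any framework default):

# Illustrative numbers; in practice these come from your run config.
world_size = 8            # number of data-parallel ranks
per_device_batch = 16     # samples per rank per micro-step
grad_accum_steps = 4      # micro-steps between optimizer updates

# DDP averages gradients across ranks, so the effective global batch per update is:
global_batch = per_device_batch * grad_accum_steps * world_size   # 16 * 4 * 8 = 512

# Learning-rate scaling rules (linear, sqrt, ...) should key off global_batch, not per_device_batch.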
| Symptom | Interpretation | Next move |
| --- | --- | --- |
| model does not fit in device memory | parameter + optimizer state footprint dominates | consider FSDP or ZeRO-style sharding |
| all-reduce dominates step time | communication is the bottleneck | tune bucket sizing, overlap, topology awareness, or reduce model/data split pressure |
| activation memory spikes | forward graph is too large for local device budget | activation checkpointing, sequence parallelism, or pipeline partitioning |
| one stage idles while another computes | work is unevenly partitioned | rebalance stages or simplify topology |
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def wrap_model(model: nn.Module, cfg: TrainConfig) -> nn.Module:
    # TrainConfig is the project's own config object (parallelism, local_rank, mixed_precision_policy).
    if cfg.parallelism == "ddp":
        # Replicate parameters on every rank; gradients sync via bucketed all-reduce in backward.
        return torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[cfg.local_rank],
            output_device=cfg.local_rank,
            gradient_as_bucket_view=True,
        )
    if cfg.parallelism == "fsdp":
        # Shard parameters, gradients, and optimizer state across data-parallel workers.
        return FSDP(
            model,
            auto_wrap_policy=size_based_auto_wrap_policy,
            mixed_precision=cfg.mixed_precision_policy,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
        )
    raise ValueError(f"Unsupported mode: {cfg.parallelism}")
  • DDP replicates parameters on every rank and is easier to debug.
  • FSDP shards parameters (along with gradients and optimizer state) across data-parallel workers, which is how the current PyTorch docs describe it.
  • FSDP reduces memory pressure but shifts complexity into wrap policy, state-dict handling, checkpoint formats, and performance tuning (see the checkpoint sketch after this list).
  • In a live interview, choosing DDP first is usually more correct than prematurely optimizing into a harder failure model.
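One concrete example of that state-dict complexity, sketched against the PyTorch 2.x FSDP API (the recommended path keeps evolving, and newer releases push toward torch.distributed.checkpoint); `fsdp_model` is a placeholder for a model already wrapped in FSDP:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

# Gather a full, unsharded state dict on rank 0 so it can be saved with plain torch.save.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state = fsdp_model.state_dict()   # all-gathers every shard; slow and memory-hungry at scale

if dist.get_rank() == 0:
    torch.save(cpu_state, "checkpoint.pt")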
flowchart TD
  A[Node 0] --> B[Tensor Parallel Group 0]
  A --> C[Tensor Parallel Group 1]
  D[Node 1] --> E[Tensor Parallel Group 0]
  D --> F[Tensor Parallel Group 1]
  B --> G[Pipeline Stage 0]
  C --> G
  E --> H[Pipeline Stage 1]
  F --> H
  G --> I[Data Parallel Replica 0]
  H --> J[Data Parallel Replica 1]
Once tensor, pipeline, and data parallelism combine, the real work becomes communicator design and failure containment.
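A sketch of what communicator design looks like in code, assuming an 8-GPU world arranged as tensor-parallel degree 2 by data-parallel degree 4 (the sizes and group layout are illustrative, not a prescription):

import torch.distributed as dist

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()        # e.g. 8 GPUs
rank = dist.get_rank()

tp_size = 2                               # tensor-parallel degree (illustrative)
assert world_size % tp_size == 0

# Tensor-parallel groups: ranks {0,1}, {2,3}, ... exchange activations for sharded matmuls.
# Every rank must call new_group for every group, in the same order.
tp_group = None
for start in range(0, world_size, tp_size):
    ranks = list(range(start, start + tp_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        tp_group = group

# Data-parallel groups: ranks {0,2,4,6} and {1,3,5,7} all-reduce gradients with each other.
dp_group = None
for offset in range(tp_size):
    ranks = list(range(offset, world_size, tp_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        dp_group = group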

This is where staff-level language matters:

  • Tensor parallelism trades communication for larger layer capacity.
  • Pipeline parallelism trades bubble overhead and scheduling complexity for model fit.
  • Data parallelism trades replicated state for implementation simplicity.

The wrong answer is “use all of them for large models.” The right answer is “introduce only the extra axis needed to eliminate the current bottleneck.”

The current FSDP docs position FSDP as a sharding wrapper around data-parallel workers, while the DDP docs emphasize that DDP itself does not partition input data. Together, that leads to a clean interview distinction:

  • DDP: replication + gradient sync
  • FSDP: sharding + more state-management complexity
  • sampler / loader: still your responsibility either way
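A minimal sketch of that responsibility, assuming a map-style `dataset` and the standard PyTorch DistributedSampler; `dataset`, `per_device_batch`, and `num_epochs` are placeholders:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Neither DDP nor FSDP partitions your data; the sampler does.
sampler = DistributedSampler(dataset, shuffle=True)    # defaults to the global world size / rank
loader = DataLoader(dataset, batch_size=per_device_batch, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)     # reshuffles across epochs; an easy detail to forget
    for batch in loader:
        ...                      # forward / backward / step as usual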
| Scenario | Best first choice | Why |
| --- | --- | --- |
| medium model, commodity cluster | DDP | minimal operational complexity |
| model barely exceeds device memory | DDP + activation checkpointing | cheapest complexity increase |
| model substantially exceeds device memory | FSDP | memory savings without full hybrid topology |
| enormous model, dedicated infra | FSDP + tensor/pipeline parallelism | necessary but operationally heavier |
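For the "DDP + activation checkpointing" row, the cheapest version is usually per-block checkpointing with torch.utils.checkpoint; a sketch in which the wrapper class and the block boundaries are illustrative choices, not part of any library:

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Wraps an existing block so its activations are recomputed in backward, not stored."""

    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x):
        return checkpoint(self.block, x, use_reentrant=False)

# Hypothetical usage: wrap the heaviest blocks of a model before handing it to DDP.
# model.layers = torch.nn.ModuleList(CheckpointedBlock(b) for b in model.layers)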

Trick Question: “Why Not Just Increase Batch Size?”


Because increasing batch size is not a generic scaling fix.

  • It may change optimization behavior.
  • It may increase activation memory.
  • It may mask data-loader starvation without fixing it.
  • It may raise communication payloads if gradient accumulation is not used carefully.
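On that last point, the usual fix is to skip gradient synchronization on intermediate micro-steps. A sketch using DDP's `no_sync()` context manager, where `ddp_model`, `loader`, `loss_fn`, `optimizer`, and `grad_accum_steps` are placeholders from the surrounding training code:

from contextlib import nullcontext

for step, (inputs, targets) in enumerate(loader):
    is_update_step = (step + 1) % grad_accum_steps == 0

    # Skip the gradient all-reduce on intermediate micro-batches;
    # pay the communication cost once per accumulation window.
    sync_ctx = nullcontext() if is_update_step else ddp_model.no_sync()
    with sync_ctx:
        loss = loss_fn(ddp_model(inputs), targets) / grad_accum_steps
        loss.backward()

    if is_update_step:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)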