Parallelism Playbook

Candidates often list parallelism strategies. Senior candidates explain when each strategy becomes the least bad option.

Start With DDP

flowchart LR
  A[Full model on rank 0] --> B[Forward]
  C[Full model on rank 1] --> D[Forward]
  B --> E[Backward]
  D --> F[Backward]
  E --> G[All-reduce gradients]
  F --> G
  G --> H[Optimizer step on every rank]

Data parallelism is the most teachable baseline because compute is local and synchronization happens at gradient boundaries.

DDP is usually the right interview default because:

it matches common production practice
it keeps failure discussion legible
it isolates the first-order network cost to gradient synchronization
it gives you a clean path to sampler design and global batch math
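
The global batch math is worth being able to write down on demand. A minimal sketch, assuming every rank uses the same per-device batch size; the function and parameter names are illustrative and not part of any library:

```python
def global_batch_size(per_device_batch: int, world_size: int, grad_accum_steps: int = 1) -> int:
    """Number of samples that contribute to one optimizer step across all ranks."""
    if min(per_device_batch, world_size, grad_accum_steps) < 1:
        raise ValueError("all arguments must be >= 1")
    return per_device_batch * world_size * grad_accum_steps


def per_device_batch_for(target_global_batch: int, world_size: int, grad_accum_steps: int = 1) -> int:
    """Invert the formula; reject targets that do not divide evenly."""
    denom = world_size * grad_accum_steps
    if denom < 1 or target_global_batch % denom != 0:
        raise ValueError("target global batch must be a positive multiple of world_size * grad_accum_steps")
    return target_global_batch // denom
```

Doubling the number of ranks doubles the global batch unless the per-device batch or the accumulation steps are halved, which is why learning-rate and sampler decisions belong in the same conversation.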

When DDP Stops Being Enough

Symptom	Interpretation	Next move
model does not fit in device memory	parameter + optimizer state footprint dominates	consider FSDP or ZeRO-style sharding
all-reduce dominates step time	communication is the bottleneck	tune bucket size, overlap communication with backward compute, place ranks with topology in mind, or synchronize gradients less often
activation memory spikes	forward graph is too large for local device budget	activation checkpointing, sequence parallelism, or pipeline partitioning
one stage idles while another computes	work is unevenly partitioned	rebalance stages or simplify topology
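
To judge whether all-reduce can plausibly dominate step time, a back-of-envelope estimate helps. The sketch below assumes a bandwidth-bound ring all-reduce, in which each rank sends roughly 2(N-1)/N times the gradient payload. It ignores latency, overlap with backward compute, and the fact that a collective library may pick a different algorithm. The function names are illustrative.

```python
def ring_allreduce_bytes_per_rank(payload_bytes: float, world_size: int) -> float:
    """Bytes each rank sends in a ring all-reduce of `payload_bytes`."""
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    # Reduce-scatter and all-gather each move (N-1)/N of the payload.
    return 2.0 * (world_size - 1) / world_size * payload_bytes


def ring_allreduce_seconds(payload_bytes: float, world_size: int, link_bytes_per_sec: float) -> float:
    """Lower-bound time estimate, assuming the slowest link sets the pace."""
    if link_bytes_per_sec <= 0:
        raise ValueError("link_bytes_per_sec must be positive")
    return ring_allreduce_bytes_per_rank(payload_bytes, world_size) / link_bytes_per_sec
```

If this estimate is already a large fraction of the measured step time, bucket tuning alone is unlikely to rescue the run.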

DDP vs FSDP

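import torch
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


# TrainConfig is assumed to be defined elsewhere with the fields used below:
# parallelism, local_rank, and mixed_precision_policy.
def wrap_model(model: nn.Module, cfg: "TrainConfig") -> nn.Module:
    if cfg.parallelism == "ddp":
        # Full replica on every rank; gradients are all-reduced during backward.
        return torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[cfg.local_rank],
            output_device=cfg.local_rank,
            gradient_as_bucket_view=True,  # gradients alias the all-reduce buckets
        )
    if cfg.parallelism == "fsdp":
        # FULL_SHARD splits parameters, gradients, and optimizer state across ranks.
        return FSDP(
            model,
            auto_wrap_policy=size_based_auto_wrap_policy,
            mixed_precision=cfg.mixed_precision_policy,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
        )
    raise ValueError(f"Unsupported mode: {cfg.parallelism}")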

What to say out loud

DDP replicates parameters on every rank and is easier to debug.
FSDP shards parameters across data-parallel workers, which is how the current docs describe it; with FULL_SHARD, gradients and optimizer state are sharded as well.
FSDP reduces memory pressure but shifts complexity into wrap policy, state-dict handling, checkpoint formats, and performance tuning.
In a live interview, choosing DDP first is usually a stronger answer than prematurely optimizing into a harder failure model.
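
The memory argument can be made concrete with a rough estimate. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per parameter: 2 for half-precision weights, 2 for gradients, and 12 for fp32 master weights, momentum, and variance. It ignores activations, buffers, and the temporarily unsharded parameters that sharded training materializes during compute, so treat the output as a floor. The names are illustrative.

```python
# Assumption: mixed-precision training with Adam (2 + 2 + 12 bytes per parameter).
BYTES_PER_PARAM_MIXED_ADAM = 2 + 2 + 12


def model_state_gib(num_params: int, world_size: int, sharded: bool) -> float:
    """Approximate per-rank model-state memory in GiB.

    sharded=False models DDP-style replication; sharded=True models
    full sharding of parameters, gradients, and optimizer state.
    """
    if num_params < 0 or world_size < 1:
        raise ValueError("num_params must be >= 0 and world_size >= 1")
    total_bytes = num_params * BYTES_PER_PARAM_MIXED_ADAM
    per_rank_bytes = total_bytes / world_size if sharded else total_bytes
    return per_rank_bytes / 2**30
```

Under these assumptions a 7-billion-parameter model needs roughly 104 GiB of model state per rank when replicated and about 13 GiB per rank when fully sharded over 8 ranks, which is the gap that justifies the added complexity.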

Hybrid Parallelism

flowchart TD
  A[Node 0] --> B[Tensor Parallel Group 0]
  A --> C[Tensor Parallel Group 1]
  D[Node 1] --> E[Tensor Parallel Group 0]
  D --> F[Tensor Parallel Group 1]
  B --> G[Pipeline Stage 0]
  C --> G
  E --> H[Pipeline Stage 1]
  F --> H
  G --> I[Data Parallel Replica 0]
  H --> J[Data Parallel Replica 1]

Once tensor, pipeline, and data parallelism combine, the real work becomes communicator design and failure containment.
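
Communicator design starts with a deterministic mapping from global rank to a coordinate on each axis. The sketch below shows one possible layout, with tensor-parallel rank varying fastest so that each tensor-parallel group occupies adjacent ranks. The ordering is an assumption for illustration; real frameworks choose their own, and the names here are hypothetical.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel rank within the stage


def rank_to_coord(rank: int, tp_size: int, pp_size: int, dp_size: int) -> Coord:
    """Map a global rank to (dp, pp, tp) with tp as the fastest-varying axis."""
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp=dp, pp=pp, tp=tp)


def tensor_parallel_groups(tp_size: int, pp_size: int, dp_size: int) -> List[List[int]]:
    """List the global ranks in each tensor-parallel group under this layout."""
    world_size = tp_size * pp_size * dp_size
    return [list(range(start, start + tp_size)) for start in range(0, world_size, tp_size)]
```

Keeping tensor-parallel peers on adjacent ranks is a common motivation for this ordering, since that axis communicates most often and benefits from the fastest links.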

This is where staff-level language matters:

Tensor parallelism trades communication for larger layer capacity.
Pipeline parallelism trades bubble overhead and scheduling complexity for model fit.
Data parallelism trades replicated state for implementation simplicity.

The wrong answer is “use all of them for large models.” The right answer is “introduce only the extra axis needed to eliminate the current bottleneck.”
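
The pipeline bubble can be quantified. For a simple schedule in which all micro-batches run forward and then backward, and every stage takes equal time, the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. A minimal sketch under those assumptions, with an illustrative function name:

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of pipeline time spent idle, assuming equal-cost stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)
```

With 4 stages, raising the micro-batch count from 4 to 32 shrinks the bubble from 3/7 to 3/35 of the step, at the cost of holding more in-flight activations.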

Current PyTorch Notes

The current FSDP docs still position FSDP as a sharding wrapper for data-parallel workers, while current DDP docs still emphasize that DDP itself does not partition input data. Together, that leads to a clean interview distinction:

DDP: replication + gradient sync
FSDP: sharding + more state-management complexity
sampler / loader: still your responsibility either way
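
What "your responsibility" means in practice is giving each rank a disjoint, equally sized slice of the dataset. The pure-Python sketch below illustrates the pad-then-stride idea that distributed samplers commonly use. It is an illustration with a hypothetical function name, not the PyTorch implementation, and it omits shuffling.

```python
from typing import List


def shard_indices(dataset_len: int, rank: int, world_size: int, drop_last: bool = False) -> List[int]:
    """Return the sample indices assigned to `rank`, equal in count on every rank."""
    if dataset_len < 1 or world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("invalid dataset_len, rank, or world_size")
    if drop_last:
        # Truncate so the total divides evenly across ranks.
        total = (dataset_len // world_size) * world_size
    else:
        # Round up and wrap around, repeating early samples as padding.
        total = -(-dataset_len // world_size) * world_size
    indices = [i % dataset_len for i in range(total)]
    # Each rank takes every world_size-th index, offset by its rank.
    return indices[rank::world_size]
```

Equal counts per rank matter because a rank that runs out of batches early stops joining the gradient all-reduce and the remaining ranks hang.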

Choosing A Strategy

Scenario	Best first choice	Why
medium model, commodity cluster	DDP	minimal operational complexity
model barely exceeds device memory	DDP + activation checkpointing	cheapest complexity increase, provided activations account for the overflow
model substantially exceeds device memory	FSDP	memory savings without full hybrid topology
enormous model, dedicated infra	FSDP + tensor/pipeline parallelism	necessary but operationally heavier

Trick Question: “Why Not Just Increase Batch Size?”

Because increasing batch size is not a generic scaling fix.

It may change optimization behavior.
It may increase activation memory.
It may mask data-loader starvation without fixing it.
It may add communication if the larger batch is built with gradient accumulation and every micro-batch still triggers a gradient all-reduce.
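
The gradient accumulation caveat can be shown in a few lines. The sketch below assumes `ddp_model` is a DistributedDataParallel-wrapped module, whose `no_sync()` context manager suppresses gradient synchronization, and that `loss_fn` returns a tensor-like loss. The function name and argument layout are illustrative.

```python
from contextlib import nullcontext


def accumulate_and_step(ddp_model, optimizer, loss_fn, micro_batches):
    """Run one optimizer step over several micro-batches, syncing gradients once."""
    if not micro_batches:
        raise ValueError("need at least one micro-batch")
    optimizer.zero_grad()
    last = len(micro_batches) - 1
    for i, (inputs, targets) in enumerate(micro_batches):
        # Skip the all-reduce on every micro-batch except the last one.
        # Both forward and backward run inside the context.
        sync_ctx = nullcontext() if i == last else ddp_model.no_sync()
        with sync_ctx:
            # Scale so the accumulated gradient matches one large batch.
            loss = loss_fn(ddp_model(inputs), targets) / len(micro_batches)
            loss.backward()
    optimizer.step()
```

Without the `no_sync()` guard, four micro-batches cost four all-reduces per optimizer step instead of one, even though the gradient payload per all-reduce is unchanged.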