
Performance Pathologies

The performance conversation gets much better once you stop saying “the GPUs are underutilized” and start naming the stage that is stalling.

flowchart LR
  A[Batch wait] --> B[H2D copy]
  B --> C[Forward]
  C --> D[Backward]
  D --> E[Gradient sync]
  E --> F[Optimizer step]
  F --> G[Checkpoint / logging side work]
Any one of these segments can dominate. Your tuning plan should match the segment, not the vibe.
| Question | Why |
| --- | --- |
| Is step time stable or bursty? | Bursty often means I/O or background contention. |
| Is the slowdown rank-local or global? | Local issues suggest hardware, data, or placement skew. |
| Does the gap appear before backward or during gradient sync? | Separates compute inefficiency from communication bottlenecks. |
| Did the gap appear after a memory-saving change? | Activation checkpointing and sharding can trade memory for latency. |
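
For the rank-local versus global question, here is a minimal sketch of one way to compare per-rank step times, assuming an already-initialized torch.distributed process group; the function name and the print format are placeholders, not part of any library API:

import torch.distributed as dist

def gather_step_times(local_step_time):
    # Collect every rank's latest step time so rank 0 can spot a straggler.
    # One slow rank suggests hardware, data, or placement skew on that node;
    # uniformly slow ranks suggest a global bottleneck.
    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, local_step_time)
    if dist.get_rank() == 0:
        print(f"step time spread: min={min(times):.3f}s, max={max(times):.3f}s")
    return times
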
Instrumenting the step per stage makes the stalling segment visible instead of arguing about aggregate utilization:

def timed_train_step(model, optimizer, batch, timer, scaler=None):
    with timer("h2d"):
        batch = move_to_device(batch)
    with timer("forward"):
        outputs = model(batch["inputs"])
        loss = compute_loss(outputs, batch["targets"])
    with timer("backward"):
        if scaler:
            scaler.scale(loss).backward()
        else:
            loss.backward()
    with timer("optimizer"):
        if scaler:
            scaler.step(optimizer)
            scaler.update()
        else:
            optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return loss
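
The timer passed into timed_train_step is assumed rather than shown. A minimal sketch of one way to build it is a callable that returns a context manager and accumulates wall-clock seconds per stage; the StageTimer name and the explicit CUDA synchronization are assumptions, not part of the original:

import time
from collections import defaultdict
from contextlib import contextmanager

import torch

class StageTimer:
    def __init__(self):
        self.durations = defaultdict(float)

    @contextmanager
    def __call__(self, stage):
        # CUDA launches are asynchronous; synchronizing here attributes time
        # to the stage that actually ran, at the cost of some pipelining.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        try:
            yield
        finally:
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            self.durations[stage] += time.perf_counter() - start

Passing timer = StageTimer() into the step and logging timer.durations every few hundred steps shows which segment of the flowchart above is growing.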

The current official AMP docs and activation-checkpoint docs add useful precision:

  • with autocast, you should not manually call half() or bfloat16() on your model or inputs just to “do AMP right”
  • autocast should wrap the forward pass and loss computation; backward under autocast is not the recommended pattern
  • activation checkpointing still fundamentally trades compute for memory
  • preserving RNG state across activation-checkpoint recomputation improves determinism but can cost performance

That is useful interview material because it turns a vague “use mixed precision and checkpointing” answer into an operational one.
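
A minimal sketch of what those recommendations look like in code, assuming a model split into model.encoder and model.head and a bfloat16-capable GPU (so no GradScaler is needed); those names and loss_fn are placeholders, while torch.autocast and torch.utils.checkpoint.checkpoint are the documented entry points:

import torch
from torch.utils.checkpoint import checkpoint

def amp_checkpointed_step(model, optimizer, inputs, targets, loss_fn):
    # autocast wraps only the forward pass and the loss computation;
    # the model and inputs stay in full precision, no manual half()/bfloat16().
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Recompute the encoder's activations during backward instead of
        # storing them: memory relief paid for with extra forward compute.
        # preserve_rng_state defaults to True, keeping dropout patterns
        # identical on recompute at a small performance cost.
        hidden = checkpoint(model.encoder, inputs, use_reentrant=False)
        outputs = model.head(hidden)
        loss = loss_fn(outputs, targets)
    # backward and the optimizer step run outside the autocast region.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss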

flowchart TD
  A[Throughput drop] --> B{Main symptom}
  B --> C[GPU idle before kernels]
  B --> D[Long backward tail]
  B --> E[Spiky step latency]
  B --> F[OOM after scaling]
  C --> G[Input pipeline or H2D issue]
  D --> H[All-reduce / bucketization / topology issue]
  E --> I[Checkpointing, storage, or noisy neighbor]
  F --> J[Activation, optimizer, or fragmentation issue]
Name the symptom, then narrow the subsystem. That is stronger than dumping generic tuning tips.
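
One cheap way to separate the "GPU idle before kernels" branch from the others is to time the wait for the next batch separately from the step itself. A minimal sketch, assuming step_fn synchronizes internally (or is the instrumented step above) so that step time reflects real device work; the names and logging threshold are placeholders:

import time

def profile_epoch(loader, step_fn, log_every=50):
    # If batch-wait dominates, suspect the input pipeline or the H2D path;
    # if step time dominates, suspect compute, gradient sync, or side work.
    wait_total = step_total = 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader, start=1):
        t1 = time.perf_counter()
        wait_total += t1 - t0
        step_fn(batch)
        t0 = time.perf_counter()
        step_total += t0 - t1
        if i % log_every == 0:
            print(f"step {i}: avg batch wait {wait_total / i:.4f}s, "
                  f"avg step {step_total / i:.4f}s")
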
| Lever | Upside | Risk |
| --- | --- | --- |
| larger batch size | better device occupancy | optimization behavior changes, memory pressure increases |
| mixed precision | more throughput, lower memory | numerical edge cases, scaler handling |
| more loader workers | better CPU parallelism | oversubscription and context-switch overhead |
| gradient accumulation | emulate larger global batch | longer optimizer feedback loop |
| activation checkpointing | memory relief | extra recompute increases latency |
| NCCL/env tuning | better comm efficiency | cluster-specific and hard to generalize live |
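
As one example of how a lever changes the loop rather than just a flag, here is a gradient-accumulation sketch that emulates a larger global batch; accum_steps and the loss scaling are the usual pattern, the other names are placeholders:

def train_with_accumulation(model, optimizer, loader, loss_fn, accum_steps=4):
    # Each optimizer step sees gradients from accum_steps micro-batches,
    # emulating a larger global batch at the cost of a longer feedback loop.
    optimizer.zero_grad(set_to_none=True)
    for i, batch in enumerate(loader, start=1):
        outputs = model(batch["inputs"])
        # Scale so the accumulated gradient matches one large-batch gradient.
        loss = loss_fn(outputs, batch["targets"]) / accum_steps
        loss.backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)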

A staff-level answer connects performance to platform economics:

  • cost per successful training hour
  • storage bandwidth consumed by checkpoints
  • cluster fragmentation caused by rigid topology requirements
  • debugging burden introduced by aggressive optimization