
Performance Pathologies

The performance conversation gets much better once you stop saying “the GPUs are underutilized” and start naming the stage that is stalling.

flowchart LR
  A[Batch wait] --> B[H2D copy]
  B --> C[Forward]
  C --> D[Backward]
  D --> E[Gradient sync]
  E --> F[Optimizer step]
  F --> G[Checkpoint / logging side work]
Any one of these segments can dominate. Your tuning plan should match the segment, not the vibe.
| Question | Why |
| --- | --- |
| Is step time stable or bursty? | Bursty often means I/O or background contention. |
| Is the slowdown rank-local or global? | Local issues suggest hardware, data, or placement skew. |
| Does the gap appear before backward or during gradient sync? | Separates compute inefficiency from communication bottlenecks. |
| Did the gap appear after a memory-saving change? | Activation checkpointing and sharding can trade memory for latency. |
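
For the rank-local versus global question, here is a minimal sketch of one way to compare per-rank step times, assuming an already-initialized torch.distributed process group; the function name and the print format are placeholders, not part of any library API:

import torch.distributed as dist

def gather_step_times(local_step_time):
    # Collect every rank's latest step time so rank 0 can spot a straggler.
    # One slow rank suggests hardware, data, or placement skew on that node;
    # uniformly slow ranks suggest a global bottleneck.
    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, local_step_time)
    if dist.get_rank() == 0:
        print(f"step time spread: min={min(times):.3f}s, max={max(times):.3f}s")
    return times
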
Instrumenting the step per stage makes the stalling segment visible instead of arguing about aggregate utilization:

def timed_train_step(model, optimizer, batch, timer, scaler=None):
    with timer("h2d"):
        batch = move_to_device(batch)
    with timer("forward"):
        outputs = model(batch["inputs"])
        loss = compute_loss(outputs, batch["targets"])
    with timer("backward"):
        if scaler:
            scaler.scale(loss).backward()
        else:
            loss.backward()
    with timer("optimizer"):
        if scaler:
            scaler.step(optimizer)
            scaler.update()
        else:
            optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return loss
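
The timer passed into timed_train_step is assumed rather than shown. A minimal sketch of one way to build it is a callable that returns a context manager and accumulates wall-clock seconds per stage; the StageTimer name and the explicit CUDA synchronization are assumptions, not part of the original:

import time
from collections import defaultdict
from contextlib import contextmanager

import torch

class StageTimer:
    def __init__(self):
        self.durations = defaultdict(float)

    @contextmanager
    def __call__(self, stage):
        # CUDA launches are asynchronous; synchronizing here attributes time
        # to the stage that actually ran, at the cost of some pipelining.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        try:
            yield
        finally:
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            self.durations[stage] += time.perf_counter() - start

Passing timer = StageTimer() into the step and logging timer.durations every few hundred steps shows which segment of the flowchart above is growing.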

The current official AMP docs and activation-checkpoint docs add useful precision:

  • with autocast, you should not manually call half() or bfloat16() on your model or inputs just to “do AMP right”
  • autocast should wrap the forward pass and loss computation; backward under autocast is not the recommended pattern
  • activation checkpointing still fundamentally trades compute for memory
  • preserving RNG state across activation-checkpoint recomputation improves determinism but can cost performance

That is useful interview material because it turns a vague “use mixed precision and checkpointing” answer into an operational one.
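
A minimal sketch of what those recommendations look like in code, assuming a model split into model.encoder and model.head and a bfloat16-capable GPU (so no GradScaler is needed); those names and loss_fn are placeholders, while torch.autocast and torch.utils.checkpoint.checkpoint are the documented entry points:

import torch
from torch.utils.checkpoint import checkpoint

def amp_checkpointed_step(model, optimizer, inputs, targets, loss_fn):
    # autocast wraps only the forward pass and the loss computation;
    # the model and inputs stay in full precision, no manual half()/bfloat16().
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Recompute the encoder's activations during backward instead of
        # storing them: memory relief paid for with extra forward compute.
        # preserve_rng_state defaults to True, keeping dropout patterns
        # identical on recompute at a small performance cost.
        hidden = checkpoint(model.encoder, inputs, use_reentrant=False)
        outputs = model.head(hidden)
        loss = loss_fn(outputs, targets)
    # backward and the optimizer step run outside the autocast region.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss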

flowchart TD
  A[Throughput drop] --> B{Main symptom}
  B --> C[GPU idle before kernels]
  B --> D[Long backward tail]
  B --> E[Spiky step latency]
  B --> F[OOM after scaling]
  C --> G[Input pipeline or H2D issue]
  D --> H[All-reduce / bucketization / topology issue]
  E --> I[Checkpointing, storage, or noisy neighbor]
  F --> J[Activation, optimizer, or fragmentation issue]
Name the symptom, then narrow the subsystem. That is stronger than dumping generic tuning tips.
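
One cheap way to separate the "GPU idle before kernels" branch from the others is to time the wait for the next batch separately from the step itself. A minimal sketch, assuming step_fn synchronizes internally (or is the instrumented step above) so that step time reflects real device work; the names and logging threshold are placeholders:

import time

def profile_epoch(loader, step_fn, log_every=50):
    # If batch-wait dominates, suspect the input pipeline or the H2D path;
    # if step time dominates, suspect compute, gradient sync, or side work.
    wait_total = step_total = 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader, start=1):
        t1 = time.perf_counter()
        wait_total += t1 - t0
        step_fn(batch)
        t0 = time.perf_counter()
        step_total += t0 - t1
        if i % log_every == 0:
            print(f"step {i}: avg batch wait {wait_total / i:.4f}s, "
                  f"avg step {step_total / i:.4f}s")
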
| Lever | Upside | Risk |
| --- | --- | --- |
| larger batch size | better device occupancy | optimization behavior changes, memory pressure increases |
| mixed precision | more throughput, lower memory | numerical edge cases, scaler handling |
| more loader workers | better CPU parallelism | oversubscription and context-switch overhead |
| gradient accumulation | emulate larger global batch | longer optimizer feedback loop |
| activation checkpointing | memory relief | extra recompute increases latency |
| NCCL/env tuning | better comm efficiency | cluster-specific and hard to generalize live |
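
As one example of how a lever changes the loop rather than just a flag, here is a gradient-accumulation sketch that emulates a larger global batch; accum_steps and the loss scaling are the usual pattern, the other names are placeholders:

def train_with_accumulation(model, optimizer, loader, loss_fn, accum_steps=4):
    # Each optimizer step sees gradients from accum_steps micro-batches,
    # emulating a larger global batch at the cost of a longer feedback loop.
    optimizer.zero_grad(set_to_none=True)
    for i, batch in enumerate(loader, start=1):
        outputs = model(batch["inputs"])
        # Scale so the accumulated gradient matches one large-batch gradient.
        loss = loss_fn(outputs, batch["targets"]) / accum_steps
        loss.backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)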

A staff-level answer connects performance to platform economics:

  • cost per successful training hour
  • storage bandwidth consumed by checkpoints
  • cluster fragmentation caused by rigid topology requirements
  • debugging burden introduced by aggressive optimization