Observability and Debugging
The interview signal here is not whether you know a specific observability vendor. It is whether you know what must be observable in a distributed trainer.
The Telemetry Graph
flowchart LR
A[Trainer ranks] --> B[Structured logs]
A --> C[Metrics]
A --> D[Spans / traces]
B --> E[Central log store]
C --> F[Time-series backend]
D --> G[Trace backend]
E --> H[Alerts + dashboards]
F --> H
G --> H
Metrics Worth Emitting
Training semantics
- global step
- loss
- learning rate
- gradient norm
- skipped-step count for AMP / overflow cases
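A minimal sketch of emitting these from one PyTorch step, assuming mixed precision via `torch.cuda.amp.GradScaler`; the `emit_metric` helper is a hypothetical stand-in for whatever metrics client you use. The skipped-step signal comes from watching the loss scale: it drops after `scaler.update()` whenever the step was skipped for overflow.

```python
import torch

def emit_metric(name: str, value: float, step: int) -> None:
    # Hypothetical stand-in: route to your time-series backend instead
    print(f"step={step} {name}={value:.6g}")

def step_with_metrics(model, optimizer, scaler, loss, global_step: int) -> None:
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so the gradient norm is in true units
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scale_before = scaler.get_scale()
    scaler.step(optimizer)      # silently skipped if grads contain inf/nan
    scaler.update()             # shrinks the scale when the step was skipped
    emit_metric("loss", loss.item(), global_step)
    emit_metric("grad_norm", float(grad_norm), global_step)
    emit_metric("learning_rate", optimizer.param_groups[0]["lr"], global_step)
    emit_metric("skipped_step", float(scaler.get_scale() < scale_before), global_step)
```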
Systems behavior
- step time by phase
- samples/sec
- all-reduce time
- data-loader wait time
- checkpoint save latency
- GPU memory allocated and reserved
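Step time by phase is what the `phase` context manager shown later is for; throughput and memory are cheap to compute inline. A sketch, assuming CUDA and reusing the hypothetical `emit_metric` stand-in from above:

```python
import time
import torch

def emit_metric(name: str, value: float, step: int) -> None:
    print(f"step={step} {name}={value:.6g}")  # hypothetical stand-in, as above

def run_timed_step(run_step, batch_size: int, global_step: int) -> None:
    started = time.perf_counter()
    run_step()
    torch.cuda.synchronize()  # otherwise async kernels make the timing a lie
    elapsed = time.perf_counter() - started
    emit_metric("step_time_s", elapsed, global_step)
    emit_metric("samples_per_sec", batch_size / elapsed, global_step)
    emit_metric("gpu_mem_allocated_gib", torch.cuda.memory_allocated() / 2**30, global_step)
    emit_metric("gpu_mem_reserved_gib", torch.cuda.memory_reserved() / 2**30, global_step)
```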
Reliability
- restart count
- checkpoint age
- failed collective count
- per-rank heartbeat freshness
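One hedged way to get per-rank heartbeat freshness with nothing but a shared filesystem; the directory layout and the 120-second staleness threshold are illustrative assumptions, not a standard.

```python
import json
import os
import time
from pathlib import Path

HEARTBEAT_DIR = Path("/shared/job/heartbeats")  # illustrative shared path

def write_heartbeat(rank: int, global_step: int) -> None:
    HEARTBEAT_DIR.mkdir(parents=True, exist_ok=True)
    tmp = HEARTBEAT_DIR / f"rank{rank}.tmp"
    tmp.write_text(json.dumps({"rank": rank, "step": global_step, "ts": time.time()}))
    os.replace(tmp, HEARTBEAT_DIR / f"rank{rank}.json")  # atomic swap on POSIX

def stale_ranks(world_size: int, max_age_s: float = 120.0) -> list[int]:
    now = time.time()
    stale = []
    for rank in range(world_size):
        path = HEARTBEAT_DIR / f"rank{rank}.json"
        if not path.exists() or now - json.loads(path.read_text())["ts"] > max_age_s:
            stale.append(rank)
    return stale
```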
Logging Policy
| Log type | Where it should come from |
|---|---|
| concise progress logs | rank 0 |
| structured error logs | every rank |
| environment summary | rank 0 and launch layer |
| communicator diagnostics | all affected ranks, throttled |
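A minimal sketch of enforcing that table with the standard library: rank 0 carries the concise progress stream, while warnings and errors still surface from every rank. `RANK` is the environment variable that torchrun-style launchers set.

```python
import logging
import os

def configure_logging() -> logging.Logger:
    rank = int(os.environ.get("RANK", "0"))  # set by torchrun-style launchers
    logger = logging.getLogger("trainer")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"[rank {rank}] %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Concise progress from rank 0 only; warnings and errors from every rank
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)
    return logger
```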
Debugging A Hang
flowchart TD
A[Job appears stuck] --> B{Are ranks alive?}
B -->|No| C[Process crash / OOM / node issue]
B -->|Yes| D{Progress metric moving?}
D -->|No| E[Deadlock or blocked I/O]
D -->|Yes| F[Slow path, not hang]
E --> G[Check last collective, loader wait, checkpoint write]
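The first branch, "are ranks alive?", is answerable without killing the job. A sketch using the standard library's faulthandler: send SIGUSR1 to a stuck rank (`kill -USR1 <pid>`) and it dumps every Python thread's stack, which usually points straight at the blocked collective or I/O call. The per-rank file path is an illustrative choice. (`py-spy dump --pid <pid>` gives the same view from outside the process.)

```python
import faulthandler
import os
import signal

def install_stack_dumper(rank: int, log_dir: str = "/tmp") -> None:
    # Keep the file object alive: faulthandler writes to its fd on signal
    stack_file = open(os.path.join(log_dir, f"stacks_rank{rank}.log"), "w")
    # On SIGUSR1, dump all Python thread stacks without stopping the process
    faulthandler.register(signal.SIGUSR1, file=stack_file, all_threads=True)
```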
Debug Hooks You Can Show In A Notebook
```python
import json
import time
from contextlib import contextmanager

@contextmanager
def phase(timer_store: dict[str, list[float]], name: str):
    """Accumulate wall-clock seconds per phase, e.g. `with phase(timers, "all_reduce"):`."""
    started = time.perf_counter()
    try:
        yield
    finally:
        timer_store.setdefault(name, []).append(time.perf_counter() - started)

def log_rank_event(rank: int, event: str, **fields) -> None:
    """Emit one grep-friendly structured JSON line per event, from every rank."""
    payload = {"rank": rank, "event": event, **fields}
    print(json.dumps(payload, sort_keys=True))
```
Diagnosing Silent Desynchronization
This is the class of failure where the job still runs but semantics drift.
Examples:
- rank 3 is skipping batches after a loader exception
- one rank restored stale optimizer state
- world size changed but effective batch math was not updated
- sampler shuffle seeds differ across ranks
Your defense is invariant checking:
- assert batch counts match expected step counts
- log checkpoint metadata on resume
- compare sampler state across ranks when debugging
- track global batch and accumulation config in emitted metadata
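Several of these reduce to "do all ranks agree on X?". A sketch of that check, assuming `torch.distributed` is already initialized; `all_gather_object` collects a small dict from every rank so any disagreement fails loudly instead of drifting silently.

```python
import torch.distributed as dist

def assert_ranks_agree(local_state: dict) -> None:
    # Collect each rank's view of the config; must be called by all ranks
    world_size = dist.get_world_size()
    gathered: list[dict | None] = [None] * world_size
    dist.all_gather_object(gathered, local_state)
    reference = gathered[0]
    for rank, state in enumerate(gathered):
        if state != reference:
            raise RuntimeError(f"rank {rank} disagrees: {state} != {reference}")

# Example: compare sampler seed, epoch, and effective batch math across ranks
# assert_ranks_agree({"seed": seed, "epoch": epoch,
#                     "global_batch": global_batch, "accum_steps": accum_steps})
```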