Colab Exercises
These drills are meant to be typed, explained, and defended. Treat them like mock interview reps.
Drill 1: Build The Happy Path In 15 Minutes
Goal:
- config dataclass
- toy dataset
- model
- single-process training loop
- metrics printout
What to prove:
- you can create structure before distribution
- you keep code readable under time pressure
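The five pieces listed above can be laid out in well under a page. The sketch below is one possible shape, not a reference solution: it uses only the standard library and a hand-written gradient so it runs anywhere, where an interview answer would normally use a framework such as PyTorch. `TrainConfig`, `make_toy_dataset`, `LinearModel`, and `train` are hypothetical names chosen for this illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical knobs for this sketch; a real config would carry more.
    num_samples: int = 64
    epochs: int = 50
    lr: float = 0.05
    seed: int = 0


def make_toy_dataset(cfg):
    """Return (x, y) pairs drawn from y = 2x + 1 with no noise."""
    rng = random.Random(cfg.seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(cfg.num_samples)]
    return [(x, 2.0 * x + 1.0) for x in xs]


class LinearModel:
    """One-weight, one-bias regressor with hand-written gradients."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def forward(self, x):
        return self.w * x + self.b

    def sgd_step(self, x, y, lr):
        err = self.forward(x) - y
        # Gradient of 0.5 * err**2 with respect to w and b.
        self.w -= lr * err * x
        self.b -= lr * err
        return 0.5 * err * err


def train(cfg):
    data = make_toy_dataset(cfg)
    model = LinearModel()
    history = []
    for epoch in range(cfg.epochs):
        total = 0.0
        for x, y in data:
            total += model.sgd_step(x, y, cfg.lr)
        mean_loss = total / len(data)
        history.append(mean_loss)
        # Metrics printout: sparse enough to stay readable.
        if epoch % 10 == 0 or epoch == cfg.epochs - 1:
            print(f"epoch={epoch:3d} mean_loss={mean_loss:.6f}")
    return model, history


model, history = train(TrainConfig())
```

Keeping the config, data, model, and loop in separate units is what makes the later lift to distributed training a matter of swapping pieces rather than rewriting the loop.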
Drill 2: Lift To Distributed Training
    flowchart LR
        A[Single-process trainer] --> B[Initialize process group]
        B --> C[Set device from local rank]
        C --> D[Use distributed sampler]
        D --> E[Wrap with DDP]
        E --> F[Rank-aware metrics and checkpointing]

Success criteria:
- `rank`, `local_rank`, and `world_size` are explicit
- sampler is deterministic
- non-rank-0 logging is controlled
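The three success criteria can be rehearsed without GPUs or a process group. The following is a minimal standard-library sketch, assuming the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets. `DistContext`, `read_dist_context`, `shard_indices`, and `log` are hypothetical helpers for illustration; the sharding imitates the shuffle, pad, and stride idea of a distributed sampler and is not a replacement for one.

```python
import os
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DistContext:
    rank: int
    local_rank: int
    world_size: int

    @property
    def is_main(self):
        return self.rank == 0


def read_dist_context(env=None):
    """Read rank info from torchrun-style variables, defaulting to one process."""
    env = os.environ if env is None else env
    return DistContext(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def shard_indices(num_samples, ctx, epoch, seed=0):
    """Shuffle with a seed shared by all ranks, pad, then stride by rank."""
    order = list(range(num_samples))
    # Every rank derives the same permutation from (seed, epoch).
    random.Random(seed + epoch).shuffle(order)
    # Pad by wrapping around so each rank receives the same number of indices.
    pad = (-len(order)) % ctx.world_size
    order += [order[i % len(order)] for i in range(pad)]
    return order[ctx.rank :: ctx.world_size]


def log(ctx, message):
    """Print from rank 0 only so logs are not duplicated per process."""
    if ctx.is_main:
        print(f"[rank {ctx.rank}/{ctx.world_size}] {message}")


ctx = read_dist_context()
log(ctx, f"epoch 0 shard: {shard_indices(10, ctx, epoch=0)}")
```

Passing `epoch` into the shuffle seed is the part candidates most often forget: without it, every epoch replays the same order on every rank.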
Drill 3: Resume After Failure
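Implement or pseudocode:

    from types import SimpleNamespace

    import torch

    def restore_if_present(model, optimizer, sampler, cfg):
        # latest_checkpoint_path and unwrap are helpers you are expected to supply:
        # the first is assumed to return the newest checkpoint path or None, the
        # second to strip a DDP wrapper (for example, by returning model.module).
        path = latest_checkpoint_path(cfg.checkpoint_dir)
        if path is None:
            return SimpleNamespace(epoch=0, step=0)

        # Load onto CPU first so the restore does not depend on the saving device.
        state = torch.load(path, map_location="cpu")
        model_state = state["model"]
        unwrap(model).load_state_dict(model_state)
        optimizer.load_state_dict(state["optimizer"])
        sampler.load_state_dict(state["sampler"])
        return SimpleNamespace(epoch=state["epoch"], step=state["step"])

Then explain: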
- what happens if topology changed
- how to verify checkpoint completeness
- why sampler state matters
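To make the sampler-state question concrete, here is a small sketch of a sampler that can be checkpointed mid-epoch. `ResumableSampler` is a hypothetical class written for this illustration, using only the standard library; it is not the API of any particular framework's sampler, and a sampler passed to `restore_if_present` would need to expose `state_dict` and `load_state_dict` in a similar way.

```python
import random


class ResumableSampler:
    """Shuffled index sampler that can checkpoint its position mid-epoch."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0
        self.cursor = 0  # indices of the current epoch already handed out

    def _order(self):
        order = list(range(self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        while self.cursor < len(order):
            index = order[self.cursor]
            # Advance before yielding so a saved state counts this index as used.
            self.cursor += 1
            yield index
        # Epoch finished: the next pass reshuffles and starts from the beginning.
        self.epoch += 1
        self.cursor = 0

    def state_dict(self):
        return {"seed": self.seed, "epoch": self.epoch, "cursor": self.cursor}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.cursor = state["cursor"]


# Interrupt after three samples, then resume in a fresh object.
sampler = ResumableSampler(8, seed=1)
iterator = iter(sampler)
seen_before_failure = [next(iterator) for _ in range(3)]
saved = sampler.state_dict()

restored = ResumableSampler(8)
restored.load_state_dict(saved)
seen_after_resume = list(restored)
print(seen_before_failure, seen_after_resume)
```

Without the saved `cursor`, the resumed job would restart the epoch and replay the first three samples, which silently changes what "one epoch" means in the metrics.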
Drill 4: Instrument Performance
Track:
- loader wait
- forward
- backward
- optimizer
- checkpoint save
Then answer:
- Which phase is most likely to scale poorly first?
- Which phase is most likely to have long-tail spikes?
- Which phase is easiest to make observable in a notebook?
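One lightweight way to track these phases in a notebook is a context-manager timer. The sketch below is an assumption about how such instrumentation might look rather than a prescribed tool: `PhaseTimer` is a hypothetical helper, and the loop body stands in for real loader, forward, and backward work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Collect wall-clock durations per named phase."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised.
            self.samples[name].append(self.clock() - start)

    def summary(self):
        """Return count, mean, and max per phase; max exposes long-tail spikes."""
        report = {}
        for name, values in self.samples.items():
            report[name] = {
                "count": len(values),
                "mean": sum(values) / len(values),
                "max": max(values),
            }
        return report


timer = PhaseTimer()
for step in range(3):
    with timer.phase("loader_wait"):
        batch = [float(i) for i in range(1000)]  # stand-in for fetching a batch
    with timer.phase("forward"):
        loss = sum(x * x for x in batch)  # stand-in for the forward pass
    with timer.phase("backward"):
        grads = [2.0 * x for x in batch]  # stand-in for the backward pass

for name, stats in timer.summary().items():
    print(f"{name:12s} n={stats['count']} mean={stats['mean']:.6f}s max={stats['max']:.6f}s")
```

On a GPU, kernels launch asynchronously, so wall-clock numbers like these are only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each clock read. Comparing `mean` against `max` per phase is a quick first look at the long-tail question above.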
Drill 5: Whiteboard The Production Lift
    flowchart TD
        A[Notebook prototype] --> B[Containerized trainer]
        B --> C[Job launcher]
        C --> D[Artifact + config service]
        D --> E[Metrics / logging stack]
        E --> F[Policy layer for retries and cost]

Fast Mock Prompts
| Prompt | What a good answer should emphasize |
|---|---|
| "The loss drops, but throughput is terrible." | Step-time decomposition and loader vs sync diagnosis |
| "The job resumes, but metrics look strange." | Batch semantics, LR continuity, sampler replay |
| "One GPU has lower utilization than the others." | Rank-local skew, data imbalance, hardware or placement issue |
| "Checkpointing makes the job stall." | RPO vs I/O overhead, async artifact handling, manifest publication |
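For the checkpointing prompt, "manifest publication" can be sketched as: write each artifact atomically, then write a manifest last so that its presence marks the checkpoint as complete. The example below is a minimal synchronous illustration using only the standard library. The directory layout, the `manifest.json` name, and the helpers `save_checkpoint` and `is_complete` are all assumptions for this sketch, and a production version would likely move the writes off the training thread.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def _sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _atomic_write(path, data):
    """Write bytes to a temp file in the same directory, then rename into place."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def save_checkpoint(directory, step, payloads):
    """Write each payload, then publish a manifest listing file hashes."""
    ckpt_dir = Path(directory) / f"step_{step:08d}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    files = {}
    for name, data in payloads.items():
        _atomic_write(ckpt_dir / name, data)
        files[name] = _sha256(ckpt_dir / name)
    manifest = {"step": step, "files": files}
    # Written last: readers treat a missing manifest as an incomplete checkpoint.
    _atomic_write(ckpt_dir / "manifest.json", json.dumps(manifest).encode("utf-8"))
    return ckpt_dir


def is_complete(ckpt_dir):
    """True only if the manifest exists and every listed file matches its hash."""
    manifest_path = Path(ckpt_dir) / "manifest.json"
    if not manifest_path.exists():
        return False
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    for name, digest in manifest["files"].items():
        target = Path(ckpt_dir) / name
        if not target.exists() or _sha256(target) != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as root:
    saved_dir = save_checkpoint(root, 7, {"model.bin": b"weights", "optimizer.bin": b"moments"})
    print(saved_dir.name, "complete:", is_complete(saved_dir))
```

The same `is_complete` check is one way to answer the Drill 3 question about verifying checkpoint completeness before a resume trusts the latest directory.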