Biotech ML Hard Questions

These questions layer on top of the standard distributed training questions. Biotech ML roles probe domain-specific correctness, biological data intuition, and operational awareness of wet-lab feedback loops.

```mermaid
mindmap
root((Biotech ML Questions))
  Data integrity
    scaffold split
    data leakage
    label sparsity
  Sequence modeling
    length distribution
    equivariance
    pair representation
  Operational
    wet lab loop
    active learning
    distribution shift
  Staff judgment
    assay prioritization
    cost per prediction
    model failure modes
```
Most biotech ML interview questions fall into data integrity, model architecture, or operational judgment.

1. “How do you prevent data leakage in drug discovery?”

Strong answer shape:

  • distinguish random split from scaffold split from time split
  • name the specific leakage mechanism: molecules sharing a chemical scaffold appear in both partitions, so a model can learn scaffold features rather than activity features
  • say which split fits which evaluation goal: scaffold split for virtual screening generalization, time split for prospective validation
  • note that the same logic applies to protein homology splits in structure prediction

The number worth remembering: on ChEMBL-scale datasets, a random split typically inflates AUROC by 5–15 points relative to a scaffold split. That gap measures scaffold memorization, not activity generalization.
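
A minimal sketch of the scaffold split itself, assuming RDKit's MurckoScaffold utilities; the grouping step is what keeps any one scaffold out of both partitions:

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, test_frac=0.2):
    """Assign whole Bemis-Murcko scaffold groups to train or test so that
    no scaffold appears in both partitions."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # unparseable SMILES are dropped, not silently split
        groups[MurckoScaffold.MurckoScaffoldSmiles(mol=mol)].append(idx)

    train_target = int((1.0 - test_frac) * len(smiles_list))
    train_idx, test_idx = [], []
    # Largest scaffold families fill train first; the long tail of rare
    # scaffolds lands in test, which is the harder generalization setting.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(group) <= train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```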

2. “How do you handle class imbalance in bioassay prediction?”

Good answer:

  • state the typical ratio (100:1 to 1000:1 negative-to-active)
  • describe loss reweighting with pos_weight in BCE as the simplest stateless option
  • describe WeightedRandomSampler as useful but requiring additional checkpoint state
  • say which metric matters: AUROC and EF@1% for hit identification; precision@K for fixed-capacity screening
  • say explicitly that accuracy is misleading at high imbalance

Strong addition:

“I prefer loss reweighting over oversampling in distributed training because it does not change the sampler state contract. With oversampling, I need to checkpoint the sampler’s internal RNG on top of the consumed-batch count, which adds a correctness surface that does not exist with stateless loss weighting.”
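
A minimal sketch of the stateless option, assuming a PyTorch binary classification head; the imbalance correction lives entirely in the loss, so there is nothing extra to checkpoint:

```python
import torch
import torch.nn as nn

# Estimated from the training split: roughly 200 negatives per active.
neg_per_pos = 200.0
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([neg_per_pos]))

# Stand-in logits and labels; in practice they come from the model head
# and the assay table for the current micro-batch.
logits = torch.randn(32, 1, requires_grad=True)
labels = torch.bernoulli(torch.full((32, 1), 0.005))  # ~0.5% actives

# Actives contribute ~200x more to the loss; the sampler is untouched,
# so no sampler RNG state enters the checkpoint contract.
loss = criterion(logits, labels)
loss.backward()
```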

3. “What is different about training a protein language model compared to an NLP language model?”

Key differences:

| Dimension | NLP LM | Protein LM |
| --- | --- | --- |
| Vocabulary | 30K–100K tokens | 20–33 amino acid tokens |
| Sequence length | 512–4096 typical | 10–35,000; heavy-tailed |
| Memory bottleneck | Embedding table + KV cache | Activation footprint per sequence |
| Augmentation | Masking, back-translation | Almost none; sequence is fixed chemistry |
| Evaluation signal | Perplexity + downstream NLP benchmarks | Structure prediction accuracy, fitness prediction |

The operational implication: protein LMs need curriculum training by sequence length to avoid OOM events from long-tail sequences early in training, and they benefit more from activation checkpointing than NLP models at equivalent parameter counts because the activation-to-parameter ratio is higher.
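
One way to express that curriculum, as a sketch with illustrative bucket boundaries; the point is that the long tail only enters once training has stabilized on short sequences:

```python
import numpy as np


def length_curriculum(lengths, boundaries=(256, 1024, 4096, 35_000)):
    """Partition dataset indices into length buckets; training consumes the
    buckets in order, so 30k-residue outliers never hit the earliest steps.
    Boundary values are illustrative, not tuned."""
    lengths = np.asarray(lengths)
    buckets, lower = [], 0
    for upper in boundaries:
        in_bucket = (lengths > lower) & (lengths <= upper)
        buckets.append(np.flatnonzero(in_bucket))
        lower = upper
    return buckets


# Heavy-tailed toy lengths: the last bucket carries the OOM risk.
for i, idx in enumerate(length_curriculum([120, 310, 2400, 980, 18_000, 64, 700])):
    print(f"bucket {i}: {idx.tolist()}")
```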

4. “When would you use sequence parallelism?”

Good answer:

  • define it precisely: the sequence axis is partitioned across ranks and each rank operates on a contiguous subsequence
  • state the trigger: when a single sequence no longer fits in device memory even at micro_batch_size=1
  • state the cost: one all-gather per attention layer across the sequence-parallel group
  • state the alternative: activation checkpointing reduces memory but does not help when the forward pass itself cannot fit

Concrete threshold:

“At L=2048 with a pair channel width of 128, a single pair representation in bfloat16 is 1 GiB. At that point, micro_batch_size=1 still exceeds a 40 GB device once you account for activations, optimizer state, and model parameters. Sequence parallelism becomes the required path.”
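
The arithmetic behind that number, as a back-of-envelope sketch (the 128-channel pair width is an assumption in line with AlphaFold-style models):

```python
# Back-of-envelope memory for one pair representation in bfloat16.
seq_len = 2048
pair_channels = 128   # assumed pair channel width
bytes_per_elem = 2    # bfloat16

pair_rep_bytes = seq_len * seq_len * pair_channels * bytes_per_elem
print(f"{pair_rep_bytes / 2**30:.2f} GiB per pair representation")  # 1.00 GiB
```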

5. “How do you handle multi-task training across dozens of assays with sparse labels?”

Strong answer:

  1. use a per-task binary mask to exclude missing labels from the loss computation; the mask is not optional
  2. apply per-task loss weighting based on assay reliability or sample count
  3. confirm that missing labels are truly missing, not negative—this is a common silent error in bioassay databases where an untested compound is stored as zero
  4. monitor per-task validation AUROC separately; global aggregate loss hides task-level degradation

What to say when pushed on the missing-label point:

“Treating missing labels as negatives is one of the most common silent errors in multi-task bioassay models. It systematically biases every head toward predicting inactivity, and it can take dozens of training runs before anyone checks the per-task breakdown.”
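
A minimal sketch of the masked loss from point 1 of the list above, assuming per-compound label and mask tensors with one column per assay; untested entries contribute zero gradient instead of being read as negatives:

```python
import torch
import torch.nn.functional as F


def masked_multitask_bce(logits, labels, mask, task_weights=None):
    """BCE over (batch, n_tasks) where mask == 1 only for measured labels.
    Untested (compound, assay) pairs are excluded from the loss entirely."""
    per_element = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    per_element = per_element * mask
    if task_weights is not None:   # e.g. per-assay reliability weights
        per_element = per_element * task_weights
    # Normalize by the number of observed labels, not by batch * n_tasks.
    return per_element.sum() / mask.sum().clamp(min=1)


# Toy batch: 4 compounds, 3 assays, most labels missing.
logits = torch.randn(4, 3)
labels = torch.tensor([[1., 0., 0.], [0., 0., 0.], [0., 1., 0.], [1., 0., 1.]])
mask   = torch.tensor([[1., 1., 0.], [0., 0., 0.], [0., 1., 0.], [1., 0., 1.]])
print(masked_multitask_bce(logits, labels, mask))
```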

6. “How do you think about distribution shift in de novo molecular generation?”

```mermaid
flowchart TD
  A[Train: known actives and decoys] --> B[Generative model]
  B --> C[Generated molecules]
  C --> D{Evaluation}
  D --> E[Proxy metrics: validity, novelty, diversity]
  D --> F[Oracle: docking or ADMET]
  D --> G[Wet lab confirmation]
  E --> H[Fast but not predictive]
  F --> H
  G --> I[Ground truth: weeks to months]
```
Generative models are evaluated on proxy metrics until wet lab results arrive. Distribution shift is invisible until that point.

Key points:

  • train/test split is irrelevant if you are generating novel chemistry outside the training manifold
  • standard metrics (validity, uniqueness, diversity) do not predict biological activity
  • the practical pipeline is: generate candidates → score with an oracle → submit a subset for wet lab → wait
  • distribution shift is not detectable from model metrics alone; it surfaces in wet lab hit rates against historical baselines
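
A sketch of those standard proxy metrics, assuming RDKit for validity checks and pre-canonicalized training SMILES; they are cheap to compute, and none of them measure activity:

```python
from rdkit import Chem


def proxy_metrics(generated_smiles, training_smiles):
    """Validity, uniqueness, and novelty for a generated batch.
    Assumes training_smiles are already canonical SMILES strings."""
    canonical = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    validity = len(canonical) / max(1, len(generated_smiles))
    unique = set(canonical)
    uniqueness = len(unique) / max(1, len(canonical))
    novelty = len(unique - set(training_smiles)) / max(1, len(unique))
    return validity, uniqueness, novelty


print(proxy_metrics(["CCO", "c1ccccc1", "not_a_smiles"], ["CCO"]))
# (0.666..., 1.0, 0.5) -- and still says nothing about wet lab hit rate
```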

7. “How do you design a training pipeline for an active learning loop?”

The components that change relative to a one-shot pipeline:

| Stage | One-shot training | Active learning |
| --- | --- | --- |
| Dataset | Fixed before training | Grows after each wet lab round |
| Sampler | One epoch definition | Must handle new data mid-run |
| Checkpoint | Resume at step | Resume at acquisition round boundary |
| Evaluation | Hold-out test set | Hold-out plus prospective wet lab results |

The operational challenge is that the boundary between training and evaluation becomes a workflow step, not a code path. The training system must accept a new labeled batch, re-partition the data, and continue without restarting from scratch.

Strong interview addition:

“I would model the active learning round as a checkpoint event: save the full state, validate the new labeled batch, extend the dataset, rebuild the sampler with the updated indices, and restore from checkpoint. That keeps the training loop unchanged and moves loop management into the orchestrator.”
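
A sketch of that orchestration; every name here (the trainer methods, validate_labels, build_sampler) is hypothetical and stands in for whatever the real pipeline exposes:

```python
def run_acquisition_round(trainer, dataset, new_labeled_batch,
                          validate_labels, build_sampler):
    """Treat the active learning round boundary as a checkpoint event.
    The inner training loop never learns that the dataset changed; only
    the orchestrator does. All interfaces are illustrative stand-ins."""
    ckpt_path = trainer.save_checkpoint()                    # 1. full state to disk
    validate_labels(new_labeled_batch)                       # 2. reject malformed assay rows
    dataset.extend(new_labeled_batch)                        # 3. grow the dataset
    sampler = build_sampler(dataset, seed=trainer.step)      # 4. rebuild over the new indices
    trainer.restore_checkpoint(ckpt_path, sampler=sampler)   # 5. resume, no cold restart
    return trainer
```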

8. “How do you evaluate a model when wet lab validation takes weeks?”

Good answer:

  1. use computational proxies during development: docking scores, in silico ADMET, molecular dynamics
  2. maintain a held-out set of historical assay results the model has never seen
  3. track prediction–experiment correlation as prospective data arrives
  4. treat wet lab hit rate, not model AUROC, as the ultimate success metric

The staff-level sentence:

“A model with AUROC 0.85 that produces a 2% wet lab hit rate at 10% virtual screening cutoff is worse than a model with AUROC 0.80 that produces an 8% hit rate. Computational metrics are proxies for what actually matters.”
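
A sketch of the prospective metric that sentence elevates, computed once assay results come back (names and data are illustrative):

```python
import numpy as np


def hit_rate_at_cutoff(scores, wet_lab_hit, cutoff_frac=0.10):
    """Fraction of confirmed wet lab hits among the top-scoring
    cutoff_frac of the screened library."""
    scores = np.asarray(scores)
    wet_lab_hit = np.asarray(wet_lab_hit, dtype=bool)
    k = max(1, int(cutoff_frac * len(scores)))
    top_k = np.argsort(scores)[::-1][:k]
    return wet_lab_hit[top_k].mean()


# Placeholder data; in practice scores come from the model and hits from
# the assay, weeks later. This number, not AUROC, is the success metric.
scores = np.random.rand(1000)
hits = np.random.rand(1000) < 0.02
print(f"hit rate at 10% cutoff: {hit_rate_at_cutoff(scores, hits):.3f}")
```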

9. “How does SE(3) equivariance constrain your training pipeline?”

```mermaid
flowchart LR
  A[3D coordinates] --> B[Canonicalize frame]
  B --> C[Equivariant network]
  C --> D[Invariant scalar output]
  B -.->|Must precede sharding| E[Data partition into ranks]
  D -.->|Batch norm breaks equivariance| F[Normalization audit]
```
Frame canonicalization must precede distributed sharding. Rank-local re-centering of a molecular shard is geometrically incorrect.

Three constraints to state explicitly:

  1. Frame canonicalization before sharding. Re-centering or aligning a structure must happen in preprocessing. A DataLoader worker that receives a fragment of a protein cannot independently re-center because it lacks the full structural context.
  2. No rotation augmentation. An equivariant model is rotation-invariant by construction. Augmenting with random rotations wastes compute and can degrade training when equivariance is approximate.
  3. Layer normalization only. Batch normalization aggregates across the batch dimension. For variable-size molecular graphs, batch statistics are inconsistent. Layer normalization operates per node and is always safe.
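
A sketch of constraint 1, assuming coordinates arrive as an (N, 3) array; the centering runs once in preprocessing, against the full structure, before any partitioning:

```python
import numpy as np


def canonicalize_frame(coords):
    """Center a complete structure at its centroid during preprocessing.
    Must see the full (N, 3) array: a DataLoader worker holding only a
    fragment would compute a different centroid and shift the frame."""
    coords = np.asarray(coords, dtype=np.float64)
    return coords - coords.mean(axis=0, keepdims=True)


# Canonicalize once, write to the dataset, then let the distributed
# sampler partition the already-centered examples across ranks.
coords = np.random.rand(350, 3) * 50.0   # toy 350-residue structure
canonical = canonicalize_frame(coords)
assert np.allclose(canonical.mean(axis=0), 0.0)
```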

10. “What would you page on in a protein model training run?”

| Signal | Why page-worthy |
| --- | --- |
| OOM on one rank during sequence ingestion | Long-tail sequence entered training; curriculum or length filter failed |
| Per-task AUROC collapse on one assay head | Task-specific data issue or label corruption |
| Gradient norm spike followed by loss explosion | Numerical instability common with long sequences in float16 |
| Wet lab hit rate below historical baseline | Distribution shift or generalization failure that metrics did not catch |
| Checkpoint age exceeds 2× expected interval | I/O stall or storage contention in a checkpoint-heavy run |
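
A sketch of how two of those rows might become alert rules, with thresholds and inputs chosen purely for illustration:

```python
import time


def paging_checks(last_checkpoint_ts, expected_interval_s,
                  grad_norm_history, spike_factor=10.0):
    """Return page-worthy conditions from the table above. Thresholds are
    illustrative; real values come from the run's own history."""
    alerts = []
    if time.time() - last_checkpoint_ts > 2 * expected_interval_s:
        alerts.append("checkpoint age exceeds 2x expected interval")
    if len(grad_norm_history) >= 2:
        baseline = sum(grad_norm_history[:-1]) / (len(grad_norm_history) - 1)
        if grad_norm_history[-1] > spike_factor * baseline:
            alerts.append("gradient norm spike")
    return alerts


# Example: last checkpoint two hours ago on a 30-minute cadence, plus a
# sudden jump in gradient norm -> both conditions page.
print(paging_checks(time.time() - 7200, 1800, [1.1, 0.9, 1.0, 14.0]))
```
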
  • “Scaffold split is the correctness bar; random split is the optimism bar.”
  • “Missing labels in a bioassay database are not negatives. Treating them as negatives is a silent training error.”
  • “Equivariant network training imposes preprocessing contracts that move complexity out of the training loop and into the data pipeline.”
  • “Wet lab hit rate is the only metric the business actually cares about. Computational metrics exist to let us move faster while waiting for it.”
  • “Active learning turns the dataset contract from a precondition into an ongoing invariant.”