Biotech ML Hard Questions

These questions layer on top of the standard distributed training questions. Biotech ML roles probe domain-specific correctness, biological data intuition, and operational awareness of wet-lab feedback loops.

```mermaid
mindmap
root((Biotech ML Questions))
  Data integrity
    scaffold split
    data leakage
    label sparsity
  Sequence modeling
    length distribution
    equivariance
    pair representation
  Operational
    wet lab loop
    active learning
    distribution shift
  Staff judgment
    assay prioritization
    cost per prediction
    model failure modes
```
Most biotech ML interview questions fall into data integrity, model architecture, or operational judgment.

1. “How do you prevent data leakage in drug discovery?”

Strong answer shape:

  • distinguish random split from scaffold split from time split
  • name the specific leakage mechanism: molecules sharing a chemical scaffold appear in both partitions, so a model can learn scaffold features rather than activity features
  • say which split fits which evaluation goal: scaffold split for virtual screening generalization, time split for prospective validation
  • note that the same logic applies to protein homology splits in structure prediction

The number worth remembering: on ChEMBL-scale datasets, a random split typically inflates AUROC by 5–15 points relative to a scaffold split. That gap measures scaffold memorization, not activity generalization.
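
A minimal sketch of the scaffold split itself, assuming RDKit's MurckoScaffold utilities; the grouping step is what keeps any one scaffold out of both partitions:

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, test_frac=0.2):
    """Assign whole Bemis-Murcko scaffold groups to train or test so that
    no scaffold appears in both partitions."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # unparseable SMILES are dropped, not silently split
        groups[MurckoScaffold.MurckoScaffoldSmiles(mol=mol)].append(idx)

    train_target = int((1.0 - test_frac) * len(smiles_list))
    train_idx, test_idx = [], []
    # Largest scaffold families fill train first; the long tail of rare
    # scaffolds lands in test, which is the harder generalization setting.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(group) <= train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```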

2. “How do you handle class imbalance in bioassay prediction?”

Good answer:

  • state the typical ratio (100:1 to 1000:1 negative-to-active)
  • describe loss reweighting with pos_weight in BCE as the simplest stateless option
  • describe WeightedRandomSampler as useful but requiring additional checkpoint state
  • say which metric matters: AUROC and EF@1% for hit identification; precision@K for fixed-capacity screening
  • say explicitly that accuracy is misleading at high imbalance

Strong addition:

“I prefer loss reweighting over oversampling in distributed training because it does not change the sampler state contract. With oversampling, I need to checkpoint the sampler’s internal RNG on top of the consumed-batch count, which adds a correctness surface that does not exist with stateless loss weighting.”
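
A minimal sketch of the stateless option, assuming a PyTorch binary classification head; the imbalance correction lives entirely in the loss, so there is nothing extra to checkpoint:

```python
import torch
import torch.nn as nn

# Estimated from the training split: roughly 200 negatives per active.
neg_per_pos = 200.0
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([neg_per_pos]))

# Stand-in logits and labels; in practice they come from the model head
# and the assay table for the current micro-batch.
logits = torch.randn(32, 1, requires_grad=True)
labels = torch.bernoulli(torch.full((32, 1), 0.005))  # ~0.5% actives

# Actives contribute ~200x more to the loss; the sampler is untouched,
# so no sampler RNG state enters the checkpoint contract.
loss = criterion(logits, labels)
loss.backward()
```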

3. “What is different about training a protein language model compared to an NLP language model?”

Key differences:

| Dimension | NLP LM | Protein LM |
| --- | --- | --- |
| Vocabulary | 30K–100K tokens | 20–33 amino acid tokens |
| Sequence length | 512–4096 typical | 10–35,000; heavy-tailed |
| Memory bottleneck | Embedding table + KV cache | Activation footprint per sequence |
| Augmentation | Masking, back-translation | Almost none; sequence is fixed chemistry |
| Evaluation signal | Perplexity + downstream NLP benchmarks | Structure prediction accuracy, fitness prediction |

The operational implication: protein LMs need curriculum training by sequence length to avoid OOM events from long-tail sequences early in training, and they benefit more from activation checkpointing than NLP models at equivalent parameter counts because the activation-to-parameter ratio is higher.
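
One way to express that curriculum, as a sketch with illustrative bucket boundaries; the point is that the long tail only enters once training has stabilized on short sequences:

```python
import numpy as np


def length_curriculum(lengths, boundaries=(256, 1024, 4096, 35_000)):
    """Partition dataset indices into length buckets; training consumes the
    buckets in order, so 30k-residue outliers never hit the earliest steps.
    Boundary values are illustrative, not tuned."""
    lengths = np.asarray(lengths)
    buckets, lower = [], 0
    for upper in boundaries:
        in_bucket = (lengths > lower) & (lengths <= upper)
        buckets.append(np.flatnonzero(in_bucket))
        lower = upper
    return buckets


# Heavy-tailed toy lengths: the last bucket carries the OOM risk.
for i, idx in enumerate(length_curriculum([120, 310, 2400, 980, 18_000, 64, 700])):
    print(f"bucket {i}: {idx.tolist()}")
```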

4. “When would you use sequence parallelism?”

Good answer:

  • define it precisely: the sequence axis is partitioned across ranks and each rank operates on a contiguous subsequence
  • state the trigger: when a single sequence no longer fits in device memory even at micro_batch_size=1
  • state the cost: one all-gather per attention layer across the sequence-parallel group
  • state the alternative: activation checkpointing reduces memory but does not help when the forward pass itself cannot fit

Concrete threshold:

“At L=2048 with a pair channel width of 128, a single pair representation in bfloat16 is 1 GiB. At that point, micro_batch_size=1 still exceeds a 40 GB device once you account for activations, optimizer state, and model parameters. Sequence parallelism becomes the required path.”
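
The arithmetic behind that number, as a back-of-envelope sketch (the 128-channel pair width is an assumption in line with AlphaFold-style models):

```python
# Back-of-envelope memory for one pair representation in bfloat16.
seq_len = 2048
pair_channels = 128   # assumed pair channel width
bytes_per_elem = 2    # bfloat16

pair_rep_bytes = seq_len * seq_len * pair_channels * bytes_per_elem
print(f"{pair_rep_bytes / 2**30:.2f} GiB per pair representation")  # 1.00 GiB
```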

5. “How do you handle multi-task training across dozens of assays with sparse labels?”

Strong answer:

  1. use a per-task binary mask to exclude missing labels from the loss computation; the mask is not optional
  2. apply per-task loss weighting based on assay reliability or sample count
  3. confirm that missing labels are truly missing, not negative—this is a common silent error in bioassay databases where an untested compound is stored as zero
  4. monitor per-task validation AUROC separately; global aggregate loss hides task-level degradation

What to say when pushed on the missing-label point:

“Treating missing labels as negatives is one of the most common silent errors in multi-task bioassay models. It systematically biases every head toward predicting inactivity, and it can take dozens of training runs before anyone checks the per-task breakdown.”
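
A minimal sketch of the masked loss from point 1 of the list above, assuming per-compound label and mask tensors with one column per assay; untested entries contribute zero gradient instead of being read as negatives:

```python
import torch
import torch.nn.functional as F


def masked_multitask_bce(logits, labels, mask, task_weights=None):
    """BCE over (batch, n_tasks) where mask == 1 only for measured labels.
    Untested (compound, assay) pairs are excluded from the loss entirely."""
    per_element = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    per_element = per_element * mask
    if task_weights is not None:   # e.g. per-assay reliability weights
        per_element = per_element * task_weights
    # Normalize by the number of observed labels, not by batch * n_tasks.
    return per_element.sum() / mask.sum().clamp(min=1)


# Toy batch: 4 compounds, 3 assays, most labels missing.
logits = torch.randn(4, 3)
labels = torch.tensor([[1., 0., 0.], [0., 0., 0.], [0., 1., 0.], [1., 0., 1.]])
mask   = torch.tensor([[1., 1., 0.], [0., 0., 0.], [0., 1., 0.], [1., 0., 1.]])
print(masked_multitask_bce(logits, labels, mask))
```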

6. “How do you think about distribution shift in de novo molecular generation?”

```mermaid
flowchart TD
  A[Train: known actives and decoys] --> B[Generative model]
  B --> C[Generated molecules]
  C --> D{Evaluation}
  D --> E[Proxy metrics: validity, novelty, diversity]
  D --> F[Oracle: docking or ADMET]
  D --> G[Wet lab confirmation]
  E --> H[Fast but not predictive]
  F --> H
  G --> I[Ground truth: weeks to months]
```
Generative models are evaluated on proxy metrics until wet lab results arrive. Distribution shift is invisible until that point.

Key points:

  • train/test split is irrelevant if you are generating novel chemistry outside the training manifold
  • standard metrics (validity, uniqueness, diversity) do not predict biological activity
  • the practical pipeline is: generate candidates → score with an oracle → submit a subset for wet lab → wait
  • distribution shift is not detectable from model metrics alone; it surfaces in wet lab hit rates against historical baselines
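
A sketch of those standard proxy metrics, assuming RDKit for validity checks and pre-canonicalized training SMILES; they are cheap to compute, and none of them measure activity:

```python
from rdkit import Chem


def proxy_metrics(generated_smiles, training_smiles):
    """Validity, uniqueness, and novelty for a generated batch.
    Assumes training_smiles are already canonical SMILES strings."""
    canonical = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    validity = len(canonical) / max(1, len(generated_smiles))
    unique = set(canonical)
    uniqueness = len(unique) / max(1, len(canonical))
    novelty = len(unique - set(training_smiles)) / max(1, len(unique))
    return validity, uniqueness, novelty


print(proxy_metrics(["CCO", "c1ccccc1", "not_a_smiles"], ["CCO"]))
# (0.666..., 1.0, 0.5) -- and still says nothing about wet lab hit rate
```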

7. “How do you design a training pipeline for an active learning loop?”

The components that change relative to a one-shot pipeline:

| Stage | One-shot training | Active learning |
| --- | --- | --- |
| Dataset | Fixed before training | Grows after each wet lab round |
| Sampler | One epoch definition | Must handle new data mid-run |
| Checkpoint | Resume at step | Resume at acquisition round boundary |
| Evaluation | Hold-out test set | Hold-out plus prospective wet lab results |

The operational challenge is that the boundary between training and evaluation becomes a workflow step, not a code path. The training system must accept a new labeled batch, re-partition the data, and continue without restarting from scratch.

Strong interview addition:

“I would model the active learning round as a checkpoint event: save the full state, validate the new labeled batch, extend the dataset, rebuild the sampler with the updated indices, and restore from checkpoint. That keeps the training loop unchanged and moves loop management into the orchestrator.”
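
A sketch of that orchestration; every name here (the trainer methods, validate_labels, build_sampler) is hypothetical and stands in for whatever the real pipeline exposes:

```python
def run_acquisition_round(trainer, dataset, new_labeled_batch,
                          validate_labels, build_sampler):
    """Treat the active learning round boundary as a checkpoint event.
    The inner training loop never learns that the dataset changed; only
    the orchestrator does. All interfaces are illustrative stand-ins."""
    ckpt_path = trainer.save_checkpoint()                    # 1. full state to disk
    validate_labels(new_labeled_batch)                       # 2. reject malformed assay rows
    dataset.extend(new_labeled_batch)                        # 3. grow the dataset
    sampler = build_sampler(dataset, seed=trainer.step)      # 4. rebuild over the new indices
    trainer.restore_checkpoint(ckpt_path, sampler=sampler)   # 5. resume, no cold restart
    return trainer
```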

8. “How do you evaluate a model when wet lab validation takes weeks?”

Good answer:

  1. use computational proxies during development: docking scores, in silico ADMET, molecular dynamics
  2. maintain a held-out set of historical assay results the model has never seen
  3. track prediction–experiment correlation as prospective data arrives
  4. treat wet lab hit rate, not model AUROC, as the ultimate success metric

The staff-level sentence:

“A model with AUROC 0.85 that produces a 2% wet lab hit rate at 10% virtual screening cutoff is worse than a model with AUROC 0.80 that produces an 8% hit rate. Computational metrics are proxies for what actually matters.”
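
A sketch of the prospective metric that sentence elevates, computed once assay results come back (names and data are illustrative):

```python
import numpy as np


def hit_rate_at_cutoff(scores, wet_lab_hit, cutoff_frac=0.10):
    """Fraction of confirmed wet lab hits among the top-scoring
    cutoff_frac of the screened library."""
    scores = np.asarray(scores)
    wet_lab_hit = np.asarray(wet_lab_hit, dtype=bool)
    k = max(1, int(cutoff_frac * len(scores)))
    top_k = np.argsort(scores)[::-1][:k]
    return wet_lab_hit[top_k].mean()


# Placeholder data; in practice scores come from the model and hits from
# the assay, weeks later. This number, not AUROC, is the success metric.
scores = np.random.rand(1000)
hits = np.random.rand(1000) < 0.02
print(f"hit rate at 10% cutoff: {hit_rate_at_cutoff(scores, hits):.3f}")
```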

9. “How does SE(3) equivariance constrain your training pipeline?”

```mermaid
flowchart LR
  A[3D coordinates] --> B[Canonicalize frame]
  B --> C[Equivariant network]
  C --> D[Invariant scalar output]
  B -.->|Must precede sharding| E[Data partition into ranks]
  D -.->|Batch norm breaks equivariance| F[Normalization audit]
```
Frame canonicalization must precede distributed sharding. Rank-local re-centering of a molecular shard is geometrically incorrect.

Three constraints to state explicitly:

  1. Frame canonicalization before sharding. Re-centering or aligning a structure must happen in preprocessing. A DataLoader worker that receives a fragment of a protein cannot independently re-center because it lacks the full structural context.
  2. No rotation augmentation. An equivariant model is rotation-invariant by construction. Augmenting with random rotations wastes compute and can degrade training when equivariance is approximate.
  3. Layer normalization only. Batch normalization aggregates across the batch dimension. For variable-size molecular graphs, batch statistics are inconsistent. Layer normalization operates per node and is always safe.
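
A sketch of constraint 1, assuming coordinates arrive as an (N, 3) array; the centering runs once in preprocessing, against the full structure, before any partitioning:

```python
import numpy as np


def canonicalize_frame(coords):
    """Center a complete structure at its centroid during preprocessing.
    Must see the full (N, 3) array: a DataLoader worker holding only a
    fragment would compute a different centroid and shift the frame."""
    coords = np.asarray(coords, dtype=np.float64)
    return coords - coords.mean(axis=0, keepdims=True)


# Canonicalize once, write to the dataset, then let the distributed
# sampler partition the already-centered examples across ranks.
coords = np.random.rand(350, 3) * 50.0   # toy 350-residue structure
canonical = canonicalize_frame(coords)
assert np.allclose(canonical.mean(axis=0), 0.0)
```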

10. “What would you page on in a protein model training run?”

| Signal | Why page-worthy |
| --- | --- |
| OOM on one rank during sequence ingestion | Long-tail sequence entered training; curriculum or length filter failed |
| Per-task AUROC collapse on one assay head | Task-specific data issue or label corruption |
| Gradient norm spike followed by loss explosion | Numerical instability common with long sequences in float16 |
| Wet lab hit rate below historical baseline | Distribution shift or generalization failure that metrics did not catch |
| Checkpoint age exceeds 2× expected interval | I/O stall or storage contention in a checkpoint-heavy run |
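
A sketch of how two of those rows might become alert rules, with thresholds and inputs chosen purely for illustration:

```python
import time


def paging_checks(last_checkpoint_ts, expected_interval_s,
                  grad_norm_history, spike_factor=10.0):
    """Return page-worthy conditions from the table above. Thresholds are
    illustrative; real values come from the run's own history."""
    alerts = []
    if time.time() - last_checkpoint_ts > 2 * expected_interval_s:
        alerts.append("checkpoint age exceeds 2x expected interval")
    if len(grad_norm_history) >= 2:
        baseline = sum(grad_norm_history[:-1]) / (len(grad_norm_history) - 1)
        if grad_norm_history[-1] > spike_factor * baseline:
            alerts.append("gradient norm spike")
    return alerts


# Example: last checkpoint two hours ago on a 30-minute cadence, plus a
# sudden jump in gradient norm -> both conditions page.
print(paging_checks(time.time() - 7200, 1800, [1.1, 0.9, 1.0, 14.0]))
```
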
  • “Scaffold split is the correctness bar; random split is the optimism bar.”
  • “Missing labels in a bioassay database are not negatives. Treating them as negatives is a silent training error.”
  • “Equivariant network training imposes preprocessing contracts that move complexity out of the training loop and into the data pipeline.”
  • “Wet lab hit rate is the only metric the business actually cares about. Computational metrics exist to let us move faster while waiting for it.”
  • “Active learning turns the dataset contract from a precondition into an ongoing invariant.”