# Molecular and Genomic Data Pipelines
Generic image or tabular datasets rarely expose the data pipeline complexity that biological datasets introduce. Molecules are graphs with variable topology. Protein sequences span two orders of magnitude in length. Genomic matrices are large and sparse. Each changes the pipeline design in ways that surface as correctness failures, not just performance problems.
## The Biological Data Landscape

| Data type | Representation | Training challenge |
|---|---|---|
| Small molecules | SMILES, InChI, molecular graphs | Variable atom count, chirality, canonicalization |
| Protein sequences | Amino acid strings | Length variation from 10 to 35,000+ residues |
| Multiple sequence alignments | N sequences × L positions | Variable N and L; large tensors with gap tokens |
| 3D protein structures | Coordinates, torsion angles | SE(3) invariance; coordinate frame sensitivity |
| scRNA-seq | Sparse cell × gene count matrices | 90–99% zeros; high gene dimensionality |
| Bioassay labels | IC50, % inhibition, binary hit | Severe class imbalance; censored measurements |
## Data Path for Molecular Training

```mermaid
flowchart LR
    A[SMILES / FASTA store] --> B[Canonicalization + validity filter]
    B --> C[Featurization: graph or token]
    C --> D[Scaffold-aware split]
    D --> E[Rank-aware sampler]
    E --> F[Variable-length collation]
    F --> G[Trainer step]
```
## Scaffold Splits

Random train/test splits in drug discovery produce optimistic evaluation. Molecules that share a chemical scaffold appear in both splits, leaking structural information from test into train.
A Murcko scaffold split groups molecules by their ring framework and assigns entire scaffold groups to a single partition:
```python
import random
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(
    smiles_list: list[str],
    train_frac: float = 0.8,
    seed: int = 0,
) -> tuple[list[int], list[int]]:
    # Group molecule indices by Murcko scaffold so that a scaffold never
    # straddles the train/test boundary.
    scaffold_to_indices: dict[str, list[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    rng = random.Random(seed)
    scaffold_sets = list(scaffold_to_indices.values())
    rng.shuffle(scaffold_sets)

    # Assign whole scaffold groups to train until the size cutoff is reached.
    train_indices: list[int] = []
    test_indices: list[int] = []
    cutoff = int(len(smiles_list) * train_frac)
    for group in scaffold_sets:
        if len(train_indices) < cutoff:
            train_indices.extend(group)
        else:
            test_indices.extend(group)
    return train_indices, test_indices
```

The same logic applies when sharding across distributed ranks: the scaffold group, not the individual molecule, is the unit of partition assignment. Splitting a scaffold group across train and evaluation ranks defeats the purpose of the split.
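The sketch below shows one way to do that group-level assignment across ranks. It is a minimal illustration, not part of the split function above: the `shard_scaffold_groups` name and the greedy least-loaded heuristic are assumptions introduced here for clarity.

```python
def shard_scaffold_groups(
    scaffold_groups: list[list[int]],
    world_size: int,
    rank: int,
) -> list[int]:
    # Hypothetical helper: assign whole scaffold groups to ranks so that no
    # group is ever split across ranks. Every rank runs the same deterministic
    # assignment and keeps only its own share.
    ordered = sorted(scaffold_groups, key=len, reverse=True)
    loads = [0] * world_size
    mine: list[int] = []
    for group in ordered:
        # Greedy least-loaded assignment keeps per-rank molecule counts balanced.
        target = min(range(world_size), key=loads.__getitem__)
        loads[target] += len(group)
        if target == rank:
            mine.extend(group)
    return mine
```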
On ChEMBL-scale datasets, a random split typically inflates AUROC by 5–15 points relative to a scaffold split. That gap measures scaffold memorization, not activity generalization.
## Variable-Length Collation

Standard DataLoader collation fails on sequences or graphs of unequal length in the same batch.
| Strategy | When to use | Tradeoff |
|---|---|---|
| Padding + attention mask | Fixed-depth transformers, protein encoders | Wasted compute on pad tokens; occupancy drops for long-tail batches |
| Dynamic length bucketing | Wide length distributions | Reduces padding waste; complicates sampler and resume tracking |
| Graph-level batching | Molecular GNNs | Requires specialized collation; batch vector tracks graph membership |
```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_sequences(
    batch: list[dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    # Right-pad every sequence in the batch to the length of the longest one.
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch],
        batch_first=True,
        padding_value=0,
    )
    # Boolean mask: True on real tokens, False on padding.
    attention_mask = pad_sequence(
        [torch.ones(len(item["input_ids"]), dtype=torch.bool) for item in batch],
        batch_first=True,
        padding_value=False,
    )
    labels = torch.stack([item["label"] for item in batch])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

The attention mask is not cosmetic. Padded positions contribute to loss unless explicitly masked. In a distributed trainer, every rank must apply identical masking logic or gradient semantics diverge silently.
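To make the loss point concrete, here is a minimal sketch of a masked token-level loss, assuming a per-token cross-entropy objective; the `logits` and `targets` shapes are assumptions about the model, not part of the collate function above.

```python
import torch
import torch.nn.functional as F


def masked_token_loss(
    logits: torch.Tensor,          # (batch, seq_len, vocab)
    targets: torch.Tensor,         # (batch, seq_len) integer token ids
    attention_mask: torch.Tensor,  # (batch, seq_len), True on real tokens
) -> torch.Tensor:
    # Per-token loss, with padded positions zeroed out before averaging.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    per_token = per_token * attention_mask
    # Normalize by the number of real tokens, not by batch * seq_len.
    return per_token.sum() / attention_mask.sum().clamp(min=1)
```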
## Molecular Graph Batching

For GNNs on molecules, torch_geometric represents a batch as a single large disconnected graph with a batch vector that maps each node back to its source molecule:
```python
import torch
from rdkit import Chem
from torch_geometric.data import Data


def smiles_to_graph(smiles: str, label: float) -> Data | None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES: let the caller filter it out

    # Per-atom features: atomic number, degree, aromaticity flag.
    node_features = torch.tensor(
        [
            [atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic())]
            for atom in mol.GetAtoms()
        ],
        dtype=torch.float,
    )
    # Each undirected bond becomes two directed edges (src->dst and dst->src).
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    if not edges:
        edge_index = torch.zeros((2, 0), dtype=torch.long)
    else:
        src, dst = zip(*edges)
        edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)

    return Data(x=node_features, edge_index=edge_index, y=torch.tensor([label]))
```

The batch vector produced by Batch.from_data_list() enables graph-level readout (global mean pool, global add pool) to produce one embedding per molecule rather than one per atom. Without it, pooling averages across all atoms in the concatenated batch, producing meaningless embeddings.
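A short usage sketch of that batch vector; `node_emb` here stands in for whatever the GNN layers would actually produce, and the two SMILES strings are arbitrary examples.

```python
from torch_geometric.data import Batch
from torch_geometric.nn import global_mean_pool

graphs = [smiles_to_graph("CCO", 1.0), smiles_to_graph("c1ccccc1", 0.0)]
batch = Batch.from_data_list([g for g in graphs if g is not None])

# batch.batch maps every node to the index of its source molecule, so pooling
# yields one embedding per molecule rather than one per atom.
node_emb = batch.x  # stand-in for the output of the GNN layers
per_molecule = global_mean_pool(node_emb, batch.batch)  # (num_molecules, feature_dim)
```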
## Sparse Genomic Data

Single-cell RNA-seq count matrices are 90–99% zeros. Loading them as dense tensors materializes gigabytes of zeros before any computation.
```mermaid
flowchart TD
    A[Raw count matrix: cells × genes] --> B{Storage format}
    B --> C[AnnData h5ad: CSR on disk]
    B --> D[Dense in-memory: prohibitive at atlas scale]
    C --> E[Row-by-row slice in Dataset.__getitem__]
    E --> F[to_dense on batch only]
    F --> G[Normalize and log-transform]
    G --> H[Trainer step]
```
```python
import scipy.sparse
import torch
from torch.utils.data import Dataset


class SingleCellDataset(Dataset):
    def __init__(
        self,
        counts: scipy.sparse.csr_matrix,
        labels: torch.Tensor,
        target_sum: float = 1e4,
    ) -> None:
        self.counts = counts
        self.labels = labels
        self.target_sum = target_sum

    def __len__(self) -> int:
        return self.counts.shape[0]

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        # Densify a single row only; the full matrix stays sparse in memory.
        row = torch.tensor(self.counts[idx].toarray().ravel(), dtype=torch.float32)
        # Library-size normalization to target_sum counts per cell, then log1p.
        row = torch.log1p(row / (row.sum() + 1e-6) * self.target_sum)
        return {"expression": row, "label": self.labels[idx]}
```
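One way to feed this dataset from an .h5ad file is sketched below; the file name and the obs column "cell_type" are placeholders, and the label encoding is an assumption rather than part of the pipeline above.

```python
import anndata
import scipy.sparse
import torch
from torch.utils.data import DataLoader

adata = anndata.read_h5ad("atlas.h5ad")         # placeholder path
counts = scipy.sparse.csr_matrix(adata.X)       # keep the count matrix sparse
labels = torch.tensor(
    adata.obs["cell_type"].astype("category").cat.codes.to_numpy(),  # assumed label column
    dtype=torch.long,
)

loader = DataLoader(SingleCellDataset(counts, labels), batch_size=256, shuffle=True)
```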
## Class Imbalance in Bioassay Data

High-throughput screening data routinely produces 100:1 to 1000:1 negative-to-active ratios.
| Approach | Mechanism | Resume complication |
|---|---|---|
| Loss reweighting (pos_weight) | Penalize false negatives more | None; stateless |
| WeightedRandomSampler | Oversample actives per epoch | Must save sampler RNG state on top of consumed count |
| Focal loss | Down-weight easy negatives | Extra hyperparameter; may degrade calibration |
Weighted sampling changes the epoch definition, which makes resume correctness harder. If the checkpoint captures a consumed count from a DistributedSampler but not the internal state of a WeightedRandomSampler, resume silently restarts sampling from a different distribution.
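A minimal sketch of the stateless option from the table, using pos_weight in BCEWithLogitsLoss; the 100:1 ratio and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# With roughly 100 negatives per active, weight errors on positives ~100x.
# The reweighting lives entirely in the loss, so the sampler and its RNG
# state are untouched and resume semantics stay simple.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([100.0]))

logits = torch.randn(8, 1)                     # model outputs for one batch
targets = torch.randint(0, 2, (8, 1)).float()  # binary activity labels
loss = criterion(logits, targets)
```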
The strong interview sentence:
“I prefer loss reweighting over oversampling in distributed training because it does not change the sampler state contract. With oversampling I need to checkpoint the sampler’s internal RNG on top of the consumed-batch count, which adds a correctness surface that does not exist with stateless loss weighting.”
## Metrics That Reflect Biological Reality

Raw loss and accuracy are almost uninformative for drug discovery models.
| Metric | Definition | Why it matters |
|---|---|---|
| AUROC | Area under ROC curve | Threshold-free; standard for binary classification on imbalanced data |
| BEDROC | Boltzmann-enhanced discrimination ROC | Emphasizes early enrichment; reflects virtual screening economics |
| EF@1% | (actives in top 1%) / (expected by chance) | Standard KPI for hit identification campaigns |
| Scaffold generalization gap | Train-scaffold AUROC minus test-scaffold AUROC | Quantifies leakage; >0.1 suggests scaffold overfitting |
| Precision@K | Fraction of true actives in top K predictions | Operationally relevant when wet lab capacity is fixed at K compounds |
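Of these, EF@1% is simple enough to compute inline. A quick sketch, assuming scores and labels arrive as NumPy arrays with higher scores meaning more likely active:

```python
import numpy as np


def enrichment_factor(scores: np.ndarray, labels: np.ndarray, frac: float = 0.01) -> float:
    # EF@frac: hit rate in the top frac of the ranked list divided by the
    # hit rate expected by chance over the whole library.
    n_top = max(1, int(round(len(scores) * frac)))
    top_idx = np.argsort(-scores)[:n_top]   # highest-scoring compounds first
    hit_rate_top = labels[top_idx].mean()
    hit_rate_all = labels.mean()
    return float(hit_rate_top / hit_rate_all) if hit_rate_all > 0 else float("nan")
```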
## Staff-Level Tradeoffs

| Decision | Why you choose it | What it costs |
|---|---|---|
| Scaffold split over random | Realistic generalization estimate | Fewer training molecules; noisier split variance |
| Dynamic batching over fixed padding | Better GPU occupancy | Complicates sampler resume tracking |
| Sparse loading for genomics | Memory-safe at atlas scale | More complex collation; harder to pin memory |
| Loss reweighting over oversampling | Simpler resume semantics | May produce poorly calibrated probabilities |
| AUROC over accuracy | Appropriate for imbalanced labels | Less interpretable to non-ML stakeholders |