Session 3 · Feb 21, 2026
Added classification head on pretrained encoder; fine-tuned on ClinVar SNVs with
chr8 held out — AUROC 0.91
Intent
- With pretraining converged, wanted to evaluate the model's learned representations on a real
downstream task
- Chose ClinVar pathogenicity classification as first benchmark — well-studied, clinically relevant,
clear labels
- Needed to build a fine-tuning pipeline: freeze encoder for first 5 epochs, then unfreeze with lower
LR
Expected
- Binary classification (pathogenic vs. benign) for single nucleotide variants
- Target AUROC > 0.85 to be competitive with existing genomic language models
Result
- Created tasks/clinvar.py with a linear classification head on the [CLS]-equivalent embedding
- Extracted ±1 kb flanking context for each variant from hg38; ref and alt sequences as paired inputs
- Two-phase training: 5 epochs frozen encoder (LR 1e-3), then 15 epochs full fine-tune (LR 5e-5)
- Achieved AUROC 0.91 on held-out chr8 — above initial target; precision 0.87, recall 0.84
- Noted class imbalance (3:1 benign:pathogenic); applied focal loss which improved recall from 0.79 to
0.84
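The focal-loss fix above can be sketched as follows. This is a minimal binary focal loss; the α/γ defaults are illustrative assumptions, not the values recorded in the run:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Focal loss for imbalanced binary classification: down-weights
    well-classified examples so the rarer pathogenic class contributes
    more to the gradient. alpha/gamma are illustrative defaults."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With gamma=0 and alpha=0.5 this reduces to half the ordinary BCE, which makes for a quick sanity check.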
Artifacts
tasks/clinvar.py
data/clinvar_snvs.tsv
configs/finetune_clinvar.yaml
train_finetune.py
Strong classification results raised the question: what is the model actually
learning? Decided to inspect attention patterns for biological interpretability
Head 7 in layer 4 strongly attends to splice donor/acceptor motifs (GT–AG); head 2
in layer 6 tracks TATA-like promoter elements
Intent
- Wanted to validate that high AUROC reflects biologically meaningful features, not just sequence
composition bias
- Planned to extract attention matrices from each head and correlate with known regulatory annotations
from ENCODE
- Focus on splice sites and promoter motifs as they are well-characterized ground truth
Expected
- At least some attention heads should show enrichment at known functional motifs
- If no heads correlate, the model may be overfitting on trivial features
Result
- Extracted attention weights from all 8 heads × 6 layers across 500 ClinVar variant regions
- Head L4-H7 shows striking enrichment at canonical splice donor (GT) and acceptor (AG) dinucleotides — Pearson r=0.72 with MaxEntScan scores
- Head L6-H2 shows weak but consistent enrichment at TATA-box positions in promoter-proximal regions
- Remaining heads appear to encode positional or GC-content features — no clear biological motif
- Saved visualizations using matplotlib heatmaps overlaid on sequence logos
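The per-head enrichment test reduces to something like the sketch below (assumed shapes; the real analysis in interpret/attention_analysis.py presumably loops this over all 8 × 6 layer/head pairs and averages over regions):

```python
import numpy as np

def motif_attention_correlation(attn, motif_scores):
    """attn: (seq, seq) attention matrix for one head on one region.
    motif_scores: (seq,) per-position motif strength (e.g. MaxEntScan).
    Correlates the attention each position *receives* (column mean over
    queries) with the motif score at that position."""
    received = attn.mean(axis=0)
    return float(np.corrcoef(received, motif_scores)[0, 1])
```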
Artifacts
interpret/attention_analysis.py
notebooks/attention_motifs.ipynb
figures/head_L4H7_splice.png
Attention head analysis confirmed biological learning; wanted finer-grained,
per-nucleotide importance for individual variant predictions
Implemented gradient × input and integrated gradients for per-nucleotide importance; working on
comparing against ISM (in-silico mutagenesis) as ground truth
Intent
- Attention maps show head-level patterns, but clinicians need per-nucleotide importance for a
specific variant
- Planned two methods: gradient × input (fast) and integrated gradients (more accurate)
- Needed to validate attributions against ISM (exhaustive single-nt mutation scan) as gold standard
Expected
- Attribution scores should peak at the variant position and flanking regulatory elements
- Correlation with ISM scores should be r > 0.6 if attributions are trustworthy
Result
- Implemented gradient_x_input() — fast, runs in ~200ms per variant
- Integrated gradients implemented but slow (~8s per variant with 50 interpolation steps)
- Preliminary comparison on 20 variants: gradient × input vs. ISM correlation r=0.58 — slightly below
target
- Still running full ISM benchmark on 500 variants (estimated 6h on A100)
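The fast attribution amounts to the sketch below. It assumes a model mapping a one-hot sequence tensor to a scalar logit; the actual gradient_x_input() in interpret/attribution.py may differ in interface:

```python
import torch

def grad_x_input(model, one_hot):
    """one_hot: (1, seq_len, 4) one-hot encoded sequence.
    Per-nucleotide importance: gradient of the output logit w.r.t. the
    input, multiplied elementwise by the input and summed over the
    4-channel axis, leaving one score per position."""
    x = one_hot.clone().requires_grad_(True)
    model(x).sum().backward()
    return (x.grad * x).sum(dim=-1)  # (1, seq_len)
```

For a linear model the attribution at each position recovers exactly the weight of the nucleotide present there, which is a convenient correctness check.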
Artifacts
interpret/attribution.py
interpret/ism.py
scripts/run_ism_benchmark.sh
Session 2 · Feb 20, 2026
OOM at batch 64 / 4 kb context on A100; FlashAttention-2 + gradient checkpointing
reduced peak VRAM from 78 GB to 31 GB
Intent
- Session 1 ended with OOM crash at batch_size=64 on 4 kb context windows on A100 80 GB
- Profiling with torch.cuda.memory_summary() showed attention matrices consuming ~42 GB for full O(n²) attention
- Two strategies: (1) FlashAttention-2 for memory-efficient attention, (2) gradient checkpointing to
trade compute for memory
Result
- Installed flash-attn==2.5.6; replaced nn.MultiheadAttention with flash_attn.flash_attn_func
- Enabled torch.utils.checkpoint.checkpoint() on every other transformer block
- Peak VRAM dropped from ~78 GB to ~31 GB; batch_size=64 now fits comfortably on A100 80 GB
- Training throughput decreased ~15% due to checkpointing recomputation — acceptable trade-off
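The every-other-block checkpointing pattern can be sketched as below (the FlashAttention swap itself is kernel-specific and omitted; block contents here are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Runs every other block under activation checkpointing, so those
    blocks' intermediate activations are recomputed during backward
    instead of stored — trading some throughput for VRAM."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % 2 == 0:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```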
Artifacts
model/transformer.py
requirements.txt
configs/train_pretrain.yaml
OOM resolved — before launching full pretraining, reconsidered whether naive
character-level tokenization was optimal for long genomic sequences
Replaced 1-mer tokenization with BPE (vocab 4,096) trained on hg38 — 3.2×
compression, reducing 4 kb sequences to ~1,280 tokens
Intent
- Character-level tokenization (A/C/G/T) creates very long sequences (4,096 tokens for 4 kb) —
expensive even with FlashAttention
- Hypothesis: BPE can learn biologically meaningful subword units (common k-mers, repeat elements)
while compressing sequence length
- Considered alternatives: fixed 6-mer (4,096 vocab) vs. BPE — BPE can adapt to genome-specific
frequency distributions
Result
- Trained BPE tokenizer with the tokenizers library on chr1–chr22 of hg38 (3.1 Gbp)
- Final vocabulary: 4,096 tokens; average compression ratio 3.2× (4,096 nt → ~1,280 tokens)
- Inspected top tokens: many correspond to common dinucleotide repeats (CpG, CA/TG), Alu elements, and
poly-A stretches
- Concern noted: some canonical splice motifs (GT-AG) get split across token
boundaries — may need motif-aware constraints
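The merge logic behind the compression is, in miniature, the following toy greedy BPE on a single string (the actual tokenizer was trained with the tokenizers library over hg38, not this sketch):

```python
from collections import Counter

def learn_bpe_merges(seq, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.
    Common repeats (CA/TG dinucleotides, poly-A runs) get fused into
    single tokens, which is where the sequence compression comes from."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair worth merging
            break
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```

The same mechanism explains the noted concern: a merge boundary can land inside a GT-AG splice motif, splitting it across tokens.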
Artifacts
tokenizer/train_bpe.py
tokenizer/bpe_4096.json
data/genome_loader.py
Tokenizer ready and memory issues resolved — now had all components to launch
the full pretraining run
50k steps of MLM pretraining on chromosomes 1–7; validation perplexity on chr8
converged to 4.21
Intent
- All components ready: FlashAttention model, BPE tokenizer, data loaders — time for full pretraining
- Used chr1–chr7 as training set (~1.3 Gbp), chr8 as validation (~145 Mbp), chr9 held out for final
test
- Target: perplexity < 5.0 on validation, competitive with Nucleotide Transformer (reported ~4.5 at
150M scale)
Result
- Training completed in ~14 hours on 4× A100 with DeepSpeed ZeRO-2
- Final validation perplexity: 4.21 — better than initial target, likely due to BPE compression
reducing effective sequence complexity
- Loss curves smooth; no signs of overfitting (train perplexity 3.94 vs. val 4.21)
- Saved checkpoints every 10k steps to checkpoints/pretrain/
- WandB run logged: learning rate schedule (linear warmup 2k steps, cosine decay to 1e-5)
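The logged schedule (linear warmup over 2k steps, cosine decay to 1e-5) corresponds to roughly the function below; the peak LR is an assumption, since the log records only the warmup length and the floor:

```python
import math

def lr_at_step(step, total_steps=50_000, warmup=2_000, peak=1e-4, floor=1e-5):
    """Linear warmup to `peak`, then cosine decay to `floor`.
    `peak` is a placeholder value, not taken from the run config."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total_steps - warmup)  # 0 → 1 over decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```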
Artifacts
train_pretrain.py
configs/train_pretrain.yaml
checkpoints/pretrain/step_50000/
scripts/launch_distributed.sh
Session 1 · Feb 19, 2026
Set up AlphaGenome project structure; built FASTA data loader with pysam for
streaming hg38 windows with configurable context length
Intent
- Starting a new genomic foundation model project inspired by Nucleotide Transformer and DNABERT-2
- Needed efficient data pipeline for sampling random genomic windows from hg38 reference genome
- Required: streaming reads (genome too large to load in memory), configurable window size, N-masking
for assembly gaps
Result
- Created project structure: model/, data/, tokenizer/, tasks/, configs/, scripts/
- Built GenomeDataset class using pysam.FastaFile for random-access reads from indexed FASTA
- Supports configurable window sizes (512 nt to 8 kb), chromosome filtering, and N-base exclusion
- DataLoader with 8 workers achieves ~12k windows/sec throughput on NVMe SSD
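The sampling logic can be sketched independently of pysam (hypothetical interface; the real GenomeDataset would back `fetch` with pysam.FastaFile.fetch on the indexed hg38 FASTA):

```python
import random

class GenomeWindowSampler:
    """Rejection-samples random genomic windows, skipping windows whose
    fraction of N bases (assembly gaps) exceeds max_n_frac."""
    def __init__(self, fetch, chrom_sizes, window=2048, max_n_frac=0.1):
        self.fetch = fetch              # fetch(chrom, start, end) -> str
        self.chrom_sizes = chrom_sizes  # {chrom: length in bp}
        self.window = window
        self.max_n_frac = max_n_frac

    def sample(self, rng=random):
        while True:
            chrom = rng.choice(list(self.chrom_sizes))
            start = rng.randrange(self.chrom_sizes[chrom] - self.window)
            seq = self.fetch(chrom, start, start + self.window).upper()
            if seq.count("N") / len(seq) <= self.max_n_frac:
                return chrom, start, seq
```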
Artifacts
data/genome_loader.py
data/utils.py
configs/data.yaml
requirements.txt
Data pipeline working — moved on to defining the core model architecture
6-layer Transformer encoder (512 dim, 8 heads) with rotary positional embeddings for
length generalization
Intent
- Chose Transformer encoder (BERT-style) over decoder — MLM objective better suited for bidirectional
genomic context
- Used rotary positional embeddings (RoPE) instead of absolute — better extrapolation to longer
sequences at inference
- Starting small (6 layers, 512 dim, ~25M params) for rapid iteration before scaling up
Result
- Implemented custom GenomicTransformerEncoder with RoPE in model/transformer.py
- Added pre-norm (LayerNorm before attention/FFN) for training stability
- Unit tests pass: forward pass on batch of 32 × 2,048 tokens in 45ms on A100
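A minimal rotary-embedding sketch follows. The half-split channel pairing below is one common convention; model/rope.py may pair channels differently:

```python
import torch

def apply_rope(x, base=10000.0):
    """x: (batch, seq, dim) with even dim. Rotates each channel pair by
    a position-dependent angle, so relative position is encoded in the
    query-key dot product rather than in an absolute embedding."""
    b, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()       # each (s, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because each pair undergoes a pure rotation, vector norms are preserved and position 0 is left unrotated — both make handy unit tests.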
Artifacts
model/transformer.py
model/rope.py
tests/test_model.py
Architecture defined — needed to implement the self-supervised training
objective
15% random nucleotide masking with 80/10/10 mask/random/keep strategy on 2 kb
windows; cross-entropy loss on masked positions only
Intent
- Standard masked language modeling: mask 15% of input nucleotides, train model to predict the
originals
- Used 80/10/10 strategy (80% [MASK] token, 10% random nucleotide, 10% unchanged) following BERT
- Considered span masking (mask contiguous k-mers) but deferred for simplicity in v1
Result
- Implemented MLMCollator in data/collator.py — handles dynamic masking per batch
- Loss computed only on masked positions using CrossEntropyLoss(ignore_index=-100)
- Verified: random baseline achieves ~1.39 nats loss (−ln(1/4)); model should converge well below this
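The 80/10/10 dynamic masking reduces to roughly the following (a simplified stand-in for MLMCollator; the mask token id and vocab size are placeholders):

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Select mask_prob of positions per batch. Of those: 80% become
    [MASK], 10% a random token, 10% stay unchanged. Labels are -100 at
    all unselected positions so CrossEntropyLoss ignores them."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = -100
    masked = input_ids.clone()
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    masked[to_mask] = mask_token_id
    # half of the remaining 20% of selected positions -> random token
    to_rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    masked[to_rand] = torch.randint(vocab_size, input_ids.shape)[to_rand]
    return masked, labels
```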
Artifacts
data/collator.py
train_pretrain.py
tests/test_collator.py
All components in place — attempted first end-to-end training run to validate
the pipeline
Training crashed with CUDA OOM at batch_size=64, context=4096 nt on A100 80 GB —
full attention O(n²) too expensive
Intent
- End-to-end smoke test: data loader → tokenizer → model → MLM loss → backward pass
- Wanted to verify the full pipeline works before committing to long training runs
Result
- Forward pass succeeded; loss computed correctly (initial loss ~1.38, close to random baseline as
expected)
- OOM on loss.backward() at batch_size=64, sequence_length=4096 — peaked at ~78 GB
- Profiling showed attention matrices account for ~42 GB (self-attention is O(n²) in memory)
- batch_size=8 works but is too small for stable training — need FlashAttention or gradient checkpointing
- Deferred memory optimization to next session
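The profiled ~42 GB is consistent with a quick back-of-envelope (assuming fp16 activations; during backward, several score/probability/gradient buffers of this shape are alive at once):

```python
def attn_matrix_gib(batch, heads, seq_len, bytes_per_el=2):
    """Size in GiB of one (batch, heads, seq, seq) attention tensor.
    At batch 64, 8 heads, 4,096 tokens in fp16 this is 16 GiB per
    buffer, so a few coexisting buffers plausibly reach ~42 GB."""
    return batch * heads * seq_len ** 2 * bytes_per_el / 2 ** 30
```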
Artifacts
train_pretrain.py
logs/oom_profile.txt