alphagenome-fm
📅 Updated Feb 21, 2026 📂 3 sessions 🔢 10 steps
Session 3 · Feb 21
  • Fine-tuned AlphaGenome on ClinVar pathogenicity prediction — AUROC 0.91 on held-out chr8
  • Investigated attention heads; discovered head 7 in layer 4 strongly tracks splice donor/acceptor motifs
  • Implemented gradient-weighted attribution to visualize per-nucleotide importance scores
Session 2 · Feb 20
  • Resolved OOM during training by switching from full attention to FlashAttention-2 and gradient checkpointing
  • Replaced naive 1-mer tokenization with BPE trained on hg38 — vocabulary of 4,096 tokens, ~3.2× compression
  • Trained 150M-param model on chr1–chr7 for 50k steps; validated on chr8 with perplexity 4.21
Session 1 · Feb 19
  • Scaffolded AlphaGenome project: data loaders for hg38 FASTA via pysam, configurable context windows
  • Implemented initial Transformer encoder (6 layers, 512 hidden, 8 heads) with rotary positional embeddings
  • Built masked language modeling (MLM) objective — 15% random nucleotide masking on 2 kb windows
  • First training run OOM'd on A100 at batch size 64 with 4 kb context — deferred to Session 2
📍
Last Working On
Gradient-weighted attribution maps in interpret/attribution.py — visualizing per-nucleotide importance for ClinVar variant predictions
Current State
150M-param model pretrained (perplexity 4.21), fine-tuned on ClinVar pathogenicity (AUROC 0.91 on chr8)
🔜
Next Up
Benchmark against Enformer and Nucleotide Transformer on CADD score prediction; evaluate on non-coding regulatory variants
⚠️
Open Concern
BPE tokenizer may split biologically meaningful motifs (TATA box, splice sites); consider motif-aware tokenization
🧬
Key Decision
Chose BPE tokenization (4,096 vocab) over fixed k-mer — BPE learns biologically adaptive token boundaries and achieves 3.2× compression over character-level, reducing effective sequence length from 4,096 nt to ~1,280 tokens
Session 3 · Feb 21, 2026
#8 Fine-tune on ClinVar pathogenicity Success
Added classification head on pretrained encoder; fine-tuned on ClinVar SNVs with chr8 held out — AUROC 0.91
Intent
  • With pretraining converged, wanted to evaluate the model's learned representations on a real downstream task
  • Chose ClinVar pathogenicity classification as first benchmark — well-studied, clinically relevant, clear labels
  • Needed to build a fine-tuning pipeline: freeze encoder for first 5 epochs, then unfreeze with lower LR
Expected
  • Binary classification (pathogenic vs. benign) for single nucleotide variants
  • Target AUROC > 0.85 to be competitive with existing genomic language models
Result
  • Created tasks/clinvar.py with a linear classification head on the [CLS]-equivalent embedding
  • Extracted ±1 kb flanking context for each variant from hg38; ref and alt sequences as paired inputs
  • Two-phase training: 5 epochs frozen encoder (LR 1e-3), then 15 epochs full fine-tune (LR 5e-5)
  • Achieved AUROC 0.91 on held-out chr8 — above initial target; precision 0.87, recall 0.84
  • Noted class imbalance (3:1 benign:pathogenic); applied focal loss, which improved recall from 0.79 to 0.84
Artifacts
tasks/clinvar.py data/clinvar_snvs.tsv configs/finetune_clinvar.yaml train_finetune.py
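The focal-loss fix for the 3:1 class imbalance can be sketched in scalar form (a minimal illustration of the idea behind the loss in tasks/clinvar.py; the actual batched tensor implementation is not shown in the log, and the alpha/gamma values here are assumptions):

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for a single example.
    p: predicted probability of the pathogenic class; y: label (1 = pathogenic).
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    so the majority benign class stops dominating the gradient."""
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confidently correct benign call (p_t near 1) contributes almost nothing, while a misclassified pathogenic variant keeps nearly its full cross-entropy weight — the mechanism behind the recall improvement from 0.79 to 0.84.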
Strong classification results raised the question of what the model is actually learning; decided to inspect attention patterns for biological interpretability.
#9 Analyze attention heads for biological motifs Success
Head 7 in layer 4 strongly attends to splice donor/acceptor motifs (GT–AG); head 2 in layer 6 tracks TATA-like promoter elements
Intent
  • Wanted to validate that high AUROC reflects biologically meaningful features, not just sequence composition bias
  • Planned to extract attention matrices from each head and correlate with known regulatory annotations from ENCODE
  • Focus on splice sites and promoter motifs as they are well-characterized ground truth
Expected
  • At least some attention heads should show enrichment at known functional motifs
  • If no heads correlate, the model may be overfitting on trivial features
Result
  • Extracted attention weights from all 8 heads × 6 layers across 500 ClinVar variant regions
  • Head L4-H7 shows striking enrichment at canonical splice donor (GT) and acceptor (AG) dinucleotides — Pearson r=0.72 with MaxEntScan scores
  • Head L6-H2 shows weak but consistent enrichment at TATA-box positions in promoter-proximal regions
  • Remaining heads appear to encode positional or GC-content features — no clear biological motif
  • Saved visualizations using matplotlib heatmaps overlaid on sequence logos
Artifacts
interpret/attention_analysis.py notebooks/attention_motifs.ipynb figures/head_L4H7_splice.png
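The head-level test reduces to a simple statistic: how much attention mass lands on annotated motif positions relative to background. A hedged sketch (motif_enrichment is an illustrative name, not necessarily the function in interpret/attention_analysis.py):

```python
def motif_enrichment(attn, motif_positions):
    """Ratio of mean attention received at motif positions (e.g. GT/AG
    dinucleotides) to mean attention over all remaining positions.
    attn: per-position attention mass for one head, aggregated over queries."""
    motif = set(motif_positions)
    in_motif = [attn[i] for i in motif]
    background = [a for i, a in enumerate(attn) if i not in motif]
    return (sum(in_motif) / len(in_motif)) / (sum(background) / len(background))
```

A ratio near 1 means the head ignores the motif; the strong L4-H7 splice-site signal corresponds to a ratio well above background.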
Attention head analysis confirmed biological learning; wanted finer-grained, per-nucleotide importance for individual variant predictions
#10 Implement gradient-weighted attribution In Progress
Implemented the gradient × input method for per-nucleotide importance; now comparing against ISM (in-silico mutagenesis) as ground truth
Intent
  • Attention maps show head-level patterns, but clinicians need per-nucleotide importance for a specific variant
  • Planned two methods: gradient × input (fast) and integrated gradients (more accurate)
  • Needed to validate attributions against ISM (exhaustive single-nt mutation scan) as gold standard
Expected
  • Attribution scores should peak at the variant position and flanking regulatory elements
  • Correlation with ISM scores should be r > 0.6 if attributions are trustworthy
Result
  • Implemented gradient_x_input() — fast, runs in ~200ms per variant
  • Integrated gradients implemented but slow (~8s per variant with 50 interpolation steps)
  • Preliminary comparison on 20 variants: gradient × input vs. ISM correlation r=0.58 — slightly below target
  • Still running full ISM benchmark on 500 variants (estimated 6h on A100)
Artifacts
interpret/attribution.py interpret/ism.py scripts/run_ism_benchmark.sh
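A minimal version of the gradient × input attribution (a sketch of the idea behind interpret/attribution.py, assuming a model that maps a one-hot sequence tensor to a scalar pathogenicity logit; the real signature may differ):

```python
import torch

def gradient_x_input(model, one_hot_seq):
    """Per-nucleotide attribution: d(logit)/d(input) * input, summed over
    the one-hot channel so each position gets one importance score.
    one_hot_seq: (seq_len, 4) float tensor."""
    x = one_hot_seq.clone().detach().requires_grad_(True)
    logit = model(x)                   # scalar logit (assumed interface)
    logit.backward()
    return (x.grad * x).sum(dim=-1)    # (seq_len,) importance scores
```

For a linear model this recovers the weight at each hot position exactly; for the fine-tuned encoder it is only a first-order approximation, which is why the log validates it against exhaustive ISM.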
Session 2 · Feb 20, 2026
#5 Fix OOM with FlashAttention-2 + gradient checkpointing Success
OOM at batch 64 / 4 kb context on A100; FlashAttention-2 + gradient checkpointing reduced peak VRAM from 78 GB to 31 GB
Intent
  • Session 1 ended with OOM crash at batch_size=64 on 4 kb context windows on A100 80 GB
  • Profiling with torch.cuda.memory_summary() showed attention matrices consuming ~42 GB for full O(n²) attention
  • Two strategies: (1) FlashAttention-2 for memory-efficient attention, (2) gradient checkpointing to trade compute for memory
Result
  • Installed flash-attn==2.5.6; replaced nn.MultiheadAttention with flash_attn.flash_attn_func
  • Enabled torch.utils.checkpoint.checkpoint() on every other transformer block
  • Peak VRAM dropped from ~78 GB to ~31 GB; batch_size=64 now fits comfortably on A100 80 GB
  • Training throughput decreased ~15% due to checkpointing recomputation — acceptable trade-off
Artifacts
model/transformer.py requirements.txt configs/train_pretrain.yaml
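The "every other block" checkpointing can be sketched as follows (a simplified stand-in for model/transformer.py — block internals and the FlashAttention call are omitted; use_reentrant=False is the currently recommended mode):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Runs a list of blocks; every other block recomputes its activations
    during backward instead of storing them (compute-for-memory trade)."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % 2 == 1:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```

Checkpointing alternate blocks roughly halves stored activations at the cost of recomputing them once in backward — consistent with the ~15% throughput hit the log reports.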
OOM resolved — before launching full pretraining, reconsidered whether naive character-level tokenization was optimal for long genomic sequences
#6 BPE tokenizer trained on hg38 genome Success
Replaced 1-mer tokenization with BPE (vocab 4,096) trained on hg38 — 3.2× compression, reducing 4 kb sequences to ~1,280 tokens
Intent
  • Character-level tokenization (A/C/G/T) creates very long sequences (4,096 tokens for 4 kb) — expensive even with FlashAttention
  • Hypothesis: BPE can learn biologically meaningful subword units (common k-mers, repeat elements) while compressing sequence length
  • Considered alternatives: fixed 6-mer (4,096 vocab) vs. BPE — BPE can adapt to genome-specific frequency distributions
Result
  • Trained BPE tokenizer using tokenizers library on chr1–chr22 of hg38 (~2.9 Gbp)
  • Final vocabulary: 4,096 tokens; average compression ratio 3.2× (4,096 nt → ~1,280 tokens)
  • Inspected top tokens: many correspond to common dinucleotide repeats (CpG, CA/TG), Alu elements, and poly-A stretches
  • Concern noted: some canonical splice motifs (GT-AG) get split across token boundaries — may need motif-aware constraints
Artifacts
tokenizer/train_bpe.py tokenizer/bpe_4096.json data/genome_loader.py
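The BPE mechanics — and the motif-splitting concern — are visible in a toy merge loop (illustrative only; the real tokenizer in tokenizer/bpe_4096.json was trained with the `tokenizers` library, not this code):

```python
from collections import Counter

def bpe_merges(seq, n_merges):
    """Greedy BPE on one sequence: repeatedly merge the most frequent
    adjacent token pair into a single token."""
    tokens = list(seq)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

On repeat-rich sequence the compression is immediate — which is why the learned vocabulary is dominated by CA/TG repeats and poly-A runs — but a frequency-driven merge can just as easily straddle a GT or AG splice boundary, which is the motif-splitting concern noted above.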
Tokenizer ready and memory issues resolved — now had all components to launch the full pretraining run
#7 Pretrain 150M model on chr1–chr7 (50k steps) Success
50k steps of MLM pretraining on chromosomes 1–7; validation perplexity on chr8 converged to 4.21
Intent
  • All components ready: FlashAttention model, BPE tokenizer, data loaders — time for full pretraining
  • Used chr1–chr7 as training set (~1.3 Gbp), chr8 as validation (~145 Mbp), chr9 held out for final test
  • Target: perplexity < 5.0 on validation, competitive with Nucleotide Transformer (reported ~4.5 at 150M scale)
Result
  • Training completed in ~14 hours on 4× A100 with DeepSpeed ZeRO-2
  • Final validation perplexity: 4.21 — below the initial target, though note that perplexities over different tokenizations (BPE vs. k-mer vs. character-level) are not directly comparable, since the prediction unit differs
  • Loss curves smooth; no signs of overfitting (train perplexity 3.94 vs. val 4.21)
  • Saved checkpoints every 10k steps to checkpoints/pretrain/
  • WandB run logged: learning rate schedule (linear warmup 2k steps, cosine decay to 1e-5)
Artifacts
train_pretrain.py configs/train_pretrain.yaml checkpoints/pretrain/step_50000/ scripts/launch_distributed.sh
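The logged schedule (linear warmup for 2k steps, cosine decay to 1e-5 over 50k steps) can be written down directly; the peak LR is not recorded in the log, so 3e-4 below is an assumption:

```python
import math

def lr_at(step, total_steps=50_000, warmup=2_000, peak=3e-4, floor=1e-5):
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total_steps - warmup)   # 0 -> 1 over decay phase
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))
```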
Session 1 · Feb 19, 2026
#1 Scaffold project + hg38 data loaders Success
Set up AlphaGenome project structure; built FASTA data loader with pysam for streaming hg38 windows with configurable context length
Intent
  • Starting a new genomic foundation model project inspired by Nucleotide Transformer and DNABERT-2
  • Needed efficient data pipeline for sampling random genomic windows from hg38 reference genome
  • Required: streaming reads (genome too large to load in memory), configurable window size, N-masking for assembly gaps
Result
  • Created project structure: model/, data/, tokenizer/, tasks/, configs/, scripts/
  • Built GenomeDataset class using pysam.FastaFile for random-access reads from indexed FASTA
  • Supports configurable window sizes (512 nt to 8 kb), chromosome filtering, and N-base exclusion
  • DataLoader with 8 workers achieves ~12k windows/sec throughput on NVMe SSD
Artifacts
data/genome_loader.py data/utils.py configs/data.yaml requirements.txt
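The core of the window sampler — random access plus N-gap rejection — looks roughly like this (seq_fetch abstracts pysam.FastaFile.fetch so the sketch stays self-contained; the 10% rejection threshold is an assumption, since the log describes GenomeDataset's policy only as "N-base exclusion"):

```python
import random

def sample_window(seq_fetch, chrom_len, window, max_n_frac=0.1,
                  max_tries=100, rng=random):
    """Draw a random window from one chromosome, rejecting windows whose
    fraction of assembly-gap 'N' bases exceeds max_n_frac.
    seq_fetch(start, end) returns the reference sequence for that interval
    (in the real loader, a pysam.FastaFile.fetch call on indexed FASTA)."""
    for _ in range(max_tries):
        start = rng.randrange(0, chrom_len - window)
        seq = seq_fetch(start, start + window).upper()
        if seq.count("N") / window <= max_n_frac:
            return start, seq
    raise RuntimeError("could not find a low-N window")
```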
Data pipeline working — moved on to defining the core model architecture
#2 Implement Transformer encoder with RoPE Success
6-layer Transformer encoder (512 dim, 8 heads) with rotary positional embeddings for length generalization
Intent
  • Chose Transformer encoder (BERT-style) over decoder — MLM objective better suited for bidirectional genomic context
  • Used rotary positional embeddings (RoPE) instead of absolute — better extrapolation to longer sequences at inference
  • Starting small (6 layers, 512 dim, ~25M params) for rapid iteration before scaling up
Result
  • Implemented custom GenomicTransformerEncoder with RoPE in model/transformer.py
  • Added pre-norm (LayerNorm before attention/FFN) for training stability
  • Unit tests pass: forward pass on batch of 32 × 2,048 tokens in 45ms on A100
Artifacts
model/transformer.py model/rope.py tests/test_model.py
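The relative-position property that motivated RoPE can be checked with a scalar sketch (illustrative; model/rope.py presumably operates on batched tensors):

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of a query/key vector
    by position-dependent angles theta_i = pos / base**(i/d). After rotation,
    q·k depends only on the offset between the two positions — the property
    behind RoPE's better extrapolation to longer sequences."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

The test below verifies the defining invariant: shifting both positions by the same amount leaves the dot product unchanged.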
Architecture defined — needed to implement the self-supervised training objective
#3 Build MLM training objective Success
15% random nucleotide masking with 80/10/10 mask/random/keep strategy on 2 kb windows; cross-entropy loss on masked positions only
Intent
  • Standard masked language modeling: mask 15% of input nucleotides, train model to predict the originals
  • Used 80/10/10 strategy (80% [MASK] token, 10% random nucleotide, 10% unchanged) following BERT
  • Considered span masking (mask contiguous k-mers) but deferred for simplicity in v1
Result
  • Implemented MLMCollator in data/collator.py — handles dynamic masking per batch
  • Loss computed only on masked positions using CrossEntropyLoss(ignore_index=-100)
  • Verified: random baseline achieves ~1.39 nats loss (−ln(1/4)); model should converge well below this
Artifacts
data/collator.py train_pretrain.py tests/test_collator.py
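The 80/10/10 dynamic masking in data/collator.py reduces to this per-position rule (a sketch on raw nucleotide strings; the real collator works on token IDs and marks unmasked positions with ignore_index=-100 rather than None):

```python
import random

NUCLEOTIDES = "ACGT"

def mlm_mask(tokens, mask_frac=0.15, rng=random):
    """BERT-style masking: select ~15% of positions for prediction; of those,
    80% become [MASK], 10% a random nucleotide, 10% stay unchanged.
    Returns (corrupted input, labels) with labels None where no loss applies."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_frac:
            continue
        labels[i] = tok                    # loss is computed here only
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = rng.choice(NUCLEOTIDES)
        # else: keep the original token (the 10% "unchanged" case)
    return inputs, labels
```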
All components in place — attempted first end-to-end training run to validate the pipeline
#4 First training run — OOM crash Failed
Training crashed with CUDA OOM at batch_size=64, context=4096 nt on A100 80 GB — full attention O(n²) too expensive
Intent
  • End-to-end smoke test: data loader → tokenizer → model → MLM loss → backward pass
  • Wanted to verify the full pipeline works before committing to long training runs
Result
  • Forward pass succeeded; loss computed correctly (initial loss ~1.38, close to random baseline as expected)
  • OOM on loss.backward() at batch_size=64, sequence_length=4096 — peaked at ~78 GB
  • Profiling showed attention matrices account for ~42 GB (self-attention is O(n²) in memory)
  • batch_size=8 fits in memory but is too small for stable training — need FlashAttention or gradient checkpointing
  • Deferred memory optimization to next session
Artifacts
train_pretrain.py logs/oom_profile.txt
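Back-of-envelope arithmetic shows why full attention blew the budget (this reproduces the scaling, not the exact ~42 GB profile, which depends on dtype, layer count, and which tensors autograd saves):

```python
def attn_matrix_gib(batch, heads, seq_len, bytes_per_elem=2):
    """GiB for ONE materialized (seq_len x seq_len) attention matrix per
    head across the batch — fp16 by default (bytes_per_elem=2)."""
    return batch * heads * seq_len ** 2 * bytes_per_elem / 2**30
```

At batch 64, 8 heads, and 4,096 tokens, a single fp16 score matrix is already 16 GiB, so a handful of such tensors kept live for backward across the 6 layers exhausts an 80 GB card. FlashAttention-2 never materializes the matrix at all, removing the n² memory term.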
🧪
CADD Score Benchmark
Need to benchmark variant effect prediction against CADD v1.7 and Enformer on non-coding regulatory variants.
→ Download CADD annotations, implement evaluation script, compare AUROC/AUPRC
🔤
Motif-Aware Tokenization
BPE tokenizer sometimes splits known biological motifs (TATA box, splice sites). Consider adding motif constraints or a hybrid tokenizer.
→ Compile JASPAR motif database, add "never-split" rules to BPE, retrain tokenizer
📈
Scale to 500M Parameters
Current 150M model shows promising results. Literature suggests genomic LMs benefit significantly from scaling (Nucleotide Transformer: 2.5B).
→ Design 500M config (12 layers, 1024 dim, 16 heads), estimate compute budget on 8× A100 node
🧬
Enhancer Activity Prediction Task
ENCODE STARR-seq data provides quantitative enhancer activity labels. Good second fine-tuning task to test regression capabilities.
→ Process STARR-seq data from ENCODE, build regression head, fine-tune and evaluate Pearson r