Session 3 · Feb 21, 2026
Added classification head on pretrained encoder; fine-tuned on ClinVar SNVs with
chr8 held out — AUROC 0.91
Intent
- With pretraining converged, wanted to evaluate the model's learned representations on a real
downstream task
- Chose ClinVar pathogenicity classification as first benchmark — well-studied, clinically relevant,
clear labels
- Needed to build a fine-tuning pipeline: freeze encoder for first 5 epochs, then unfreeze with lower
LR
Expected
- Binary classification (pathogenic vs. benign) for single nucleotide variants
- Target AUROC > 0.85 to be competitive with existing genomic language models
Result
- Created tasks/clinvar.py with a linear classification head on the [CLS]-equivalent embedding
- Extracted ±1 kb flanking context for each variant from hg38; ref and alt sequences as paired inputs
- Two-phase training: 5 epochs frozen encoder (LR 1e-3), then 15 epochs full fine-tune (LR 5e-5)
- Achieved AUROC 0.91 on held-out chr8 — above initial target; precision 0.87, recall 0.84
- Noted class imbalance (3:1 benign:pathogenic); applied focal loss which improved recall from 0.79 to
0.84
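The focal-loss fix above can be sketched as follows. This is a minimal binary focal loss; the α/γ defaults are illustrative assumptions, not the values recorded in the run:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Focal loss for imbalanced binary classification: down-weights
    well-classified examples so the rarer pathogenic class contributes
    more to the gradient. alpha/gamma are illustrative defaults."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With gamma=0 and alpha=0.5 this reduces to half the ordinary BCE, which makes for a quick sanity check.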
Artifacts
tasks/clinvar.py
data/clinvar_snvs.tsv
configs/finetune_clinvar.yaml
train_finetune.py
Strong classification results raised the question: what is the model actually
learning? Decided to inspect attention patterns for biological interpretability
Head 7 in layer 4 strongly attends to splice donor/acceptor motifs (GT–AG); head 2
in layer 6 tracks TATA-like promoter elements
Intent
- Wanted to validate that high AUROC reflects biologically meaningful features, not just sequence
composition bias
- Planned to extract attention matrices from each head and correlate with known regulatory annotations
from ENCODE
- Focus on splice sites and promoter motifs as they are well-characterized ground truth
Expected
- At least some attention heads should show enrichment at known functional motifs
- If no heads correlate, the model may be overfitting on trivial features
Result
- Extracted attention weights from all 8 heads × 6 layers across 500 ClinVar variant regions
- Head L4-H7 shows striking enrichment at canonical splice donor (GT) and acceptor (AG) dinucleotides — Pearson r=0.72 with MaxEntScan scores
- Head L6-H2 shows weak but consistent enrichment at TATA-box positions in promoter-proximal regions
- Remaining heads appear to encode positional or GC-content features — no clear biological motif
- Saved visualizations using matplotlib heatmaps overlaid on sequence logos
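The per-head enrichment test reduces to something like the sketch below (assumed shapes; the real analysis in interpret/attention_analysis.py presumably loops this over all 8 × 6 layer/head pairs and averages over regions):

```python
import numpy as np

def motif_attention_correlation(attn, motif_scores):
    """attn: (seq, seq) attention matrix for one head on one region.
    motif_scores: (seq,) per-position motif strength (e.g. MaxEntScan).
    Correlates the attention each position *receives* (column mean over
    queries) with the motif score at that position."""
    received = attn.mean(axis=0)
    return float(np.corrcoef(received, motif_scores)[0, 1])
```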
Artifacts
interpret/attention_analysis.py
notebooks/attention_motifs.ipynb
figures/head_L4H7_splice.png
Attention head analysis confirmed biological learning; wanted finer-grained,
per-nucleotide importance for individual variant predictions
Implemented gradient × input and integrated gradients for per-nucleotide importance; working on
comparing against ISM (in-silico mutagenesis) as ground truth
Intent
- Attention maps show head-level patterns, but clinicians need per-nucleotide importance for a
specific variant
- Planned two methods: gradient × input (fast) and integrated gradients (more accurate)
- Needed to validate attributions against ISM (exhaustive single-nt mutation scan) as gold standard
Expected
- Attribution scores should peak at the variant position and flanking regulatory elements
- Correlation with ISM scores should be r > 0.6 if attributions are trustworthy
Result
- Implemented gradient_x_input() — fast, runs in ~200ms per variant
- Integrated gradients implemented but slow (~8s per variant with 50 interpolation steps)
- Preliminary comparison on 20 variants: gradient × input vs. ISM correlation r=0.58 — slightly below
target
- Still running full ISM benchmark on 500 variants (estimated 6h on A100)
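The fast attribution amounts to the sketch below. It assumes a model mapping a one-hot sequence tensor to a scalar logit; the actual gradient_x_input() in interpret/attribution.py may differ in interface:

```python
import torch

def grad_x_input(model, one_hot):
    """one_hot: (1, seq_len, 4) one-hot encoded sequence.
    Per-nucleotide importance: gradient of the output logit w.r.t. the
    input, multiplied elementwise by the input and summed over the
    4-channel axis, leaving one score per position."""
    x = one_hot.clone().requires_grad_(True)
    model(x).sum().backward()
    return (x.grad * x).sum(dim=-1)  # (1, seq_len)
```

For a linear model the attribution at each position recovers exactly the weight of the nucleotide present there, which is a convenient correctness check.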
Artifacts
interpret/attribution.py
interpret/ism.py
scripts/run_ism_benchmark.sh
Session 2 · Feb 20, 2026
OOM at batch 64 / 4 kb context on A100; FlashAttention-2 + gradient checkpointing
reduced peak VRAM from 78 GB to 31 GB
Intent
- Session 1 ended with OOM crash at batch_size=64 on 4 kb context windows on A100 80 GB
- Profiling with torch.cuda.memory_summary() showed attention matrices consuming ~42 GB for full O(n²) attention
- Two strategies: (1) FlashAttention-2 for memory-efficient attention, (2) gradient checkpointing to
trade compute for memory
Result
- Installed flash-attn==2.5.6; replaced nn.MultiheadAttention with flash_attn.flash_attn_func
- Enabled torch.utils.checkpoint.checkpoint() on every other transformer block
- Peak VRAM dropped from ~78 GB to ~31 GB; batch_size=64 now fits comfortably on A100 80 GB
- Training throughput decreased ~15% due to checkpointing recomputation — acceptable trade-off
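The every-other-block checkpointing pattern can be sketched as below (the FlashAttention swap itself is kernel-specific and omitted; block contents here are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Runs every other block under activation checkpointing, so those
    blocks' intermediate activations are recomputed during backward
    instead of stored — trading some throughput for VRAM."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % 2 == 0:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```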
Artifacts
model/transformer.py
requirements.txt
configs/train_pretrain.yaml
OOM resolved — before launching full pretraining, reconsidered whether naive
character-level tokenization was optimal for long genomic sequences
Replaced 1-mer tokenization with BPE (vocab 4,096) trained on hg38 — 3.2×
compression, reducing 4 kb sequences to ~1,280 tokens
Intent
- Character-level tokenization (A/C/G/T) creates very long sequences (4,096 tokens for 4 kb) —
expensive even with FlashAttention
- Hypothesis: BPE can learn biologically meaningful subword units (common k-mers, repeat elements)
while compressing sequence length
- Considered alternatives: fixed 6-mer (4,096 vocab) vs. BPE — BPE can adapt to genome-specific
frequency distributions
Result
- Trained BPE tokenizer with the tokenizers library on chr1–chr22 of hg38 (3.1 Gbp)
- Final vocabulary: 4,096 tokens; average compression ratio 3.2× (4,096 nt → ~1,280 tokens)
- Inspected top tokens: many correspond to common dinucleotide repeats (CpG, CA/TG), Alu elements, and
poly-A stretches
- Concern noted: some canonical splice motifs (GT-AG) get split across token
boundaries — may need motif-aware constraints
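The merge logic behind the compression is, in miniature, the following toy greedy BPE on a single string (the actual tokenizer was trained with the tokenizers library over hg38, not this sketch):

```python
from collections import Counter

def learn_bpe_merges(seq, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.
    Common repeats (CA/TG dinucleotides, poly-A runs) get fused into
    single tokens, which is where the sequence compression comes from."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair worth merging
            break
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges
```

The same mechanism explains the noted concern: a merge boundary can land inside a GT-AG splice motif, splitting it across tokens.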
Artifacts
tokenizer/train_bpe.py
tokenizer/bpe_4096.json
data/genome_loader.py
Tokenizer ready and memory issues resolved — now had all components to launch
the full pretraining run
50k steps of MLM pretraining on chromosomes 1–7; validation perplexity on chr8
converged to 4.21
Intent
- All components ready: FlashAttention model, BPE tokenizer, data loaders — time for full pretraining
- Used chr1–chr7 as training set (~1.3 Gbp), chr8 as validation (~145 Mbp), chr9 held out for final
test
- Target: perplexity < 5.0 on validation, competitive with Nucleotide Transformer (reported ~4.5 at
150M scale)
Result
- Training completed in ~14 hours on 4× A100 with DeepSpeed ZeRO-2
- Final validation perplexity: 4.21 — better than initial target, likely due to BPE compression
reducing effective sequence complexity
- Loss curves smooth; no signs of overfitting (train perplexity 3.94 vs. val 4.21)
- Saved checkpoints every 10k steps to checkpoints/pretrain/
- WandB run logged: learning rate schedule (linear warmup 2k steps, cosine decay to 1e-5)
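The logged schedule (linear warmup over 2k steps, cosine decay to 1e-5) corresponds to roughly the function below; the peak LR is an assumption, since the log records only the warmup length and the floor:

```python
import math

def lr_at_step(step, total_steps=50_000, warmup=2_000, peak=1e-4, floor=1e-5):
    """Linear warmup to `peak`, then cosine decay to `floor`.
    `peak` is a placeholder value, not taken from the run config."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total_steps - warmup)  # 0 → 1 over decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```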
Artifacts
train_pretrain.py
configs/train_pretrain.yaml
checkpoints/pretrain/step_50000/
scripts/launch_distributed.sh
Session 1 · Feb 19, 2026
Set up AlphaGenome project structure; built FASTA data loader with pysam for
streaming hg38 windows with configurable context length
Intent
- Starting a new genomic foundation model project inspired by Nucleotide Transformer and DNABERT-2
- Needed efficient data pipeline for sampling random genomic windows from hg38 reference genome
- Required: streaming reads (genome too large to load in memory), configurable window size, N-masking
for assembly gaps
Result
- Created project structure: model/, data/, tokenizer/, tasks/, configs/, scripts/
- Built GenomeDataset class using pysam.FastaFile for random-access reads from indexed FASTA
- Supports configurable window sizes (512 nt to 8 kb), chromosome filtering, and N-base exclusion
- DataLoader with 8 workers achieves ~12k windows/sec throughput on NVMe SSD
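The sampling logic can be sketched independently of pysam (hypothetical interface; the real GenomeDataset would back `fetch` with pysam.FastaFile.fetch on the indexed hg38 FASTA):

```python
import random

class GenomeWindowSampler:
    """Rejection-samples random genomic windows, skipping windows whose
    fraction of N bases (assembly gaps) exceeds max_n_frac."""
    def __init__(self, fetch, chrom_sizes, window=2048, max_n_frac=0.1):
        self.fetch = fetch              # fetch(chrom, start, end) -> str
        self.chrom_sizes = chrom_sizes  # {chrom: length in bp}
        self.window = window
        self.max_n_frac = max_n_frac

    def sample(self, rng=random):
        while True:
            chrom = rng.choice(list(self.chrom_sizes))
            start = rng.randrange(self.chrom_sizes[chrom] - self.window)
            seq = self.fetch(chrom, start, start + self.window).upper()
            if seq.count("N") / len(seq) <= self.max_n_frac:
                return chrom, start, seq
```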
Artifacts
data/genome_loader.py
data/utils.py
configs/data.yaml
requirements.txt
Data pipeline working — moved on to defining the core model architecture
6-layer Transformer encoder (512 dim, 8 heads) with rotary positional embeddings for
length generalization
Intent
- Chose Transformer encoder (BERT-style) over decoder — MLM objective better suited for bidirectional
genomic context
- Used rotary positional embeddings (RoPE) instead of absolute — better extrapolation to longer
sequences at inference
- Starting small (6 layers, 512 dim, ~25M params) for rapid iteration before scaling up
Result
- Implemented custom GenomicTransformerEncoder with RoPE in model/transformer.py
- Added pre-norm (LayerNorm before attention/FFN) for training stability
- Unit tests pass: forward pass on batch of 32 × 2,048 tokens in 45ms on A100
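A minimal rotary-embedding sketch follows. The half-split channel pairing below is one common convention; model/rope.py may pair channels differently:

```python
import torch

def apply_rope(x, base=10000.0):
    """x: (batch, seq, dim) with even dim. Rotates each channel pair by
    a position-dependent angle, so relative position is encoded in the
    query-key dot product rather than in an absolute embedding."""
    b, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()       # each (s, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because each pair undergoes a pure rotation, vector norms are preserved and position 0 is left unrotated — both make handy unit tests.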
Artifacts
model/transformer.py
model/rope.py
tests/test_model.py
Architecture defined — needed to implement the self-supervised training
objective
15% random nucleotide masking with 80/10/10 mask/random/keep strategy on 2 kb
windows; cross-entropy loss on masked positions only
Intent
- Standard masked language modeling: mask 15% of input nucleotides, train model to predict the
originals
- Used 80/10/10 strategy (80% [MASK] token, 10% random nucleotide, 10% unchanged) following BERT
- Considered span masking (mask contiguous k-mers) but deferred for simplicity in v1
Result
- Implemented MLMCollator in data/collator.py — handles dynamic masking per batch
- Loss computed only on masked positions using CrossEntropyLoss(ignore_index=-100)
- Verified: random baseline achieves ~1.39 nats loss (−ln(1/4)); model should converge well below this
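The 80/10/10 dynamic masking reduces to roughly the following (a simplified stand-in for MLMCollator; the mask token id and vocab size are placeholders):

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Select mask_prob of positions per batch. Of those: 80% become
    [MASK], 10% a random token, 10% stay unchanged. Labels are -100 at
    all unselected positions so CrossEntropyLoss ignores them."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = -100
    masked = input_ids.clone()
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    masked[to_mask] = mask_token_id
    # half of the remaining 20% of selected positions -> random token
    to_rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    masked[to_rand] = torch.randint(vocab_size, input_ids.shape)[to_rand]
    return masked, labels
```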
Artifacts
data/collator.py
train_pretrain.py
tests/test_collator.py
All components in place — attempted first end-to-end training run to validate
the pipeline
Training crashed with CUDA OOM at batch_size=64, context=4096 nt on A100 80 GB —
full attention O(n²) too expensive
Intent
- End-to-end smoke test: data loader → tokenizer → model → MLM loss → backward pass
- Wanted to verify the full pipeline works before committing to long training runs
Result
- Forward pass succeeded; loss computed correctly (initial loss ~1.38, close to random baseline as
expected)
- OOM on loss.backward() at batch_size=64, sequence_length=4096 — peaked at ~78 GB
- Profiling showed attention matrices account for ~42 GB (self-attention is O(n²) in memory)
- batch_size=8 works but is too small for stable training — need FlashAttention or gradient checkpointing
- Deferred memory optimization to next session
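The profiled ~42 GB is consistent with a quick back-of-envelope (assuming fp16 activations; during backward, several score/probability/gradient buffers of this shape are alive at once):

```python
def attn_matrix_gib(batch, heads, seq_len, bytes_per_el=2):
    """Size in GiB of one (batch, heads, seq, seq) attention tensor.
    At batch 64, 8 heads, 4,096 tokens in fp16 this is 16 GiB per
    buffer, so a few coexisting buffers plausibly reach ~42 GB."""
    return batch * heads * seq_len ** 2 * bytes_per_el / 2 ** 30
```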
Artifacts
train_pretrain.py
logs/oom_profile.txt