
Llama-3.2-1B Verification Subsystem

Two ways to look at the production NPU2 inference pipeline, both comparing against HuggingFace transformers in bf16. Companion to IMPLEMENTATION_GUIDE.html Part C.

Two lenses, one bf16 reference

make verify [MODEL=instruct|base] — the industry-standard correctness gate. 2 prompts × 32 greedy tokens (fast CI gate; make verify-full runs the full 8-prompt sweep), top-5 set inclusion vs HuggingFace transformers bf16 on the NPU end-to-end production path (NPU FlashAttention on, no CPU attention fallback). Lite-mode runners — no inside probing. ~2 minutes / run (verify-full: ~6 minutes). Default MODEL=instruct matches what production stacks deploy.

make diagnosis [MODEL=...] — the inside-probing lens. Single prompt's prefill, per-layer ffn_out cosine + max_abs (NPU vs HF bf16) for all 16 layers. Same end-to-end NPU production path as verify (NPU FlashAttention on). Informational only — diagnosis never fails the run. The verify gate is the correctness signal; this table is what you read by hand when verify flags an issue and you need to localize. ~2 minutes / run.

Why two lenses? verify answers "would this model deploy" using the exact criterion industry uses to qualify a BF16 LLM for production — discrete top-k judgment that is robust to bf16 ULP noise. diagnosis gives localization: a continuous-cosine table per layer that tells you where the NPU implementation drifts most from HF. The verify gate gates; the diagnosis lens informs.

Latest results (2026-05-15):

make verify MODEL=instruct: 8/8 PASS, ~3m41s
make verify MODEL=base: 8/8 PASS, ~3m39s
make diagnosis MODEL=instruct (NPU FA on): cos_p5 in [0.926, 0.993], U-shape with single L1-L2 dip and L10 peak.
make diagnosis MODEL=base (NPU FA on): cos_p5 in [0.929, 0.992], double-dip shape (L1-L3 and L12-L14). How the per-layer shape depends on prompt and fine-tune is what diagnosis surfaces; both checkpoints pass verify regardless. See Part B.

A. `make verify` — the correctness gate

The check (mirrors vLLM's `check_logprobs_close`)

Each runner (NPU + HF bf16) greedy-decodes 32 tokens for one prompt, capturing the chosen token + top-5 token IDs at every step.
Walk both sequences in lockstep. Same chosen token → continue. Different chosen tokens → require both to appear in the OTHER side's top-5; otherwise FAIL. Stop walking after the first divergence.
All prompts in the run must pass; any FAIL exits with code 1. make verify runs 2 prompts (fast CI gate); make verify-full runs the full 8.
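
The lockstep walk above can be sketched in a few lines of Python. This is an illustration only: `topk_set_check_sketch` and `TopkCheckResult` are hypothetical names, and the return shape is an assumption, not the actual `compute_topk_set_check` in `verify/comparators.py`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TopkCheckResult:
    status: str                    # "OK" or "FAIL"
    diverge_step: Optional[int]    # 0-based step of the first divergence, None if none
    npu_rank_in_hf: Optional[int]  # 1-based rank of NPU's token in HF's top-k
    hf_rank_in_npu: Optional[int]  # 1-based rank of HF's token in NPU's top-k


def _rank(token: int, topk: Sequence[int]) -> Optional[int]:
    """Return the 1-based rank of token inside topk, or None if it is absent."""
    topk = list(topk)
    return topk.index(token) + 1 if token in topk else None


def topk_set_check_sketch(npu_chosen, npu_topk, hf_chosen, hf_topk, k=5):
    """Walk both sequences in lockstep and judge only the first divergence."""
    for step, (n_tok, h_tok) in enumerate(zip(npu_chosen, hf_chosen)):
        if n_tok == h_tok:
            continue  # same chosen token: keep walking
        n_rank = _rank(n_tok, hf_topk[step][:k])
        h_rank = _rank(h_tok, npu_topk[step][:k])
        ok = n_rank is not None and h_rank is not None
        return TopkCheckResult("OK" if ok else "FAIL", step, n_rank, h_rank)
    # Every step matched within the sample.
    return TopkCheckResult("OK", None, None, None)
```

With toy token IDs, a #2/#2 swap at step 2 yields `status == "OK"` with both ranks equal to 2; removing the NPU token from HF's top-5 at that step turns the same input into `FAIL`.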

NPU runs the full production path (GEMV + RMSNorm + RoPE + FlashAttention + LM-head GEMV). Discrete top-k inclusion is robust to bf16 ULP noise: noise routinely flips per-step top-1 between mathematically equivalent implementations but rarely displaces a token from the top-5.

Two prompt sets, matched to checkpoint behavior

#	Base (`verify/prompts/base.txt`)	Instruct (`verify/prompts/instruct.txt`)
0	`GPU stands for`	`Introduce me what is GPU`
1	`The capital of France is`	`Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.`
2	`Artificial intelligence is a branch of computer science that`	`Compare and contrast artificial intelligence with human intelligence in terms of processing information.`
3	`A neural network consists of`	`Describe the basic components of a neural network and how it can be trained.`
4	`Once upon a time, there was a robot who dreamed about`	`Write a short story about a robot that dreams for the first time.`
5	`The COVID-19 pandemic, which began in late 2019,`	`Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.`
6	`The Mona Lisa was painted by`	`Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.`
7	`The French translation of "The early bird catches the worm" is`	`Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'`

Topics deliberately mirror each other so base-vs-instruct comparisons read naturally row-by-row. Base prompts are intentionally incomplete sentences (the base model continues raw text rather than answering instructions). Instruct prompts are imperative requests (7 verbatim from vllm/tests/prompts/example.txt + 1 swapped for project relevance).

Per-prompt results (NPU vs HF bf16, k=5)

For each prompt we display the first divergence step (0-based; step 0 is the prefill prediction, step 1 is the first decode token); each side's chosen token at that step (decoded text, quoted so leading whitespace stays visible) plus its 1-based rank in the OTHER runner's top-5; and the agreed prefix — the actual generated text both runners produced identically before splitting.

Base checkpoint

#	Prompt	Diverge	NPU choice (rank in HF)	HF choice (rank in NPU)	Agreed prefix
0	`GPU stands for`	7	✓ `" special" (#2)`	✓ `" specialized" (#2)`	`" Graphics Processing Unit. It is a"`
1	`The capital of France is`	1	✓ `"," (#2)`	✓ `"." (#2)`	`" Paris"`
2	`Artificial intelligence is…`	7	✓ `"," (#2)`	✓ `"." (#2)`	`" deals with the creation of intelligent machines"`
3	`A neural network consists of`	3	✓ `" nodes" (#2)`	✓ `" interconnected" (#3)`	`" a set of"`
4	`Once upon a time, there was a robot…`	7	✓ `" little" (#2)`	✓ `" robot" (#2)`	`" being a human. He was a"`
5	`The COVID-19 pandemic…`	9	✓ `"," (#2)`	✓ `"." (#2)`	`" has had a significant impact on the global economy"`
6	`The Mona Lisa was painted by`	7	✓ `" and" (#2)`	✓ `"." (#3)`	`" Leonardo da Vinci in 1503"`
7	`The French translation…`	6	✓ `" prend" (#3)`	✓ `" g" (#2)`	`" "Le premier oisif"`

Instruct checkpoint

#	Prompt	Diverge	NPU choice (rank in HF)	HF choice (rank in NPU)	Agreed prefix
0	`Introduce me what is GPU`	0	✓ `" acceleration" (#2)`	✓ `" (" (#2)`	(no prefix)
1	`Briefly describe…`	0	✓ `" Some" (#4)`	✓ `" Key" (#3)`	(no prefix)
2	`Compare and contrast…`	8	✓ `" (" (#4)`	✓ `" are" (#2)`	`" Artificial intelligence (AI) and human intelligence"`
3	`Describe the basic components…`	20	✓ `" multiple" (#2)`	✓ `" three" (#2)`	`" \n\n## Step 1: Define the basic components of a neural network\nA neural network consists of"`
4	`Write a short story…`	11	✓ `" model" (#3)`	✓ `" android" (#2)`	`" It's a robot named Zeta, a highly advanced"`
5	`Analyze the impact of COVID…`	—	✓ (all 32 match)	✓ (all 32 match)	(no divergence within sample)
6	`Explain the cultural significance…`	29	✓ `" Created" (#4)`	✓ `" It" (#2)`	`" \n\nThe Mona Lisa, painted by Leonardo da Vinci in the early 16th century, is one of the most famous paintings in the world."`
7	`Translate the following…`	26	✓ `" Here" (#2)`	✓ `"The" (#2)`	`" This is a common English idiom that means…"`

Both checkpoints PASS the gate. Most divergences are #2/#2 swaps (both runners agreed on the same two top candidates; bf16 noise picked which ranked first); the rest involve a #3 or #4 rank on one side or both. None fall outside the top-5. On Instruct, prompts 3, 6, and 7 reach 20-29 tokens of agreement before splitting, and prompt 5 had zero divergence in the 32-token sample.

B. `make diagnosis` — the inside lens

What it does

Single prompt's prefill on NPU + HF bf16, then computes per-position cosine + element-wise abs error for each layer's ffn_out (the block output). For layers 0…n_layers−2, both sides expose the raw layer output. For layer n_layers−1, both sides expose the post-final-RMSNorm hidden state — HF surfaces this as hidden_states[n_layers] (post-norm by HF v5.3 convention); NPU produces the equivalent via the same final_norm step it does inside its production LM-head GEMV path. So both L15 cells correspond to "the value the LM-head sees".
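
A minimal sketch of how the four table columns could be computed for one layer, in pure Python. It assumes `cos_p5` is the 5th percentile of the per-position cosines (the text does not define it) and that each side's ffn_out is a list of per-position vectors. `layer_metrics_sketch` and its helpers are hypothetical names, not the helpers in `verify/comparators.py`.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def layer_metrics_sketch(npu, hf):
    """npu / hf: one vector per token position for the same layer."""
    cos = [cosine(n, h) for n, h in zip(npu, hf)]
    max_abs = max(abs(x - y) for n, h in zip(npu, hf) for x, y in zip(n, h))
    return {
        "cos_p5": percentile(cos, 5),   # assumed meaning: 5th percentile over positions
        "cos_min": min(cos),
        "cos_median": median(cos),
        "max_abs": max_abs,             # largest element-wise absolute error
    }
```

Under this reading, `cos_min <= cos_p5 <= cos_median` always holds, which is consistent with every row of the tables below.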

Diagnosis is informational only. No threshold, no pass/fail, no exit code based on the cosine. Verify is the correctness signal; the diagnosis table tells you where the NPU implementation drifts most from HF (which layer, by how much), which is what you want when triaging a real verify failure or weighing a kernel-side optimization.

Latest cosine tables (NPU FA on, prompt = "The capital of France is")

Same prompt, same NPU end-to-end path, both checkpoints. Run side-by-side so the per-layer precision shape can be compared directly.

Instruct (`meta-llama/Llama-3.2-1B-Instruct`)

Layer	cos_p5	cos_min	cos_median	max_abs
0	0.993269	0.993257	0.993733	0.75
1	0.926400	0.908160	0.990950	22
2	0.927211	0.908539	0.988378	22
3	0.940698	0.927680	0.988209	24
4	0.951836	0.940504	0.987463	26
5	0.959359	0.950193	0.988150	28
6	0.965235	0.958839	0.988398	30
7	0.969200	0.964980	0.988053	30
8	0.975010	0.973589	0.989355	32
9	0.981512	0.980698	0.990487	34
10	0.983873	0.983115	0.990943	36
11	0.981148	0.978896	0.990446	36
12	0.976977	0.973395	0.990023	38
13	0.975324	0.970957	0.989895	42
14	0.971639	0.966981	0.990319	44
15	0.970669	0.966320	0.987503	10.83

Base (`meta-llama/Llama-3.2-1B`)

Layer	cos_p5	cos_min	cos_median	max_abs
0	0.991912	0.991241	0.994038	1.75
1	0.966095	0.959596	0.989646	7
2	0.960257	0.952361	0.988373	6
3	0.958956	0.950566	0.986123	7
4	0.970088	0.965457	0.985988	8
5	0.972773	0.969458	0.985526	9
6	0.974773	0.973999	0.983875	10
7	0.971905	0.968814	0.982661	10
8	0.955578	0.949168	0.987208	11
9	0.960433	0.959102	0.989534	12
10	0.965993	0.965948	0.990815	13
11	0.954954	0.949146	0.990970	13
12	0.941147	0.929415	0.989791	15
13	0.936710	0.923149	0.988866	16
14	0.929362	0.912219	0.987908	17
15	0.939495	0.924292	0.990349	4.013

How to read it

Worst-layer cos_p5 on either checkpoint is ~0.93 (0.926 at Instruct L1, 0.929 at Base L14). That is comfortably inside the bf16 noise floor (NPU and HF are both bf16, so this is apples-to-apples). Cosine is direction-only, so the underlying per-position direction agreement is high across all 16 layers.
Different fine-tunes have different per-layer shapes.
- Instruct: high at L0 (0.993), single dip at L1-L2 (~0.927), monotonic climb to a peak at L10 (0.984), gradual decline to ~0.971 by L15.
- Base: high at L0 (0.992), early dip at L1-L3 (~0.96), small mid-stack peak at L4-L7 (~0.97), second dip reaching the floor at L12-L14 (~0.93), slight recovery at L15.
Different fine-tuning produces different activation distributions per layer; bf16 round-off interacts with those distributions differently. Both pass verify.
max_abs differs sharply between checkpoints. Across L1-L14, Base max_abs sits in the 6-17 range; Instruct sits in 22-44. Instruction tuning amplifies certain pathways, so activations and their absolute deltas are larger; this is not a precision problem, because cosine is direction-only and stays in the same range on both checkpoints.
L15 is the post-final-norm cell. max_abs (~11 for Instruct, ~4 for Base) is much smaller than mid-stack because final_norm rescales the hidden state to unit-variance-ish magnitude.
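
As a small illustration of using the table for localization, the sketch below copies the cos_p5 columns from the two tables above and finds the worst layer on each checkpoint, plus the Instruct peak after the early dip. The helper names are hypothetical; the diagnosis tool itself is not assumed to expose them.

```python
# cos_p5 columns copied from the Instruct and Base tables above (index = layer).
instruct_cos_p5 = [
    0.993269, 0.926400, 0.927211, 0.940698, 0.951836, 0.959359, 0.965235, 0.969200,
    0.975010, 0.981512, 0.983873, 0.981148, 0.976977, 0.975324, 0.971639, 0.970669,
]
base_cos_p5 = [
    0.991912, 0.966095, 0.960257, 0.958956, 0.970088, 0.972773, 0.974773, 0.971905,
    0.955578, 0.960433, 0.965993, 0.954954, 0.941147, 0.936710, 0.929362, 0.939495,
]


def worst_layer(column):
    """Index of the layer with the lowest cosine."""
    return min(range(len(column)), key=column.__getitem__)


def peak_after(column, start):
    """Index of the highest cosine among layers start..end."""
    return max(range(start, len(column)), key=column.__getitem__)


print("Instruct worst layer:", worst_layer(instruct_cos_p5))           # L1
print("Base worst layer:", worst_layer(base_cos_p5))                   # L14
print("Instruct peak after the dip:", peak_after(instruct_cos_p5, 1))  # L10
```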

C. Why this design verifies production

Three things have to hold for make verify to be a meaningful correctness signal: the version we test must be the version that ships, the reference we compare against must be trustworthy, and the comparison criterion must be sound for bf16. We address each below.

1. NpuRunner runs the actual production code

NpuRunner directly imports and invokes the production functions — no reimplementation:

from llama32_1b_inference import prepare_runtime
from llama32_1b_prefill   import run_transformer_block as run_prefill_block
from llama32_1b_decode    import compile_decode_kernels, run_decode_block

NpuRunner.__init__ compiles the same kernels production compiles and runs the same prepare_runtime setup. NpuRunner.prefill calls run_prefill_block for each of the 16 layers, then runs the production 8-partition LM-head GEMV. NpuRunner.decode_step calls run_decode_block. If NpuRunner produces the right tokens, llama32_1b_inference.py produces the right tokens — by construction.

2. HF transformers in bf16 is the right reference

Criterion	Choice
Canonical	`transformers.AutoModelForCausalLM` is the reference implementation that Meta + HuggingFace + the open-source LLM ecosystem maintain. Every bf16 LLM deployment (vLLM, llama.cpp, TRT-LLM, …) is qualified against this codebase.
Same dtype	Loaded as `torch_dtype=torch.bfloat16`, matching NPU production. Both sides hit the same bf16 round-off characteristics; the comparison is not testing a dtype gap.
Same weights	Both runners load `meta-llama/Llama-3.2-1B[-Instruct]` from the same HF cache. Identical bytes on disk.

HfRunner is ~110 lines that delegate to self.model(input_ids, use_cache=True). No transformer-block reimplementation, no custom kernel — the simpler the reference, the harder it is for the reference to be wrong.

3. Top-k token-level inclusion is the right criterion for bf16

Continuous metrics (cosine, KL) on bf16 logits are fragile as a pass/fail gate, and so is exact top-1 matching: bf16 ULP noise routinely flips per-step top-1 between two mathematically equivalent implementations. Discrete top-k inclusion is robust: bf16 noise can flip top-1 but rarely displaces a token from the top-5. compute_topk_set_check in comparators.py mirrors vLLM's tests/models/utils.py::check_logprobs_close; k=5 and n_tokens=32 are vLLM's defaults for the standard model gate.
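
A toy illustration of that robustness, using hand-picked numbers rather than real bf16 logits. `topk_ids_sketch` is a hypothetical helper that breaks ties toward the lower index, so its first element agrees with a first-maximum argmax; whether `topk_token_ids` in comparators.py does exactly this is an assumption.

```python
def topk_ids_sketch(logits, k=5):
    """Indices of the k largest logits; ties go to the lower index."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return order[:k]


# Two near-tied candidates (ids 0 and 1); a tiny perturbation swaps their order.
reference = [4.00, 3.99, 1.0, 0.5, 0.2, -1.0, -2.0]
perturbed = [3.99, 4.00, 1.0, 0.5, 0.2, -1.0, -2.0]

ref_top = topk_ids_sketch(reference)
per_top = topk_ids_sketch(perturbed)

top1_matches = ref_top[0] == per_top[0]                       # False: top-1 flipped
inclusion_holds = ref_top[0] in per_top and per_top[0] in ref_top  # True: still in top-5
print(top1_matches, inclusion_holds)
```

An exact top-1 comparison would flag this step as a mismatch, while the top-5 inclusion check accepts it.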

One `make verify` run, end to end

Step 1. Load prompts from verify/prompts/{instruct,base}.txt (selected by MODEL). make verify uses the first 2 (fast CI gate); make verify-full uses all 8.

↓

Step 2. Run both runners on the same prompt:

NpuRunner (production prefill + decode kernels, NPU FA on): greedy-decode 32 tokens, capturing chosen[i] + topk[i] (top-5 IDs) per step.

HfRunner (HF transformers in bf16): same 32-token greedy decode, same chosen[i] + topk[i] capture.
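
The capture loop both runners perform can be sketched against a generic logits callable. This is an assumption-laden illustration: `greedy_decode_capture` and `next_logits` are hypothetical names, and the sketch recomputes from the full token list each step instead of using a KV cache or separate prefill and decode kernels as the real runners do.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_decode_capture(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    n_tokens: int = 32,
    k: int = 5,
) -> Tuple[List[int], List[List[int]]]:
    """Greedy-decode n_tokens, recording the chosen token and top-k IDs per step."""
    tokens = list(prompt_ids)
    chosen: List[int] = []
    topk: List[List[int]] = []
    for _ in range(n_tokens):
        logits = next_logits(tokens)
        # Sort by descending logit, lower index first on ties.
        order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
        chosen.append(order[0])   # greedy choice = top-1
        topk.append(order[:k])    # top-k IDs, best first
        tokens.append(order[0])
    return chosen, topk


def toy_model(tokens: Sequence[int]) -> List[int]:
    """Stand-in model over an 8-token vocabulary: prefers (last token + 1) mod 8."""
    last = tokens[-1]
    return [-((i - last - 1) % 8) for i in range(8)]


chosen, topk = greedy_decode_capture(toy_model, [3], n_tokens=4)
print(chosen)   # [4, 5, 6, 7]
print(topk[0])  # [4, 5, 6, 7, 0]
```

Step 0 of the output corresponds to the prefill prediction; later steps correspond to decode tokens.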

↓

Step 3. compute_topk_set_check(npu_chosen, npu_topk, hf_chosen, hf_topk, k=5) walks both sequences in lockstep:

Same chosen → continue.
Different chosen → require both to land in the OTHER side's top-5; status OK or FAIL; stop.

↓

Step 4. Repeat steps 2-3 for every prompt in the run; Report.has_failure() returns True iff any record is FAIL.

↓

Step 5. Write verify_topk_token_*.{json,md}; exit 1 on FAIL else exit 0 (PASS).
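
Steps 4 and 5 amount to an any-FAIL aggregation mapped onto the process exit code. A minimal sketch follows; `ReportSketch` and `exit_code_sketch` are hypothetical, and the record layout is an assumption rather than the actual structure in `verify/report.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReportSketch:
    records: List[Dict[str, str]] = field(default_factory=list)

    def add(self, prompt: str, status: str) -> None:
        """Record one prompt's outcome; status is 'OK' or 'FAIL'."""
        self.records.append({"prompt": prompt, "status": status})

    def has_failure(self) -> bool:
        """True iff any recorded prompt failed."""
        return any(r["status"] == "FAIL" for r in self.records)


def exit_code_sketch(report: ReportSketch) -> int:
    """1 on any FAIL, else 0; a real runner would pass this to sys.exit()."""
    return 1 if report.has_failure() else 0


report = ReportSketch()
report.add("The capital of France is", "OK")
report.add("GPU stands for", "OK")
print(exit_code_sketch(report))  # 0
```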

What this catches and what it can miss

Catches (every step exercises the entire production stack):

Kernel correctness regressions in GEMV / GEMM / RMSNorm / RoPE / FlashAttention / LM-head GEMV / embedding lookup — a wrong implementation shifts logits enough to push a chosen token out of HF's top-5 within 32 steps on at least one of 8 diverse prompts.
Pipeline glue regressions: KV-cache layout, weight pre-transpose, per-layer BO tagging, LM-head partition aggregation.
Fine-tune-specific behavior: gating Instruct and Base separately catches regressions on either weight distribution.

Can miss:

Bugs that only manifest on prompts outside the 8 (the gate is finite; an lm-eval-harness GSM8K extension would broaden coverage).
Bugs that bias top-1 in a consistent direction without ever pushing a token out of top-5 (e.g., a uniform scale on every logit).
Code paths not exercised by the run (prompts longer than max_seq=2048, etc.).
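
The uniform-scale blind spot can be shown with a toy example. Scaling every logit by the same positive factor preserves their ordering, so greedy choices and top-5 sets are unchanged even though the softmax distribution is not. The numbers and helper names below are illustrative assumptions, not output from either runner.

```python
import math


def topk_ids(logits, k=5):
    """Indices of the k largest logits; ties go to the lower index."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


reference = [2.5, 1.0, 0.3, -0.7, 4.1, 3.3, -2.0, 0.9]
scaled = [0.5 * x for x in reference]  # hypothetical bug: every logit halved

same_topk = topk_ids(reference) == topk_ids(scaled)  # True: the gate sees no difference
p_ref = softmax(reference)[4]                        # probability of the top token
p_scaled = softmax(scaled)[4]                        # lower: the distribution flattened
print(same_topk, round(p_ref, 3), round(p_scaled, 3))
```

A check on logprob values rather than token IDs would be needed to surface this class of bug.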

File map

File	Responsibility
`Makefile` (parent)	`verify` / `diagnosis` / `clean` targets. `MODEL=base|instruct`, `PROMPT=…` for diagnosis.
`verify/verify_runner.py`	Orchestrator. Builds NPU + HF runners, loops prompts, calls the comparator, writes the report, exits 1 on FAIL.
`verify/comparators.py`	`topk_token_ids` (top-k with argmax-consistent tie-break), `compute_topk_set_check` (top-k token-level inclusion, mirrors vLLM's `check_logprobs_close`), plus diagnosis-only helpers (`per_position_cosine`, `error_metrics`, `compare_pair`).
`verify/report.py`	Report accumulator + JSON / markdown dumpers. `has_failure()` returns True iff any `npu_vs_hf` record is FAIL.
`verify/runners/npu_runner.py`	Imports + invokes the production prefill / decode / LM-head functions.
`verify/runners/hf_runner.py`	Loads `AutoModelForCausalLM` in `torch.bfloat16`; delegates to `model(input_ids, use_cache=True)`.
`verify/runners/_records.py`	`PrefillRecord` / `DecodeStepRecord` dataclasses shared by both runners.
`verify/prompts/instruct.txt`	8 instruction-style prompts (`MODEL=instruct`); 7 from `vllm/tests/prompts/example.txt` + 1 GPU-related swap.
`verify/prompts/base.txt`	8 continuation-style prompts (`MODEL=base`); incomplete sentences matched to base behavior.

Production-side touch points: llama32_1b_prefill.py::run_transformer_block populates ffn_out in the intermediates dict it already returns; diagnosis (which re-runs prefill layer-by-layer) reads it. Verify never reads any per-layer intermediates — it only consumes the final logits + chosen tokens.

How to reproduce these numbers

cd programming_examples/llama32_1b

make verify MODEL=instruct       # ~3m41s — top-k token-level inclusion gate, NPU vs HF bf16 (NPU FA on)
make verify MODEL=base           # ~3m39s — base checkpoint, continuation prompts

make diagnosis MODEL=instruct    # ~2m55s — per-layer ffn_out cosine table (NPU FA on)
make diagnosis MODEL=base        # same lens, base checkpoint

Reports land in verify/reports/{verify_topk_token_,diagnosis_}YYYYMMDD-HHMMSS.{json,md} (gitignored). The chosen MODEL, model_name, and (for verify) prompts_file are recorded in the report config so the file is unambiguous.

Companion: IMPLEMENTATION_GUIDE.html Part C (the original CI smoke that this subsystem extends).