Llama-3.2-1B Verification Subsystem

Two ways to look at the production NPU2 inference pipeline, both comparing against HuggingFace transformers in bf16. Companion to IMPLEMENTATION_GUIDE.html Part C.

Two lenses, one bf16 reference

make verify [MODEL=instruct|base] — the industry-standard correctness gate. 2 prompts × 32 greedy tokens (fast CI gate; make verify-full runs the full 8-prompt sweep), top-5 set inclusion vs HuggingFace transformers bf16 on the NPU end-to-end production path (NPU FlashAttention on, no CPU attention fallback). Lite-mode runners — no inside probing. ~2 minutes / run (verify-full: ~6 minutes). Default MODEL=instruct matches what production stacks deploy.
make diagnosis [MODEL=...] — the inside-probing lens. Single prompt's prefill, per-layer ffn_out cosine + max_abs (NPU vs HF bf16) for all 16 layers. Same end-to-end NPU production path as verify (NPU FlashAttention on). Informational only — diagnosis never fails the run. The verify gate is the correctness signal; this table is what you read by hand when verify flags an issue and you need to localize. ~2 minutes / run.
Why two lenses? verify answers "would this model deploy" using the exact criterion industry uses to qualify a BF16 LLM for production — discrete top-k judgment that is robust to bf16 ULP noise. diagnosis gives localization: a continuous-cosine table per layer that tells you where the NPU implementation drifts most from HF. The verify gate gates; the diagnosis lens informs.
Latest results (2026-05-15):

A. make verify — the correctness gate

The check (mirrors vLLM's check_logprobs_close)

  1. Each runner (NPU + HF bf16) greedy-decodes 32 tokens for one prompt, capturing the chosen token + top-5 token IDs at every step.
  2. Walk both sequences in lockstep. Same chosen token → continue. Different chosen tokens → each side's chosen token must appear in the other side's top-5; otherwise FAIL. Stop walking after the first divergence.
  3. All prompts in the run must pass; any FAIL exits with code 1. make verify runs 2 prompts (fast CI gate); make verify-full runs the full 8.

NPU runs the full production path (GEMV + RMSNorm + RoPE + FlashAttention + LM-head GEMV). Discrete top-k inclusion is robust to bf16 ULP noise: noise routinely flips per-step top-1 between mathematically equivalent implementations but rarely displaces a token from the top-5.
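The walk described in steps 1-3 can be sketched in a few lines of Python. This is a minimal illustration, not the project's compute_topk_set_check: the function name topk_set_check, its return shape, and the toy token IDs are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def topk_set_check(
    npu_chosen: List[int],
    npu_topk: List[List[int]],
    hf_chosen: List[int],
    hf_topk: List[List[int]],
) -> Tuple[bool, Optional[int]]:
    """Return (passed, first_divergence_step); the step is None if the runs never split."""
    for step, (npu_tok, hf_tok) in enumerate(zip(npu_chosen, hf_chosen)):
        if npu_tok == hf_tok:
            continue
        # First divergence: each side's choice must sit in the other side's top-k.
        passed = npu_tok in hf_topk[step] and hf_tok in npu_topk[step]
        return passed, step  # stop walking after the first divergence
    return True, None


# Toy run (hypothetical token IDs): the runners agree for two steps, then swap
# their top two candidates at step 2.
npu_chosen = [11, 22, 33]
hf_chosen = [11, 22, 44]
npu_topk = [[11, 1, 2, 3, 4], [22, 1, 2, 3, 4], [33, 44, 1, 2, 3]]
hf_topk = [[11, 1, 2, 3, 4], [22, 1, 2, 3, 4], [44, 33, 1, 2, 3]]

result = topk_set_check(npu_chosen, npu_topk, hf_chosen, hf_topk)
print(result)  # (True, 2)
```

The walk stops at the first divergence because the two runners condition on different prefixes from that step on, so later steps are no longer comparable.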

Two prompt sets, matched to checkpoint behavior

| # | Base (verify/prompts/base.txt) | Instruct (verify/prompts/instruct.txt) |
| --- | --- | --- |
| 0 | GPU stands for | Introduce me what is GPU |
| 1 | The capital of France is | Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020. |
| 2 | Artificial intelligence is a branch of computer science that | Compare and contrast artificial intelligence with human intelligence in terms of processing information. |
| 3 | A neural network consists of | Describe the basic components of a neural network and how it can be trained. |
| 4 | Once upon a time, there was a robot who dreamed about | Write a short story about a robot that dreams for the first time. |
| 5 | The COVID-19 pandemic, which began in late 2019, | Analyze the impact of the COVID-19 pandemic on global economic structures and future business models. |
| 6 | The Mona Lisa was painted by | Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies. |
| 7 | The French translation of "The early bird catches the worm" is | Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.' |

Topics deliberately mirror each other so base-vs-instruct comparisons read naturally row-by-row. Base prompts are intentionally incomplete sentences (the base model continues raw text rather than answering instructions). Instruct prompts are imperative requests (7 verbatim from vllm/tests/prompts/example.txt + 1 swapped for project relevance).

Per-prompt results (NPU vs HF bf16, k=5)

For each prompt we display the first divergence step (0-based; step 0 is the prefill prediction, step 1 is the first decode token); each side's chosen token at that step (decoded text, quoted so leading whitespace stays visible) plus its 1-based rank in the OTHER runner's top-5; and the agreed prefix — the actual generated text both runners produced identically before splitting.

Base checkpoint

| # | Prompt | Diverge | NPU choice (rank in HF) | HF choice (rank in NPU) | Agreed prefix |
| --- | --- | --- | --- | --- | --- |
| 0 | GPU stands for | 7 | " special" (#2) | " specialized" (#2) | " Graphics Processing Unit. It is a" |
| 1 | The capital of France is | 1 | "," (#2) | "." (#2) | " Paris" |
| 2 | Artificial intelligence is… | 7 | "," (#2) | "." (#2) | " deals with the creation of intelligent machines" |
| 3 | A neural network consists of | 3 | " nodes" (#2) | " interconnected" (#3) | " a set of" |
| 4 | Once upon a time, there was a robot… | 7 | " little" (#2) | " robot" (#2) | " being a human. He was a" |
| 5 | The COVID-19 pandemic… | 9 | "," (#2) | "." (#2) | " has had a significant impact on the global economy" |
| 6 | The Mona Lisa was painted by | 7 | " and" (#2) | "." (#3) | " Leonardo da Vinci in 1503" |
| 7 | The French translation… | 6 | " prend" (#3) | " g" (#2) | " "Le premier oisif" |

Instruct checkpoint

| # | Prompt | Diverge | NPU choice (rank in HF) | HF choice (rank in NPU) | Agreed prefix |
| --- | --- | --- | --- | --- | --- |
| 0 | Introduce me what is GPU | 0 | " acceleration" (#2) | " (" (#2) | (no prefix) |
| 1 | Briefly describe… | 0 | " Some" (#4) | " Key" (#3) | (no prefix) |
| 2 | Compare and contrast… | 8 | " (" (#4) | " are" (#2) | " Artificial intelligence (AI) and human intelligence" |
| 3 | Describe the basic components… | 20 | " multiple" (#2) | " three" (#2) | " \n\n## Step 1: Define the basic components of a neural network\nA neural network consists of" |
| 4 | Write a short story… | 11 | " model" (#3) | " android" (#2) | " It's a robot named Zeta, a highly advanced" |
| 5 | Analyze the impact of COVID… | none | (all 32 match) | (all 32 match) | (no divergence within sample) |
| 6 | Explain the cultural significance… | 29 | " Created" (#4) | " It" (#2) | " \n\nThe Mona Lisa, painted by Leonardo da Vinci in the early 16th century, is one of the most famous paintings in the world." |
| 7 | Translate the following… | 26 | " Here" (#2) | "The" (#2) | " This is a common English idiom that means…" |

Both checkpoints PASS the gate. Most divergences are #2/#2 swaps (both runners agreed on the same two top candidates; bf16 noise picked which ranked first); a few are #3/#4. None hit out-of-top-5. On Instruct, prompts 3, 6, 7 reach 20-29 tokens of agreement before splitting, and prompt 5 had zero divergence in the 32-token sample.

B. make diagnosis — the inside lens

What it does

Single prompt's prefill on NPU + HF bf16, then computes per-position cosine + element-wise abs error for each layer's ffn_out (the block output). For layers 0…n_layers−2, both sides expose the raw layer output. For layer n_layers−1, both sides expose the post-final-RMSNorm hidden state — HF surfaces this as hidden_states[n_layers] (post-norm by HF v5.3 convention); NPU produces the equivalent via the same final_norm step it does inside its production LM-head GEMV path. So both L15 cells correspond to "the value the LM-head sees".

Diagnosis is informational only. No threshold, no pass/fail, no exit code based on the cosine. Verify is the correctness signal; the diagnosis table tells you where the NPU implementation drifts most from HF (which layer, by how much), which is what you want when triaging a real verify failure or weighing a kernel-side optimization.
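As a rough sketch of how one row of the diagnosis table could be computed with NumPy: the helper names below echo per_position_cosine from comparators.py, but the bodies are illustrative rather than the project's code, and reading cos_p5 as the 5th percentile of the per-position cosines is an assumption based on the column name.

```python
import numpy as np


def per_position_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity at each sequence position for two (seq_len, hidden) arrays."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    dots = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dots / np.maximum(norms, 1e-12)  # guard against all-zero rows


def summarize_layer(npu_out: np.ndarray, hf_out: np.ndarray) -> dict:
    """One table row: cosine statistics over positions plus the largest abs error."""
    cos = per_position_cosine(npu_out, hf_out)
    diff = np.abs(npu_out.astype(np.float32) - hf_out.astype(np.float32))
    return {
        "cos_p5": float(np.percentile(cos, 5)),  # assumed meaning: 5th percentile
        "cos_min": float(cos.min()),
        "cos_median": float(np.median(cos)),
        "max_abs": float(diff.max()),
    }


# Synthetic stand-ins for one layer's ffn_out on each side (not real activations).
rng = np.random.default_rng(0)
hf_out = rng.standard_normal((6, 8)).astype(np.float32)
npu_out = hf_out + 0.01 * rng.standard_normal((6, 8)).astype(np.float32)
row = summarize_layer(npu_out, hf_out)
print(row)
```

Both inputs are upcast to float32 before the reduction so the metric itself does not add extra round-off on top of what is being measured.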

Latest cosine tables (NPU FA on, prompt = "The capital of France is")

Same prompt, same NPU end-to-end path, both checkpoints. Run side-by-side so the per-layer precision shape can be compared directly.

Instruct (meta-llama/Llama-3.2-1B-Instruct)

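| Layer | cos_p5 | cos_min | cos_median | max_abs |
| --- | --- | --- | --- | --- |
| 0 | 0.993269 | 0.993257 | 0.993733 | 0.75 |
| 1 | 0.926400 | 0.908160 | 0.990950 | 22 |
| 2 | 0.927211 | 0.908539 | 0.988378 | 22 |
| 3 | 0.940698 | 0.927680 | 0.988209 | 24 |
| 4 | 0.951836 | 0.940504 | 0.987463 | 26 |
| 5 | 0.959359 | 0.950193 | 0.988150 | 28 |
| 6 | 0.965235 | 0.958839 | 0.988398 | 30 |
| 7 | 0.969200 | 0.964980 | 0.988053 | 30 |
| 8 | 0.975010 | 0.973589 | 0.989355 | 32 |
| 9 | 0.981512 | 0.980698 | 0.990487 | 34 |
| 10 | 0.983873 | 0.983115 | 0.990943 | 36 |
| 11 | 0.981148 | 0.978896 | 0.990446 | 36 |
| 12 | 0.976977 | 0.973395 | 0.990023 | 38 |
| 13 | 0.975324 | 0.970957 | 0.989895 | 42 |
| 14 | 0.971639 | 0.966981 | 0.990319 | 44 |
| 15 | 0.970669 | 0.966320 | 0.987503 | 10.83 |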

Base (meta-llama/Llama-3.2-1B)

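| Layer | cos_p5 | cos_min | cos_median | max_abs |
| --- | --- | --- | --- | --- |
| 0 | 0.991912 | 0.991241 | 0.994038 | 1.75 |
| 1 | 0.966095 | 0.959596 | 0.989646 | 7 |
| 2 | 0.960257 | 0.952361 | 0.988373 | 6 |
| 3 | 0.958956 | 0.950566 | 0.986123 | 7 |
| 4 | 0.970088 | 0.965457 | 0.985988 | 8 |
| 5 | 0.972773 | 0.969458 | 0.985526 | 9 |
| 6 | 0.974773 | 0.973999 | 0.983875 | 10 |
| 7 | 0.971905 | 0.968814 | 0.982661 | 10 |
| 8 | 0.955578 | 0.949168 | 0.987208 | 11 |
| 9 | 0.960433 | 0.959102 | 0.989534 | 12 |
| 10 | 0.965993 | 0.965948 | 0.990815 | 13 |
| 11 | 0.954954 | 0.949146 | 0.990970 | 13 |
| 12 | 0.941147 | 0.929415 | 0.989791 | 15 |
| 13 | 0.936710 | 0.923149 | 0.988866 | 16 |
| 14 | 0.929362 | 0.912219 | 0.987908 | 17 |
| 15 | 0.939495 | 0.924292 | 0.990349 | 4.013 |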

How to read it

  1. Worst-layer cos_p5 on either checkpoint is ~0.93 (Instruct layer 1, Base layer 14); cos_min bottoms out at ~0.91 and cos_median stays at or above 0.98 on every layer. That is comfortably within bf16 noise (NPU and HF are both bf16, so this is apples-to-apples). Cosine is direction-only, so the underlying per-position direction agreement is high across all 16 layers.
  2. Different fine-tunes have different per-layer shapes. Different fine-tuning produces different activation distributions per layer; bf16 round-off interacts with those distributions differently. Both pass verify.
  3. Activation magnitude differs sharply between checkpoints. Base max_abs sits in the 6-17 range; Instruct sits in 22-44. Instruction tuning amplifies certain pathways; the bigger absolute deltas are not a precision problem (cosine is direction-only).
  4. L15 is the post-final-norm cell. max_abs (~10 for Instruct, ~4 for base) is much smaller than mid-stack because final_norm rescales the hidden state to unit-variance-ish magnitude.
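Points 3 and 4 both rest on cosine being direction-only while max_abs tracks magnitude. A small NumPy sketch makes that concrete; the vectors are synthetic stand-ins, and the helper names cosine and max_abs are chosen for this example only.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_abs(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference."""
    return float(np.max(np.abs(a - b)))


rng = np.random.default_rng(0)
ref = rng.standard_normal(2048).astype(np.float32)                  # stand-in reference activation
noisy = ref + 0.05 * rng.standard_normal(2048).astype(np.float32)   # stand-in perturbed activation

# Scale both sides by the same factor, as a larger-magnitude checkpoint would.
results = {}
for scale in (1.0, 4.0):
    results[scale] = (cosine(scale * ref, scale * noisy),
                      max_abs(scale * ref, scale * noisy))
    print(f"scale={scale}: cos={results[scale][0]:.6f} max_abs={results[scale][1]:.4f}")
```

Scaling both activations by 4 leaves the cosine unchanged and multiplies max_abs by 4, which is why the larger Instruct max_abs values do not by themselves indicate worse precision.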

C. Why this design verifies production

Three things have to hold for make verify to be a meaningful correctness signal: the version we test must be the version that ships, the reference we compare against must be trustworthy, and the comparison criterion must be sound for bf16. We address each below.

1. NpuRunner runs the actual production code

NpuRunner directly imports and invokes the production functions — no reimplementation:

from llama32_1b_inference import prepare_runtime
from llama32_1b_prefill   import run_transformer_block as run_prefill_block
from llama32_1b_decode    import compile_decode_kernels, run_decode_block

NpuRunner.__init__ compiles the same kernels production compiles and runs the same prepare_runtime setup. NpuRunner.prefill calls run_prefill_block for each of the 16 layers, then runs the production 8-partition LM-head GEMV. NpuRunner.decode_step calls run_decode_block. If NpuRunner produces the right tokens, llama32_1b_inference.py produces the right tokens — by construction.

2. HF transformers in bf16 is the right reference

| Criterion | Choice |
| --- | --- |
| Canonical | transformers.AutoModelForCausalLM is the reference implementation that Meta + HuggingFace + the open-source LLM ecosystem maintain. Every bf16 LLM deployment (vLLM, llama.cpp, TRT-LLM, …) is qualified against this codebase. |
| Same dtype | Loaded as torch_dtype=torch.bfloat16, matching NPU production. Both sides hit the same bf16 round-off characteristics; the comparison is not testing a dtype gap. |
| Same weights | Both runners load meta-llama/Llama-3.2-1B[-Instruct] from the same HF cache. Identical bytes on disk. |

HfRunner is ~110 lines that delegate to self.model(input_ids, use_cache=True). No transformer-block reimplementation, no custom kernel — the simpler the reference, the harder it is for the reference to be wrong.

3. Top-k token-level inclusion is the right criterion for bf16

Continuous metrics (cosine, KL) on bf16 logits are fragile: bf16 ULP noise routinely flips per-step top-1 between two mathematically equivalent implementations. Discrete top-k inclusion is robust — bf16 noise can flip top-1 but rarely displaces a token from the top-5. compute_topk_set_check in comparators.py mirrors vLLM's tests/models/utils.py::check_logprobs_close; k=5 and n_tokens=32 are vLLM's defaults for the standard model gate.
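To illustrate the claim, here is a toy example in which a perturbation of about one bf16 ULP flips top-1 while the top-5 set is unchanged. The helper topk_ids is a sketch of an argmax-consistent top-k (ties go to the lower token ID); it is not the project's topk_token_ids, and the logits are invented for the example.

```python
import numpy as np


def topk_ids(logits: np.ndarray, k: int = 5) -> list:
    """Top-k token IDs; a stable sort breaks ties toward the lower ID,
    so topk_ids(x)[0] always equals np.argmax(x)."""
    order = np.argsort(-logits.astype(np.float32), kind="stable")
    return [int(i) for i in order[:k]]


# Invented logits: tokens 3 and 7 are near-tied at the top.
ref = np.array([0.1, 0.2, 0.0, 5.00, 1.0, 2.0, 3.0, 4.98, 0.5, 0.3], dtype=np.float32)

# Nudge token 7 by 0.04. bf16 spacing for values in [4, 8) is 2**-5 = 0.03125,
# so this is on the order of a single ULP at that magnitude.
perturbed = ref.copy()
perturbed[7] += 0.04

ref_top = topk_ids(ref)
pert_top = topk_ids(perturbed)
print(ref_top)   # [3, 7, 6, 5, 4]
print(pert_top)  # [7, 3, 6, 5, 4]
```

Top-1 changes from token 3 to token 7, yet each side's choice is still rank 2 on the other side, which is the #2/#2 pattern seen in most rows of the result tables.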

One make verify run, end to end

Step 1. Load prompts from verify/prompts/{instruct,base}.txt (selected by MODEL). make verify uses the first 2 (fast CI gate); make verify-full uses all 8.
Step 2. For each prompt, run both runners:
  NpuRunner (production prefill + decode kernels, NPU FA on): greedy-decode 32 tokens, capturing chosen[i] + topk[i] (top-5 IDs) per step.
  HfRunner (HF transformers in bf16): same 32-token greedy decode, same chosen[i] + topk[i] capture.
Step 3. compute_topk_set_check(npu_chosen, npu_topk, hf_chosen, hf_topk, k=5) walks both sequences in lockstep and applies the rule from section A: matching chosen tokens continue; at the first mismatch, each side's chosen token must be in the other side's top-5, otherwise the prompt is a FAIL.
Step 4. Repeat steps 2-3 for every prompt in the run; Report.has_failure() returns True iff any record is FAIL.
Step 5. Write verify_topk_token_*.{json,md}; exit 1 on FAIL else exit 0 (PASS).
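Steps 4 and 5 reduce to accumulating one verdict per prompt and mapping "any FAIL" to a non-zero exit code. The sketch below shows that shape only; ReportSketch, Record, and exit_code are hypothetical names for this example, and the real verify/report.py may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    prompt: str
    verdict: str  # "PASS" or "FAIL"


@dataclass
class ReportSketch:
    """Hypothetical stand-in for the report accumulator."""
    records: List[Record] = field(default_factory=list)

    def add(self, prompt: str, passed: bool) -> None:
        self.records.append(Record(prompt, "PASS" if passed else "FAIL"))

    def has_failure(self) -> bool:
        # True iff at least one prompt failed the top-k check.
        return any(r.verdict == "FAIL" for r in self.records)


def exit_code(report: ReportSketch) -> int:
    """Process exit code for the gate: 1 on any FAIL, else 0."""
    return 1 if report.has_failure() else 0


report = ReportSketch()
report.add("The capital of France is", passed=True)
report.add("GPU stands for", passed=True)
print(exit_code(report))  # 0
```

A caller would typically pass the result to sys.exit so that CI sees the failure.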

What this catches and what it can miss

Catches (every step exercises the entire production stack):

Can miss:

File map

| File | Responsibility |
| --- | --- |
| Makefile (parent) | verify / diagnosis / clean targets. MODEL=base or instruct, PROMPT=… for diagnosis. |
| verify/verify_runner.py | Orchestrator. Builds NPU + HF runners, loops prompts, calls the comparator, writes the report, exits 1 on FAIL. |
| verify/comparators.py | topk_token_ids (top-k with argmax-consistent tie-break), compute_topk_set_check (top-k token-level inclusion, mirrors vLLM's check_logprobs_close), plus diagnosis-only helpers (per_position_cosine, error_metrics, compare_pair). |
| verify/report.py | Report accumulator + JSON / markdown dumpers. has_failure() returns True iff any npu_vs_hf record is FAIL. |
| verify/runners/npu_runner.py | Imports + invokes the production prefill / decode / LM-head functions. |
| verify/runners/hf_runner.py | Loads AutoModelForCausalLM in torch.bfloat16; delegates to model(input_ids, use_cache=True). |
| verify/runners/_records.py | PrefillRecord / DecodeStepRecord dataclasses shared by both runners. |
| verify/prompts/instruct.txt | 8 instruction-style prompts (MODEL=instruct); 7 from vllm/tests/prompts/example.txt + 1 GPU-related swap. |
| verify/prompts/base.txt | 8 continuation-style prompts (MODEL=base); incomplete sentences matched to base behavior. |

Production-side touch points: llama32_1b_prefill.py::run_transformer_block populates ffn_out in the intermediates dict it already returns; diagnosis (which re-runs prefill layer-by-layer) reads it. Verify never reads any per-layer intermediates — it only consumes the final logits + chosen tokens.

How to reproduce these numbers

cd programming_examples/llama32_1b

make verify MODEL=instruct       # ~3m41s — top-k token-level inclusion gate, NPU vs HF bf16 (NPU FA on)
make verify MODEL=base           # ~3m39s — base checkpoint, continuation prompts

make diagnosis MODEL=instruct    # ~2m55s — per-layer ffn_out cosine table (NPU FA on)
make diagnosis MODEL=base        # same lens, base checkpoint

Reports land in verify/reports/{verify_topk_token_,diagnosis_}YYYYMMDD-HHMMSS.{json,md} (gitignored). The chosen MODEL, model_name, and (for verify) prompts_file are recorded in the report config so the file is unambiguous.


Companion: IMPLEMENTATION_GUIDE.html Part C (the original CI smoke that this subsystem extends).