Two ways to look at the production NPU2 inference pipeline, both comparing against HuggingFace transformers in bf16. Companion to IMPLEMENTATION_GUIDE.html Part C.
make verify [MODEL=instruct|base] — the industry-standard correctness gate. 2 prompts × 32 greedy tokens (fast CI gate; make verify-full runs the full 8-prompt sweep), top-5 set inclusion vs HuggingFace transformers bf16 on the NPU end-to-end production path (NPU FlashAttention on, no CPU attention fallback). Lite-mode runners — no inside probing. ~2 minutes / run (verify-full: ~6 minutes). Default MODEL=instruct matches what production stacks deploy.
make diagnosis [MODEL=...] — the inside-probing lens. Single prompt's prefill, per-layer ffn_out cosine + max_abs (NPU vs HF bf16) for all 16 layers. Same end-to-end NPU production path as verify (NPU FlashAttention on). Informational only — diagnosis never fails the run. The verify gate is the correctness signal; this table is what you read by hand when verify flags an issue and you need to localize. ~2 minutes / run.
verify answers "would this model deploy" using the same criterion vLLM's model tests apply (check_logprobs_close): a discrete top-k judgment that is robust to bf16 ULP noise. diagnosis gives localization: a continuous-cosine table per layer that shows where the NPU implementation drifts most from HF. The verify gate gates; the diagnosis lens informs.
- make verify MODEL=instruct: 8/8 PASS, ~3m41s
- make verify MODEL=base: 8/8 PASS, ~3m39s
- make diagnosis MODEL=instruct (NPU FA on): cos_p5 in [0.926, 0.993], U-shape with single L1-L2 dip and L10 peak.
- make diagnosis MODEL=base (NPU FA on): cos_p5 in [0.929, 0.992], double-dip shape (L1-L3 and L12-L14). Same-checkpoint dependence on prompt + fine-tune is what diagnosis surfaces; both pass verify regardless. See Part B.

make verify: the correctness gate

- Criterion: top-5 set inclusion (mirrors vLLM's check_logprobs_close).
- make verify runs 2 prompts (fast CI gate); make verify-full runs the full 8.
- NPU runs the full production path (GEMV + RMSNorm + RoPE + FlashAttention + LM-head GEMV).

Discrete top-k inclusion is robust to bf16 ULP noise: noise routinely flips per-step top-1 between mathematically equivalent implementations but rarely displaces a token from the top-5.

| # | Base (verify/prompts/base.txt) | Instruct (verify/prompts/instruct.txt) |
|---|---|---|
| 0 | GPU stands for | Introduce me what is GPU |
| 1 | The capital of France is | Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020. |
| 2 | Artificial intelligence is a branch of computer science that | Compare and contrast artificial intelligence with human intelligence in terms of processing information. |
| 3 | A neural network consists of | Describe the basic components of a neural network and how it can be trained. |
| 4 | Once upon a time, there was a robot who dreamed about | Write a short story about a robot that dreams for the first time. |
| 5 | The COVID-19 pandemic, which began in late 2019, | Analyze the impact of the COVID-19 pandemic on global economic structures and future business models. |
| 6 | The Mona Lisa was painted by | Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies. |
| 7 | The French translation of "The early bird catches the worm" is | Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.' |
Topics deliberately mirror each other so base-vs-instruct comparisons read naturally row-by-row. Base prompts are intentionally incomplete sentences (the base model continues raw text rather than answering instructions). Instruct prompts are imperative requests (7 verbatim from vllm/tests/prompts/example.txt + 1 swapped for project relevance).
For each prompt we display the first divergence step (0-based; step 0 is the prefill prediction, step 1 is the first decode token); each side's chosen token at that step (decoded text, quoted so leading whitespace stays visible) plus its 1-based rank in the OTHER runner's top-5; and the agreed prefix — the actual generated text both runners produced identically before splitting.
| # | Prompt | Diverge | NPU choice (rank in HF) | HF choice (rank in NPU) | Agreed prefix |
|---|---|---|---|---|---|
| 0 | GPU stands for | 7 | ✓ " special" (#2) | ✓ " specialized" (#2) | " Graphics Processing Unit. It is a" |
| 1 | The capital of France is | 1 | ✓ "," (#2) | ✓ "." (#2) | " Paris" |
| 2 | Artificial intelligence is… | 7 | ✓ "," (#2) | ✓ "." (#2) | " deals with the creation of intelligent machines" |
| 3 | A neural network consists of | 3 | ✓ " nodes" (#2) | ✓ " interconnected" (#3) | " a set of" |
| 4 | Once upon a time, there was a robot… | 7 | ✓ " little" (#2) | ✓ " robot" (#2) | " being a human. He was a" |
| 5 | The COVID-19 pandemic… | 9 | ✓ "," (#2) | ✓ "." (#2) | " has had a significant impact on the global economy" |
| 6 | The Mona Lisa was painted by | 7 | ✓ " and" (#2) | ✓ "." (#3) | " Leonardo da Vinci in 1503" |
| 7 | The French translation… | 6 | ✓ " prend" (#3) | ✓ " g" (#2) | " "Le premier oisif" |
| # | Prompt | Diverge | NPU choice (rank in HF) | HF choice (rank in NPU) | Agreed prefix |
|---|---|---|---|---|---|
| 0 | Introduce me what is GPU | 0 | ✓ " acceleration" (#2) | ✓ " (" (#2) | (no prefix) |
| 1 | Briefly describe… | 0 | ✓ " Some" (#4) | ✓ " Key" (#3) | (no prefix) |
| 2 | Compare and contrast… | 8 | ✓ " (" (#4) | ✓ " are" (#2) | " Artificial intelligence (AI) and human intelligence" |
| 3 | Describe the basic components… | 20 | ✓ " multiple" (#2) | ✓ " three" (#2) | " \n\n## Step 1: Define the basic components of a neural network\nA neural network consists of" |
| 4 | Write a short story… | 11 | ✓ " model" (#3) | ✓ " android" (#2) | " It's a robot named Zeta, a highly advanced" |
| 5 | Analyze the impact of COVID… | — | ✓ (all 32 match) | ✓ (all 32 match) | (no divergence within sample) |
| 6 | Explain the cultural significance… | 29 | ✓ " Created" (#4) | ✓ " It" (#2) | " \n\nThe Mona Lisa, painted by Leonardo da Vinci in the early 16th century, is one of the most famous paintings in the world." |
| 7 | Translate the following… | 26 | ✓ " Here" (#2) | ✓ "The" (#2) | " This is a common English idiom that means…" |
Both checkpoints PASS the gate. Eight of the 15 divergences are #2/#2 swaps (both runners agreed on the same two top candidates; bf16 noise picked which ranked first); the other seven involve a #3 or #4 rank on at least one side. None fall outside the top-5. On Instruct, prompts 3, 6, and 7 reach 20-29 tokens of agreement before splitting, and prompt 5 has no divergence in the 32-token sample.
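
As a rough illustration of how the columns in the tables above could be derived, the sketch below computes the first divergence step, each side's 1-based rank in the other runner's top-5, and the agreed prefix. The function names and the toy token IDs are hypothetical and are not taken from verify/comparators.py.

```python
from typing import List, Optional


def first_divergence(npu_chosen: List[int], hf_chosen: List[int]) -> Optional[int]:
    """Return the 0-based step where the two sequences first differ, or None."""
    for step, (a, b) in enumerate(zip(npu_chosen, hf_chosen)):
        if a != b:
            return step
    return None


def rank_in_topk(token: int, other_topk: List[int]) -> Optional[int]:
    """1-based rank of `token` in the other runner's top-k list, or None if absent."""
    return other_topk.index(token) + 1 if token in other_topk else None


def divergence_row(npu_chosen, npu_topk, hf_chosen, hf_topk):
    """Build (step, NPU rank in HF, HF rank in NPU, agreed prefix IDs)."""
    step = first_divergence(npu_chosen, hf_chosen)
    if step is None:
        # No divergence within the sample: the whole sequence is the agreed prefix.
        return None, None, None, list(npu_chosen)
    return (
        step,
        rank_in_topk(npu_chosen[step], hf_topk[step]),
        rank_in_topk(hf_chosen[step], npu_topk[step]),
        list(npu_chosen[:step]),
    )


# Toy data (made-up token IDs): the runners agree for 2 steps, then swap #1 and #2.
npu_chosen = [11, 22, 33]
hf_chosen = [11, 22, 44]
npu_topk = [[11, 9, 8, 7, 6], [22, 9, 8, 7, 6], [33, 44, 8, 7, 6]]
hf_topk = [[11, 9, 8, 7, 6], [22, 9, 8, 7, 6], [44, 33, 8, 7, 6]]

row = divergence_row(npu_chosen, npu_topk, hf_chosen, hf_topk)
print(row)  # (2, 2, 2, [11, 22])
```

A rank of None on either side would correspond to an out-of-top-5 token, which is the case the gate rejects.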
make diagnosis: the inside lens

Runs a single prompt's prefill on NPU and HF bf16, then computes per-position cosine and element-wise absolute error for each layer's ffn_out (the block output). For layers 0…n_layers−2, both sides expose the raw layer output. For layer n_layers−1, both sides expose the post-final-RMSNorm hidden state: HF surfaces this as hidden_states[n_layers] (post-norm by HF v5.3 convention), and NPU produces the equivalent via the same final_norm step it runs inside its production LM-head GEMV path. So both layer-15 (L15) cells correspond to "the value the LM-head sees".
Diagnosis is informational only. No threshold, no pass/fail, no exit code based on the cosine. Verify is the correctness signal; the diagnosis table tells you where the NPU implementation drifts most from HF (which layer, by how much), which is what you want when triaging a real verify failure or weighing a kernel-side optimization.
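
To make the table columns concrete, here is a minimal stdlib-only sketch of per-position cosine plus the summary statistics (cos_p5, cos_min, cos_median, max_abs) for one layer. It is not the project's per_position_cosine or error_metrics; the function names, the linear-interpolated percentile definition, and the toy vectors are assumptions made for illustration.

```python
import math
import statistics
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linear-interpolated percentile for q in [0, 100] (an assumed definition)."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def layer_stats(npu: List[List[float]], hf: List[List[float]]) -> Dict[str, float]:
    """Summarize one layer: inputs are [positions][hidden_dim] lists."""
    cos = sorted(cosine(a, b) for a, b in zip(npu, hf))
    max_abs = max(abs(x - y) for a, b in zip(npu, hf) for x, y in zip(a, b))
    return {
        "cos_p5": percentile(cos, 5),
        "cos_min": cos[0],
        "cos_median": statistics.median(cos),
        "max_abs": max_abs,
    }


# Toy layer output: 3 positions, hidden size 2; only the last position disagrees.
npu = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
hf = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
stats = layer_stats(npu, hf)
print(stats)
```

Because cos_p5 and cos_min look at the worst positions while cos_median looks at the typical one, a large gap between them points at a few outlier positions rather than uniform drift.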
Same prompt, same NPU end-to-end path, both checkpoints. Run side-by-side so the per-layer precision shape can be compared directly.
Instruct (meta-llama/Llama-3.2-1B-Instruct)

| Layer | cos_p5 | cos_min | cos_median | max_abs |
|---|---|---|---|---|
| 0 | 0.993269 | 0.993257 | 0.993733 | 0.75 |
| 1 | 0.926400 | 0.908160 | 0.990950 | 22 |
| 2 | 0.927211 | 0.908539 | 0.988378 | 22 |
| 3 | 0.940698 | 0.927680 | 0.988209 | 24 |
| 4 | 0.951836 | 0.940504 | 0.987463 | 26 |
| 5 | 0.959359 | 0.950193 | 0.988150 | 28 |
| 6 | 0.965235 | 0.958839 | 0.988398 | 30 |
| 7 | 0.969200 | 0.964980 | 0.988053 | 30 |
| 8 | 0.975010 | 0.973589 | 0.989355 | 32 |
| 9 | 0.981512 | 0.980698 | 0.990487 | 34 |
| 10 | 0.983873 | 0.983115 | 0.990943 | 36 |
| 11 | 0.981148 | 0.978896 | 0.990446 | 36 |
| 12 | 0.976977 | 0.973395 | 0.990023 | 38 |
| 13 | 0.975324 | 0.970957 | 0.989895 | 42 |
| 14 | 0.971639 | 0.966981 | 0.990319 | 44 |
| 15 | 0.970669 | 0.966320 | 0.987503 | 10.83 |

Base (meta-llama/Llama-3.2-1B)

| Layer | cos_p5 | cos_min | cos_median | max_abs |
|---|---|---|---|---|
| 0 | 0.991912 | 0.991241 | 0.994038 | 1.75 |
| 1 | 0.966095 | 0.959596 | 0.989646 | 7 |
| 2 | 0.960257 | 0.952361 | 0.988373 | 6 |
| 3 | 0.958956 | 0.950566 | 0.986123 | 7 |
| 4 | 0.970088 | 0.965457 | 0.985988 | 8 |
| 5 | 0.972773 | 0.969458 | 0.985526 | 9 |
| 6 | 0.974773 | 0.973999 | 0.983875 | 10 |
| 7 | 0.971905 | 0.968814 | 0.982661 | 10 |
| 8 | 0.955578 | 0.949168 | 0.987208 | 11 |
| 9 | 0.960433 | 0.959102 | 0.989534 | 12 |
| 10 | 0.965993 | 0.965948 | 0.990815 | 13 |
| 11 | 0.954954 | 0.949146 | 0.990970 | 13 |
| 12 | 0.941147 | 0.929415 | 0.989791 | 15 |
| 13 | 0.936710 | 0.923149 | 0.988866 | 16 |
| 14 | 0.929362 | 0.912219 | 0.987908 | 17 |
| 15 | 0.939495 | 0.924292 | 0.990349 | 4.013 |
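
When reading the tables above, it helps to remember that cosine measures direction only, while max_abs grows with activation magnitude. The toy sketch below scales a pair of vectors by a common factor (a simplified stand-in for larger or smaller hidden-state magnitudes, not a model of final_norm itself) and shows max_abs changing while cosine stays put. All values are made up.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_abs(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


npu = [3.0, 4.0, 0.5]
hf = [3.0, 3.5, 1.0]

scale = 8.0  # hypothetical magnitude growth applied to both sides
big_npu = [scale * x for x in npu]
big_hf = [scale * y for y in hf]

print(max_abs(npu, hf), max_abs(big_npu, big_hf))  # 0.5 4.0
print(math.isclose(cosine(npu, hf), cosine(big_npu, big_hf)))  # True
```

This is why a layer can show a larger max_abs than another without its cosine being any worse.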

- Base max_abs sits in the 6-17 range across layers 1-14; Instruct sits in 22-44. Instruction tuning amplifies certain pathways; the bigger absolute deltas are not a precision problem (cosine is direction-only).
- Layer 15 max_abs (~10 for Instruct, ~4 for base) is much smaller than mid-stack because final_norm rescales the hidden state to unit-variance-ish magnitude.

Three things have to hold for make verify to be a meaningful correctness signal: the version we test must be the version that ships, the reference we compare against must be trustworthy, and the comparison criterion must be sound for bf16. We address each below.
NpuRunner directly imports and invokes the production functions — no reimplementation:
from llama32_1b_inference import prepare_runtime
from llama32_1b_prefill import run_transformer_block as run_prefill_block
from llama32_1b_decode import compile_decode_kernels, run_decode_block
NpuRunner.__init__ compiles the same kernels production compiles and runs the same prepare_runtime setup. NpuRunner.prefill calls run_prefill_block for each of the 16 layers, then runs the production 8-partition LM-head GEMV. NpuRunner.decode_step calls run_decode_block. If NpuRunner produces the right tokens, llama32_1b_inference.py produces the right tokens — by construction.
| Criterion | Choice |
|---|---|
| Canonical | transformers.AutoModelForCausalLM is the reference implementation that Meta, HuggingFace, and the open-source LLM ecosystem maintain. Serving stacks such as vLLM qualify their bf16 outputs against this codebase. |
| Same dtype | Loaded as torch_dtype=torch.bfloat16, matching NPU production. Both sides hit the same bf16 round-off characteristics; the comparison is not testing a dtype gap. |
| Same weights | Both runners load meta-llama/Llama-3.2-1B[-Instruct] from the same HF cache. Identical bytes on disk. |
HfRunner is ~110 lines that delegate to self.model(input_ids, use_cache=True). No transformer-block reimplementation, no custom kernel — the simpler the reference, the harder it is for the reference to be wrong.
Continuous metrics (cosine, KL) on bf16 logits are fragile: bf16 ULP noise routinely flips per-step top-1 between two mathematically equivalent implementations. Discrete top-k inclusion is robust — bf16 noise can flip top-1 but rarely displaces a token from the top-5. compute_topk_set_check in comparators.py mirrors vLLM's tests/models/utils.py::check_logprobs_close; k=5 and n_tokens=32 are vLLM's defaults for the standard model gate.
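
The criterion described above can be sketched in a few lines. This is a simplified illustration and not the code in comparators.py: the names topk_ids and topk_set_check are hypothetical, logits are plain Python lists, and the tie-break rule (lower token ID wins) is an assumption chosen so that the first top-k entry agrees with a first-max argmax.

```python
from typing import List, Sequence


def topk_ids(logits: Sequence[float], k: int = 5) -> List[int]:
    """Top-k token IDs; ties break toward the lower ID so ids[0] matches argmax."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return order[:k]


def topk_set_check(
    npu_chosen: List[int],
    npu_topk: List[List[int]],
    hf_chosen: List[int],
    hf_topk: List[List[int]],
) -> str:
    """Return "OK" or "FAIL" using top-k inclusion at the first divergence."""
    for step, (n_tok, h_tok) in enumerate(zip(npu_chosen, hf_chosen)):
        if n_tok == h_tok:
            continue
        # First divergence: each side's token must appear in the other side's top-k.
        ok = n_tok in hf_topk[step] and h_tok in npu_topk[step]
        # After a divergence the two contexts differ, so later steps are not compared.
        return "OK" if ok else "FAIL"
    return "OK"  # identical sequences


logits = [0.1, 2.0, 2.0, -1.0, 0.5, 1.9, 0.0]
print(topk_ids(logits))  # [1, 2, 5, 4, 0]

# Step 1 diverges (2 vs 3), but each token is in the other side's top-5.
result = topk_set_check(
    [1, 2], [[1, 9, 8, 7, 6], [2, 3, 8, 7, 6]],
    [1, 3], [[1, 9, 8, 7, 6], [3, 2, 8, 7, 6]],
)
print(result)  # OK
```

Stopping at the first divergence matters: once the two runners have emitted different tokens, their subsequent logits condition on different prefixes and are no longer comparable step by step.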
make verify run, end to end

- Prompts come from verify/prompts/{instruct,base}.txt (selected by MODEL). make verify uses the first 2 (fast CI gate); make verify-full uses all 8.
- The NPU runner records chosen[i] + topk[i] (top-5 IDs) per step.
- The HF runner performs the same chosen[i] + topk[i] capture.
- compute_topk_set_check(npu_chosen, npu_topk, hf_chosen, hf_topk, k=5) walks both sequences in lockstep: at the first step where the chosen tokens differ, each side's chosen token must be in the other side's top-5, giving OK or FAIL; stop.
- Report.has_failure() returns True iff any record is FAIL.
- Reports are written to verify_topk_token_*.{json,md}; exit 1 on FAIL else exit 0 (PASS).

Catches (every step exercises the entire production stack):
Can miss:
| File | Responsibility |
|---|---|
| Makefile (parent) | verify / diagnosis / clean targets. MODEL=base\|instruct, PROMPT=… for diagnosis. |
| verify/verify_runner.py | Orchestrator. Builds NPU + HF runners, loops prompts, calls the comparator, writes the report, exits 1 on FAIL. |
| verify/comparators.py | topk_token_ids (top-k with argmax-consistent tie-break), compute_topk_set_check (top-k token-level inclusion, mirrors vLLM's check_logprobs_close), plus diagnosis-only helpers (per_position_cosine, error_metrics, compare_pair). |
| verify/report.py | Report accumulator + JSON / markdown dumpers. has_failure() returns True iff any npu_vs_hf record is FAIL. |
| verify/runners/npu_runner.py | Imports + invokes the production prefill / decode / LM-head functions. |
| verify/runners/hf_runner.py | Loads AutoModelForCausalLM in torch.bfloat16; delegates to model(input_ids, use_cache=True). |
| verify/runners/_records.py | PrefillRecord / DecodeStepRecord dataclasses shared by both runners. |
| verify/prompts/instruct.txt | 8 instruction-style prompts (MODEL=instruct); 7 from vllm/tests/prompts/example.txt + 1 GPU-related swap. |
| verify/prompts/base.txt | 8 continuation-style prompts (MODEL=base); incomplete sentences matched to base behavior. |
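
The accumulator-plus-exit-code contract listed for verify/report.py above could look roughly like the sketch below. It is an illustration only: the class names Record and MiniReport, their fields, and the exit_code helper are hypothetical and do not describe the real Report class.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Record:
    prompt_index: int
    status: str  # "OK" or "FAIL" (hypothetical field names)


@dataclass
class MiniReport:
    records: List[Record] = field(default_factory=list)

    def add(self, record: Record) -> None:
        self.records.append(record)

    def has_failure(self) -> bool:
        """True iff any record is FAIL."""
        return any(r.status == "FAIL" for r in self.records)

    def to_json(self) -> str:
        """Serialize all records as a JSON array."""
        return json.dumps([asdict(r) for r in self.records], indent=2)

    def exit_code(self) -> int:
        """1 on FAIL, else 0 (PASS), matching the gate's exit convention."""
        return 1 if self.has_failure() else 0


report = MiniReport()
report.add(Record(prompt_index=0, status="OK"))
report.add(Record(prompt_index=1, status="OK"))
print(report.exit_code())  # 0

report.add(Record(prompt_index=2, status="FAIL"))
print(report.exit_code())  # 1
```

An orchestrator would typically pass the result to sys.exit so that CI sees a non-zero status on any FAIL record.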
Production-side touch points: llama32_1b_prefill.py::run_transformer_block populates ffn_out in the intermediates dict it already returns; diagnosis (which re-runs prefill layer-by-layer) reads it. Verify never reads any per-layer intermediates — it only consumes the final logits + chosen tokens.
cd programming_examples/llama32_1b
make verify MODEL=instruct # ~3m41s — top-k token-level inclusion gate, NPU vs HF bf16 (NPU FA on)
make verify MODEL=base # ~3m39s — base checkpoint, continuation prompts
make diagnosis MODEL=instruct # ~2m55s — per-layer ffn_out cosine table (NPU FA on)
make diagnosis MODEL=base # same lens, base checkpoint
Reports land in verify/reports/{verify_topk_token_,diagnosis_}YYYYMMDD-HHMMSS.{json,md} (gitignored). The chosen MODEL, model_name, and (for verify) prompts_file are recorded in the report config so the file is unambiguous.
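
For readers scripting against the report directory, the naming pattern above can be reproduced with a standard strftime format. The helper name report_paths is hypothetical, and the sketch only builds path strings; it does not write any files.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import List


def report_paths(kind: str, now: datetime) -> List[str]:
    """Build the .json and .md report paths for `kind` at timestamp `now`."""
    stamp = now.strftime("%Y%m%d-%H%M%S")  # YYYYMMDD-HHMMSS
    base = PurePosixPath("verify/reports")
    return [str(base / f"{kind}_{stamp}.{ext}") for ext in ("json", "md")]


paths = report_paths("verify_topk_token", datetime(2025, 1, 2, 3, 4, 5))
print(paths[0])  # verify/reports/verify_topk_token_20250102-030405.json
print(paths[1])  # verify/reports/verify_topk_token_20250102-030405.md
```

Because the timestamp is zero-padded and ordered from year to second, sorting the filenames lexicographically also sorts them chronologically.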
Companion: IMPLEMENTATION_GUIDE.html Part C (the original CI smoke that this subsystem extends).