Llama-3.2-1B Performance Profile (NPU2)

Per-step wall-time attribution of the production prefill + decode pipeline, end-to-end. Diagrams mirror the dataflow in IMPLEMENTATION_GUIDE.html Part B1; numbers are reproduced from a single make profile run on NPU2 (AMD Strix), seq_len=2048, MODEL=instruct.

What make profile reports

make profile runs the same code path as make run — the production prefill + decode functions, end to end, real HuggingFace weights — and just enables the otherwise-disabled Profiler instance that cache.load_and_run already records into. There is no profile-only code path; any change to the production functions is automatically reflected in the profile.

The report (printed at the end of the run) opens with an architecture-aware dataflow summary (matches this page’s SVG order) and then dumps generic detail tables per phase (prefill / decode):

SectionWhat it tells you
END-TO-END DATAFLOW (at the top)Architecture-aware walkthrough: tokenize → eos_pad → embed → 16×(rms_gemms_rope + flash_attn + o_ffn + kv_cache_extract) → final_norm → lm_head_gemv. Each row tagged CPU/NPU/— with measured ms. Same ordering as the SVGs in Part A / Part B below. Also prints the one-time Preprocessing (prepare_runtime) wall as a reminder.
Wall-Time AttributionHow the total wall budget splits across NPU XRT calls, CPU host ops, and the layer-loop envelope (sanity check; remainder is python scheduling).
Per-Layer ExecutionOne row per layer for prefill; aggregated avg/min/max across tokens for decode.
NPU XRT Call BreakdownEach multi-launch ELF’s wall time per invocation, plus call count. The granularity is one XRT run = one merged ELF (sub-launches inside the ELF stay opaque, since that’s how production dispatches them).
CPU Op BreakdownEach tracked CPU host operation (tokenize, eos_pad, embed lookup, KV-cache extract, final RMSNorm, decode CPU attention).
Fine-Grained NPU BreakdownEach XRT call further split into BO Write / NPU Run / BO Read (concept explained in Part C).
Per-Token Wall Trend (decode only)Per-token layer-loop wall for token 1 / middle / last + min/max/avg + first→last drift. Lets you see whether per-token latency grows with KV-cache length (decode CPU attention is O(current_pos)). With a 2048-token prompt and 30 decode tokens the drift is typically <1%.

Headline numbers

Snapshot from the report (single run, instruct model, 30 decode tokens):

MetricWallNotes
TTFT (time-to-first-token, prefill end-to-end)~1.28 stokenize + EOS-pad + embed + 16×layer + final RMSNorm + LM head. Matches the vLLM / TGI / TRT-LLM TTFT metric (user-facing latency from request submit to first output token). 95% NPU-bound. Tokenize varies by prompt length; ~10 ms typical.
TPOT (per output token, steady-state decode)~92 ms (10.8 tok/s)16 layers × 4.95 ms each + 13.6 ms LM head + ~0.1 ms host wrappers. Slope vs token index is <1% over 30 tokens (KV cache grows by ~1.5% on a 2048-token prompt).
Preprocessing (one-time, prepare_runtime)~7.6 sCompile external kernels + pre-load weights into per-layer BOs. Happens once per process and is NOT included in TTFT.
CPU host op NPU XRT call (multi-launch ELF) FlashAttention (separate ELF, see B5)

Part A — Prefill (TTFT ~1.28 s)

One inference’s prefill phase: prompt → first generated token. Each box shows the step, where it runs, and the measured wall time. The 16 layers are identical; one iteration is shown in the “decoder block” container.

Tokenize + EOS-pad to seq_len CPU; HF chat template + tokenizer.encode + pad ~10 ms tokenize + ~0 ms pad Token embedding lookup CPU; numpy gather + bf16 cast ~5.8 ms x: [2048, 2048] bf16 Decoder block × L = 16 (one iteration shown; ~77.9 ms per layer; total ~1247 ms) rms_gemms_rope.elf — 1 xrt.run, 6 stitched launches RMSNorm + Q/K/V GEMM + RoPE Q + RoPE K 7.3 ms (BO write 0.5 / NPU 6.5 / BO read 0.1) q_roped, k_roped, v flash_attn.elf — 1 xrt.run, separate ELF 1 launch; un-mergeable (see B5) 21.6 ms (BO write 1.3 / NPU 20.1 / BO read 0.1) attn_out [2048, 2048] o_ffn.elf — 1 xrt.run, 8 stitched launches O + Add + RMSNorm + Gate/Up + SwiGLU + Down + Add 41.0 ms (BO write 1.0 / NPU 39.8 / BO read 0.1) x_next (= next layer's input) KV cache extract & write CPU; reshape + transpose + slice-assign of k_roped, v 1.1 ms per layer (×16 = 17.6 ms) (loop back; 16 layers total) Per layer total: 7.3 + 21.6 + 41.0 + 1.1 = 71.0 ms (kernel+CPU) Layer-loop wall: 77.9 ms → ~7 ms python/numpy scheduling overhead per layer 16 layers × 77.9 ms = 1247 ms x: [2048, 2048] after 16 layers Final RMSNorm @ row pred_pos CPU; only the 1 row needed for next-token argmax 3.1 ms [1, 2048] normed lm_head_gemv.elf — 1 xrt.run, 8 partitions Reuses decode-side ELF for the single-row projection (see A7) 13.6 ms (BO write 0 / NPU 13.5 / BO read 0) logits [1, 128256] → argmax First generated token TTFT (time-to-first-token): ~1280 ms = 10 (tokenize) + ~0 (pad) + 5.8 (embed) + 1247 (16 layers) + 3.1 (norm) + 13.6 (LM head) ≈ 1280 ms NPU XRT 1119 ms (87%) · CPU host 37 ms (3%) · python sched ~125 ms (10%, mostly inside layers)

Prefill: per-kernel and fine-grained tables

NPU XRT calls (16 layer-invocations of each, plus 1 LM head)
ELFLaunchesavg / callBO WriteNPU RunBO ReadBO MB written
rms_gemms_rope6 stitched7.3 ms0.5 ms6.5 ms0.1 ms8.0 MB
flash_attn (separate ELF)121.6 ms1.3 ms20.1 ms0.1 ms20.0 MB
o_ffn8 stitched41.0 ms1.0 ms39.8 ms0.1 ms16.0 MB
lm_head_gemv (prefill end)8 stitched13.6 ms~0 ms13.5 ms~0 ms~0 MB
CPU host ops (prefill side)
OpCountavgTotal
tokenize1~10 ms~10 ms
eos_pad1~0 ms~0 ms
embed_lookup15.8 ms5.8 ms
kv_cache_extract161.1 ms17.6 ms
final_rms_norm13.1 ms3.1 ms
Total CPU20~37 ms

Wall-time attribution check: NPU XRT 1119 ms (16 layer-invocations × 3 kernels + 1 LM head = 49 calls) + CPU host ~37 ms = ~1156 ms accounted, vs. TTFT ~1280 ms → ~125 ms unattributed python/numpy scheduling, mostly inside the layer loop.

Part B — Decode (per token ~92 ms)

Per-token decode step: takes the last produced token, returns the next. Diagram and numbers cover one token; the loop repeats until EOT. Each kernel reflects an avg over 30 decode tokens, 16 layers.

Embed lookup (next-token id → row) CPU; weights.embed_table[id].astype(bf16) ~0 ms (single row gather) x: [2048] bf16 Decoder block × L = 16 (one iteration shown; ~5.0 ms per layer; total ~79 ms) rms_gemv_rope.elf — 1 xrt.run, 6 stitched launches RMSNorm + Q/K/V GEMV + RoPE Q + RoPE K (single token) 0.9 ms (BO write 0 / NPU 0.8 / BO read 0) q_roped [2048]; k_roped, v [512] each decode_attention_cpu CPU single-query attention against KV cache (head_dim=64; FA NPU has too much overhead at single-query) 0.3 ms per layer attn_out [2048] o_gemv_ffn.elf — 1 xrt.run, 8 stitched launches O + Add + RMSNorm + Gate/Up + SwiGLU + Down + Add 3.7 ms (BO write 0 / NPU 3.6 / BO read 0) x_next (= next layer's input) append k,v at pos Per layer total: 0.9 + 0.3 + 3.7 = 4.9 ms (kernel+CPU) Layer-loop wall: 4.95 ms → ~0.05 ms python/numpy overhead per layer 16 layers × 4.95 ms = 79.2 ms x: [2048] after 16 layers Final RMSNorm CPU; single row, F32 internal 0.07 ms [1, 2048] normed lm_head_gemv.elf — 1 xrt.run, 8 partitions 8-partition GEMV stitched in 1 ELF 13.6 ms (NPU 13.5 dominates) logits [1, 128256] → argmax next token id Total per-token wall: ~92 ms = ~0 (embed) + 79.2 (16 layers) + 0.07 (norm) + 13.6 (LM head) ≈ 93 ms NPU XRT ~85 ms (92%) · CPU host ~5 ms (5%) · LM head dominates the per-token bill at 15%

Decode: per-kernel and fine-grained tables

NPU XRT calls (avg over 30 decode tokens × 16 layers)
ELFLaunchesavg / callBO WriteNPU RunBO Read
rms_gemv_rope6 stitched0.9 ms0.02 ms0.83 ms0.01 ms
o_gemv_ffn8 stitched3.7 ms0.02 ms3.64 ms0.01 ms
lm_head_gemv8 stitched13.6 ms0.01 ms13.50 ms0.03 ms
CPU host ops (decode side)
OpCount / tokenavgTotal / token
decode_attention_cpu160.28 ms4.5 ms
embed_lookup1~0 ms~0 ms
final_rms_norm10.07 ms0.07 ms
Total CPU / token18~4.6 ms

Wall-time check: NPU XRT per token = 16 × (0.9 + 3.7) + 13.6 = 87.2 ms · CPU = 4.6 ms · sum 91.8 ms ≈ observed 92 ms wall. Decode is overwhelmingly NPU-bound; the LM head GEMV alone is ~15% of the per-token cost.

Observation: across decode, BO Write is <1% — this is the payoff for pre-loading all weights into per-layer BOs (and marking them static_input_indices) during prepare_runtime. Without that, each layer would re-write its 116 MB of weights per token.

Part C — BO Write / NPU Run / BO Read explained

Each cache.load_and_run("kernel", backend, arg0, ..., argN) invocation is split into three timed segments:

1. BO Write — t_write_ms

For each input/intermediate argument that needs new bytes, the host does memcpy(numpy_data → BO.map()). Args marked static_input_indices (e.g. layer weights) skip this step on every call after prepare_runtime, so steady-state t_write_ms mainly reflects the dynamic inputs that change call-to-call (the input activation, RoPE LUT row, KV-cache slice, …).

What this measures in practice: host-to-DDR memcpy bandwidth for the dynamic inputs only. If you see this rise, either an argument lost its static_input_indices mark, or a normally-small dynamic input grew (e.g. a bigger seq_len).

2. NPU Run — t_kernel_ms

Wall time of xrt.run.start() + xrt.run.wait(). This is the NPU actually executing the multi-launch ELF: DDR → L2/L1 DMAs, AIE-tile compute, and L1/L2 → DDR DMAs of outputs. Host does nothing here except spin-wait the completion signal.

What this measures: real NPU hardware execution time for the ELF. All the multi-launch’s stitched sub-launches (e.g. RMSNorm + Q + K + V + RoPE_Q + RoPE_K inside rms_gemms_rope.elf) run sequentially on-device and are not separately resolved here — that’s by design, because production never dispatches them separately.

3. BO Read — t_read_ms

For each output argument, the host constructs a numpy view over the BO’s mapped memory using np.frombuffer(BO.map(), …). This is zero-copy — no memcpy — and consistently <0.1 ms. If t_read_ms ever climbs into the ms range, that signals an accidental copy was introduced (e.g. an .astype() on a large output).

How they sum

PhaseBO WriteNPU RunBO Read
Prefill (one full pass)~46 ms (4%)~1062 ms (95%)~5 ms (0%)
Decode (per token)~0.6 ms (1%)~86 ms (98%)~0.3 ms (0%)

Both phases are dominated by NPU Run — the host’s job is mostly to feed the right BOs and wait. Decode is even closer to pure-NPU because the per-token dynamic inputs are tiny (a single activation row vs. an entire sequence’s worth).

How to reproduce the numbers

cd programming_examples/llama32_1b

# One-time kernel compilation (~3-4 min, cached)
make compile

# Full profiling report (single run, instruct model)
make profile N_TOKENS=30 PROMPT="Explain photosynthesis in detail."

# Or with the base checkpoint
make profile MODEL=base N_TOKENS=30 PROMPT="Once upon a time"

The report is printed to stdout at the end of the run. To save a copy:

make profile 2>&1 | tee profile_$(date +%Y%m%d-%H%M%S).log

Numbers will jitter ±3-5% between runs (NPU power state, OS scheduling, etc); the breakdown structure is stable. make verify is the orthogonal gate that ensures the production code path producing these numbers is still numerically correct.


Companion: profile.md (textual perf summary, optimization history, vs IRON comparison) · IMPLEMENTATION_GUIDE.html B1 (same dataflow, no timing — shows just the structural picture).