Navigation

Overview What make profile reports Headline numbers (TTFT 1.28 s + 92 ms/tok) Part A — Prefill Dataflow + per-step timing Per-kernel + fine-grained tables Part B — Decode Dataflow + per-step timing Per-kernel + fine-grained tables Part C — Concepts BO Write / NPU Run / BO Read Wall-time attribution Reproduce How to regenerate these numbers Companion → profile.md → IMPLEMENTATION_GUIDE.html

Llama-3.2-1B Performance Profile (NPU2)

Per-step wall-time attribution of the production prefill + decode pipeline, end-to-end. Diagrams mirror the dataflow in IMPLEMENTATION_GUIDE.html Part B1; numbers are reproduced from a single make profile run on NPU2 (AMD Strix), seq_len=2048, MODEL=instruct.

What `make profile` reports

make profile runs the same code path as make run — the production prefill + decode functions, end to end, real HuggingFace weights — and just enables the otherwise-disabled Profiler instance that cache.load_and_run already records into. There is no profile-only code path; any change to the production functions is automatically reflected in the profile.

The report (printed at the end of the run) opens with an architecture-aware dataflow summary (matches this page’s SVG order) and then dumps generic detail tables per phase (prefill / decode):

Section	What it tells you
END-TO-END DATAFLOW (at the top)	Architecture-aware walkthrough: `tokenize → eos_pad → embed → 16×(rms_gemms_rope + flash_attn + o_ffn + kv_cache_extract) → final_norm → lm_head_gemv`. Each row tagged CPU/NPU/— with measured ms. Same ordering as the SVGs in Part A / Part B below. Also prints the one-time Preprocessing (prepare_runtime) wall as a reminder.
Wall-Time Attribution	How the total wall budget splits across NPU XRT calls, CPU host ops, and the layer-loop envelope (sanity check; remainder is python scheduling).
Per-Layer Execution	One row per layer for prefill; aggregated avg/min/max across tokens for decode.
NPU XRT Call Breakdown	Each multi-launch ELF’s wall time per invocation, plus call count. The granularity is one XRT run = one merged ELF (sub-launches inside the ELF stay opaque, since that’s how production dispatches them).
CPU Op Breakdown	Each tracked CPU host operation (tokenize, eos_pad, embed lookup, KV-cache extract, final RMSNorm, decode CPU attention).
Fine-Grained NPU Breakdown	Each XRT call further split into BO Write / NPU Run / BO Read (concept explained in Part C).
Per-Token Wall Trend (decode only)	Per-token layer-loop wall for token 1 / middle / last + min/max/avg + first→last drift. Lets you see whether per-token latency grows with KV-cache length (decode CPU attention is O(current_pos)). With a 2048-token prompt and 30 decode tokens the drift is typically <1%.

Headline numbers

Snapshot from the report (single run, instruct model, 30 decode tokens):

Metric	Wall	Notes
TTFT (time-to-first-token, prefill end-to-end)	~1.28 s	tokenize + EOS-pad + embed + 16×layer + final RMSNorm + LM head. Matches the vLLM / TGI / TRT-LLM TTFT metric (user-facing latency from request submit to first output token). 95% NPU-bound. Tokenize varies by prompt length; ~10 ms typical.
TPOT (per output token, steady-state decode)	~92 ms (10.8 tok/s)	16 layers × 4.95 ms each + 13.6 ms LM head + ~0.1 ms host wrappers. Slope vs token index is <1% over 30 tokens (KV cache grows by ~1.5% on a 2048-token prompt).
Preprocessing (one-time, `prepare_runtime`)	~7.6 s	Compile external kernels + pre-load weights into per-layer BOs. Happens once per process and is NOT included in TTFT.

CPU host op NPU XRT call (multi-launch ELF) FlashAttention (separate ELF, see B5)

Part A — Prefill (TTFT ~1.28 s)

One inference’s prefill phase: prompt → first generated token. Each box shows the step, where it runs, and the measured wall time. The 16 layers are identical; one iteration is shown in the “decoder block” container.

Prefill: per-kernel and fine-grained tables

NPU XRT calls (16 layer-invocations of each, plus 1 LM head)
ELF	Launches	avg / call	BO Write	NPU Run	BO Read	BO MB written
`rms_gemms_rope`	6 stitched	7.3 ms	0.5 ms	6.5 ms	0.1 ms	8.0 MB
`flash_attn` (separate ELF)	1	21.6 ms	1.3 ms	20.1 ms	0.1 ms	20.0 MB
`o_ffn`	8 stitched	41.0 ms	1.0 ms	39.8 ms	0.1 ms	16.0 MB
`lm_head_gemv` (prefill end)	8 stitched	13.6 ms	~0 ms	13.5 ms	~0 ms	~0 MB

CPU host ops (prefill side)
Op	Count	avg	Total
`tokenize`	1	~10 ms	~10 ms
`eos_pad`	1	~0 ms	~0 ms
`embed_lookup`	1	5.8 ms	5.8 ms
`kv_cache_extract`	16	1.1 ms	17.6 ms
`final_rms_norm`	1	3.1 ms	3.1 ms
Total CPU	20	—	~37 ms

Wall-time attribution check: NPU XRT 1119 ms (16 layer-invocations × 3 kernels + 1 LM head = 49 calls) + CPU host ~37 ms = ~1156 ms accounted, vs. TTFT ~1280 ms → ~125 ms unattributed python/numpy scheduling, mostly inside the layer loop.

Part B — Decode (per token ~92 ms)

Per-token decode step: takes the last produced token, returns the next. Diagram and numbers cover one token; the loop repeats until EOT. Each kernel reflects an avg over 30 decode tokens, 16 layers.

Decode: per-kernel and fine-grained tables

NPU XRT calls (avg over 30 decode tokens × 16 layers)
ELF	Launches	avg / call	BO Write	NPU Run	BO Read
`rms_gemv_rope`	6 stitched	0.9 ms	0.02 ms	0.83 ms	0.01 ms
`o_gemv_ffn`	8 stitched	3.7 ms	0.02 ms	3.64 ms	0.01 ms
`lm_head_gemv`	8 stitched	13.6 ms	0.01 ms	13.50 ms	0.03 ms

CPU host ops (decode side)
Op	Count / token	avg	Total / token
`decode_attention_cpu`	16	0.28 ms	4.5 ms
`embed_lookup`	1	~0 ms	~0 ms
`final_rms_norm`	1	0.07 ms	0.07 ms
Total CPU / token	18	—	~4.6 ms

Wall-time check: NPU XRT per token = 16 × (0.9 + 3.7) + 13.6 = 87.2 ms · CPU = 4.6 ms · sum 91.8 ms ≈ observed 92 ms wall. Decode is overwhelmingly NPU-bound; the LM head GEMV alone is ~15% of the per-token cost.

Observation: across decode, BO Write is <1% — this is the payoff for pre-loading all weights into per-layer BOs (and marking them static_input_indices) during prepare_runtime. Without that, each layer would re-write its 116 MB of weights per token.

Part C — BO Write / NPU Run / BO Read explained

Each cache.load_and_run("kernel", backend, arg0, ..., argN) invocation is split into three timed segments:

1. BO Write — `t_write_ms`

For each input/intermediate argument that needs new bytes, the host does memcpy(numpy_data → BO.map()). Args marked static_input_indices (e.g. layer weights) skip this step on every call after prepare_runtime, so steady-state t_write_ms mainly reflects the dynamic inputs that change call-to-call (the input activation, RoPE LUT row, KV-cache slice, …).

What this measures in practice: host-to-DDR memcpy bandwidth for the dynamic inputs only. If you see this rise, either an argument lost its static_input_indices mark, or a normally-small dynamic input grew (e.g. a bigger seq_len).

2. NPU Run — `t_kernel_ms`

Wall time of xrt.run.start() + xrt.run.wait(). This is the NPU actually executing the multi-launch ELF: DDR → L2/L1 DMAs, AIE-tile compute, and L1/L2 → DDR DMAs of outputs. Host does nothing here except spin-wait the completion signal.

What this measures: real NPU hardware execution time for the ELF. All the multi-launch’s stitched sub-launches (e.g. RMSNorm + Q + K + V + RoPE_Q + RoPE_K inside rms_gemms_rope.elf) run sequentially on-device and are not separately resolved here — that’s by design, because production never dispatches them separately.

3. BO Read — `t_read_ms`

For each output argument, the host constructs a numpy view over the BO’s mapped memory using np.frombuffer(BO.map(), …). This is zero-copy — no memcpy — and consistently <0.1 ms. If t_read_ms ever climbs into the ms range, that signals an accidental copy was introduced (e.g. an .astype() on a large output).

How they sum

Phase	BO Write	NPU Run	BO Read
Prefill (one full pass)	~46 ms (4%)	~1062 ms (95%)	~5 ms (0%)
Decode (per token)	~0.6 ms (1%)	~86 ms (98%)	~0.3 ms (0%)

Both phases are dominated by NPU Run — the host’s job is mostly to feed the right BOs and wait. Decode is even closer to pure-NPU because the per-token dynamic inputs are tiny (a single activation row vs. an entire sequence’s worth).

How to reproduce the numbers

cd programming_examples/llama32_1b

# One-time kernel compilation (~3-4 min, cached)
make compile

# Full profiling report (single run, instruct model)
make profile N_TOKENS=30 PROMPT="Explain photosynthesis in detail."

# Or with the base checkpoint
make profile MODEL=base N_TOKENS=30 PROMPT="Once upon a time"

The report is printed to stdout at the end of the run. To save a copy:

make profile 2>&1 | tee profile_$(date +%Y%m%d-%H%M%S).log

Numbers will jitter ±3-5% between runs (NPU power state, OS scheduling, etc); the breakdown structure is stable. make verify is the orthogonal gate that ensures the production code path producing these numbers is still numerically correct.

Companion: profile.md (textual perf summary, optimization history, vs IRON comparison) · IMPLEMENTATION_GUIDE.html B1 (same dataflow, no timing — shows just the structural picture).