Per-step wall-time attribution of the production prefill + decode pipeline, end-to-end. Diagrams mirror the dataflow in IMPLEMENTATION_GUIDE.html Part B1; numbers are reproduced from a single make profile run on NPU2 (AMD Strix), seq_len=2048, MODEL=instruct.
make profile reportsmake profile runs the same code path as make run — the production prefill + decode functions, end to end, real HuggingFace weights — and just enables the otherwise-disabled Profiler instance that cache.load_and_run already records into. There is no profile-only code path; any change to the production functions is automatically reflected in the profile.
The report (printed at the end of the run) opens with an architecture-aware dataflow summary (matches this page’s SVG order) and then dumps generic detail tables per phase (prefill / decode):
| Section | What it tells you |
|---|---|
| END-TO-END DATAFLOW (at the top) | Architecture-aware walkthrough: tokenize → eos_pad → embed → 16×(rms_gemms_rope + flash_attn + o_ffn + kv_cache_extract) → final_norm → lm_head_gemv. Each row tagged CPU/NPU/— with measured ms. Same ordering as the SVGs in Part A / Part B below. Also prints the one-time Preprocessing (prepare_runtime) wall as a reminder. |
| Wall-Time Attribution | How the total wall budget splits across NPU XRT calls, CPU host ops, and the layer-loop envelope (sanity check; remainder is python scheduling). |
| Per-Layer Execution | One row per layer for prefill; aggregated avg/min/max across tokens for decode. |
| NPU XRT Call Breakdown | Each multi-launch ELF’s wall time per invocation, plus call count. The granularity is one XRT run = one merged ELF (sub-launches inside the ELF stay opaque, since that’s how production dispatches them). |
| CPU Op Breakdown | Each tracked CPU host operation (tokenize, eos_pad, embed lookup, KV-cache extract, final RMSNorm, decode CPU attention). |
| Fine-Grained NPU Breakdown | Each XRT call further split into BO Write / NPU Run / BO Read (concept explained in Part C). |
| Per-Token Wall Trend (decode only) | Per-token layer-loop wall for token 1 / middle / last + min/max/avg + first→last drift. Lets you see whether per-token latency grows with KV-cache length (decode CPU attention is O(current_pos)). With a 2048-token prompt and 30 decode tokens the drift is typically <1%. |
Snapshot from the report (single run, instruct model, 30 decode tokens):
| Metric | Wall | Notes |
|---|---|---|
| TTFT (time-to-first-token, prefill end-to-end) | ~1.28 s | tokenize + EOS-pad + embed + 16×layer + final RMSNorm + LM head. Matches the vLLM / TGI / TRT-LLM TTFT metric (user-facing latency from request submit to first output token). 95% NPU-bound. Tokenize varies by prompt length; ~10 ms typical. |
| TPOT (per output token, steady-state decode) | ~92 ms (10.8 tok/s) | 16 layers × 4.95 ms each + 13.6 ms LM head + ~0.1 ms host wrappers. Slope vs token index is <1% over 30 tokens (KV cache grows by ~1.5% on a 2048-token prompt). |
Preprocessing (one-time, prepare_runtime) | ~7.6 s | Compile external kernels + pre-load weights into per-layer BOs. Happens once per process and is NOT included in TTFT. |
One inference’s prefill phase: prompt → first generated token. Each box shows the step, where it runs, and the measured wall time. The 16 layers are identical; one iteration is shown in the “decoder block” container.
| ELF | Launches | avg / call | BO Write | NPU Run | BO Read | BO MB written |
|---|---|---|---|---|---|---|
rms_gemms_rope | 6 stitched | 7.3 ms | 0.5 ms | 6.5 ms | 0.1 ms | 8.0 MB |
flash_attn (separate ELF) | 1 | 21.6 ms | 1.3 ms | 20.1 ms | 0.1 ms | 20.0 MB |
o_ffn | 8 stitched | 41.0 ms | 1.0 ms | 39.8 ms | 0.1 ms | 16.0 MB |
lm_head_gemv (prefill end) | 8 stitched | 13.6 ms | ~0 ms | 13.5 ms | ~0 ms | ~0 MB |
| Op | Count | avg | Total |
|---|---|---|---|
tokenize | 1 | ~10 ms | ~10 ms |
eos_pad | 1 | ~0 ms | ~0 ms |
embed_lookup | 1 | 5.8 ms | 5.8 ms |
kv_cache_extract | 16 | 1.1 ms | 17.6 ms |
final_rms_norm | 1 | 3.1 ms | 3.1 ms |
| Total CPU | 20 | — | ~37 ms |
Wall-time attribution check: NPU XRT 1119 ms (16 layer-invocations × 3 kernels + 1 LM head = 49 calls) + CPU host ~37 ms = ~1156 ms accounted, vs. TTFT ~1280 ms → ~125 ms unattributed python/numpy scheduling, mostly inside the layer loop.
Per-token decode step: takes the last produced token, returns the next. Diagram and numbers cover one token; the loop repeats until EOT. Each kernel reflects an avg over 30 decode tokens, 16 layers.
| ELF | Launches | avg / call | BO Write | NPU Run | BO Read |
|---|---|---|---|---|---|
rms_gemv_rope | 6 stitched | 0.9 ms | 0.02 ms | 0.83 ms | 0.01 ms |
o_gemv_ffn | 8 stitched | 3.7 ms | 0.02 ms | 3.64 ms | 0.01 ms |
lm_head_gemv | 8 stitched | 13.6 ms | 0.01 ms | 13.50 ms | 0.03 ms |
| Op | Count / token | avg | Total / token |
|---|---|---|---|
decode_attention_cpu | 16 | 0.28 ms | 4.5 ms |
embed_lookup | 1 | ~0 ms | ~0 ms |
final_rms_norm | 1 | 0.07 ms | 0.07 ms |
| Total CPU / token | 18 | — | ~4.6 ms |
Wall-time check: NPU XRT per token = 16 × (0.9 + 3.7) + 13.6 = 87.2 ms · CPU = 4.6 ms · sum 91.8 ms ≈ observed 92 ms wall. Decode is overwhelmingly NPU-bound; the LM head GEMV alone is ~15% of the per-token cost.
Observation: across decode, BO Write is <1% — this is the payoff for pre-loading all weights into per-layer BOs (and marking them static_input_indices) during prepare_runtime. Without that, each layer would re-write its 116 MB of weights per token.
Each cache.load_and_run("kernel", backend, arg0, ..., argN) invocation is split into three timed segments:
t_write_msFor each input/intermediate argument that needs new bytes, the host does memcpy(numpy_data → BO.map()). Args marked static_input_indices (e.g. layer weights) skip this step on every call after prepare_runtime, so steady-state t_write_ms mainly reflects the dynamic inputs that change call-to-call (the input activation, RoPE LUT row, KV-cache slice, …).
What this measures in practice: host-to-DDR memcpy bandwidth for the dynamic inputs only. If you see this rise, either an argument lost its static_input_indices mark, or a normally-small dynamic input grew (e.g. a bigger seq_len).
t_kernel_msWall time of xrt.run.start() + xrt.run.wait(). This is the NPU actually executing the multi-launch ELF: DDR → L2/L1 DMAs, AIE-tile compute, and L1/L2 → DDR DMAs of outputs. Host does nothing here except spin-wait the completion signal.
What this measures: real NPU hardware execution time for the ELF. All the multi-launch’s stitched sub-launches (e.g. RMSNorm + Q + K + V + RoPE_Q + RoPE_K inside rms_gemms_rope.elf) run sequentially on-device and are not separately resolved here — that’s by design, because production never dispatches them separately.
t_read_msFor each output argument, the host constructs a numpy view over the BO’s mapped memory using np.frombuffer(BO.map(), …). This is zero-copy — no memcpy — and consistently <0.1 ms. If t_read_ms ever climbs into the ms range, that signals an accidental copy was introduced (e.g. an .astype() on a large output).
| Phase | BO Write | NPU Run | BO Read |
|---|---|---|---|
| Prefill (one full pass) | ~46 ms (4%) | ~1062 ms (95%) | ~5 ms (0%) |
| Decode (per token) | ~0.6 ms (1%) | ~86 ms (98%) | ~0.3 ms (0%) |
Both phases are dominated by NPU Run — the host’s job is mostly to feed the right BOs and wait. Decode is even closer to pure-NPU because the per-token dynamic inputs are tiny (a single activation row vs. an entire sequence’s worth).
cd programming_examples/llama32_1b
# One-time kernel compilation (~3-4 min, cached)
make compile
# Full profiling report (single run, instruct model)
make profile N_TOKENS=30 PROMPT="Explain photosynthesis in detail."
# Or with the base checkpoint
make profile MODEL=base N_TOKENS=30 PROMPT="Once upon a time"
The report is printed to stdout at the end of the run. To save a copy:
make profile 2>&1 | tee profile_$(date +%Y%m%d-%H%M%S).log
Numbers will jitter ±3-5% between runs (NPU power state, OS scheduling, etc); the breakdown structure is stable. make verify is the orthogonal gate that ensures the production code path producing these numbers is still numerically correct.
Companion: profile.md (textual perf summary, optimization history, vs IRON comparison) ·
IMPLEMENTATION_GUIDE.html B1 (same dataflow, no timing — shows just the structural picture).