Donato Capitella a9618d881b - Corrected typo in WMMA (was spelt wrong as waam)
- Included rocm-7rc-rocwmma toolbox
- Included updated results from benchmarks including rocm 7rc with ROMWMMA and hipBLASLt
2025-08-10 13:21:06 +01:00


AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

Interactive results: https://kyuz0.github.io/amd-strix-halo-toolboxes/

  • Filter by model name, size, and quantization
  • Select backends with or without Flash Attention
  • Compare pp512 and tg128 side-by-side
  • Winners are computed using an error-aware tolerance rule — if two results overlap within their ± error margins, both are counted as winners.

Benchmark methodology

  • pp512 — prompt processing throughput (tokens/sec, prefill)

  • tg128 — token generation throughput (tokens/sec, interactive)

  • Each backend tested twice per model:

    • Flash Attention OFF: -fa 0
    • Flash Attention ON: -fa 1
  • Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.

  • All runs were built from the same llama.cpp commit for consistency.
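The test matrix above can be sketched as a small command builder. This is only an illustration, not the repository's actual script: it assumes the benchmark binary is llama.cpp's `llama-bench` available on `PATH`, and the model path is a placeholder. The `-p 512` and `-n 128` arguments correspond to the pp512 and tg128 tests.

```python
from itertools import product


def build_bench_commands(model_paths, bench_bin="llama-bench"):
    """Return one argv list per (model, Flash Attention setting) pair.

    bench_bin is an assumption: the real script may call a different
    binary or pass extra flags.
    """
    commands = []
    for model, fa in product(model_paths, (0, 1)):
        commands.append([
            bench_bin,
            "-m", model,     # model file to benchmark
            "-p", "512",     # prompt length -> pp512
            "-n", "128",     # generated tokens -> tg128
            "-fa", str(fa),  # Flash Attention OFF (0) or ON (1)
        ])
    return commands


# Placeholder model path, for illustration only.
cmds = build_bench_commands(["models/example.gguf"])
for argv in cmds:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run` inside the toolbox that provides the backend under test.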

Tested backends:

  • Vulkan RADV
  • Vulkan AMDVLK
  • ROCm 6.4.2
  • ROCm 6.4.2 + ROCWMMA
  • ROCm 7.x (beta / RC)
  • ROCm 7.x + ROCWMMA + hipBLASLt

Note on ROCm 7 hipBLASLt: All ROCm 7 toolboxes ship with hipBLASLt enabled by default (ROCBLAS_USE_HIPBLASLT=1) because it improves performance and stability in most cases. However, the benchmark script also includes runs with hipBLASLt disabled (-hblt0) so we can measure the impact directly.
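As a rough sketch of how the hipBLASLt toggle could be applied per run, the helper below builds an environment for a child process. The variable name `ROCBLAS_USE_HIPBLASLT` comes from the note above; using the value `0` to disable it, and the helper name itself, are assumptions rather than the benchmark script's actual mechanism.

```python
import os


def hipblaslt_env(enabled, base_env=None):
    """Return a copy of the environment with the hipBLASLt toggle set.

    Assumption: "1" enables and "0" disables hipBLASLt in rocBLAS.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROCBLAS_USE_HIPBLASLT"] = "1" if enabled else "0"
    return env


# Environment for a run with hipBLASLt disabled.
env_off = hipblaslt_env(False)
print(env_off["ROCBLAS_USE_HIPBLASLT"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` so that the setting affects only that benchmark run.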


Running benchmarks

Place .gguf models in models/ (for sharded models, include only the first shard: *-00001-of-*.gguf).
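The first-shard rule can be illustrated with a small filter. This is a sketch under the assumption that shards follow the usual five-digit `-00001-of-0000N.gguf` naming; the function name and file names are hypothetical and do not reflect how `run_benchmarks.sh` detects models.

```python
import re

# Matches the shard suffix, e.g. "-00002-of-00003.gguf" (assumed naming).
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def select_models(filenames):
    """Keep unsharded .gguf files and only the first shard of sharded ones."""
    selected = []
    for name in sorted(filenames):
        if not name.endswith(".gguf"):
            continue  # ignore non-model files
        match = SHARD_RE.search(name)
        if match is None or int(match.group(1)) == 1:
            selected.append(name)
    return selected


# Hypothetical directory listing.
files = [
    "a.gguf",
    "b-00001-of-00003.gguf",
    "b-00002-of-00003.gguf",
    "notes.txt",
]
print(select_models(files))
```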

Run:

benchmark/run_benchmarks.sh

This will:

  • Detect models
  • Execute each backend twice (FA off / FA on)
  • Save logs in benchmark/results/

Generate results.json for analysis:

python benchmark/parse_results_to_json.py

Optional: print summary statistics:

python benchmark/summarize_results.py

Summary of current dataset (margin-aware, Flash Attention ON)

Prompt Processing (pp512)

  • ROCm 7 RC + ROCWMMA + hipBLASLt dominates — 15 wins/ties out of 22 models.
  • Vulkan AMDVLK is the second most frequent winner (4 wins/ties) but can't load certain architectures due to the ≤ 2 GiB single-buffer limit.
  • Vulkan RADV rarely wins in PP but is highly stable.

Token Generation (tg128)

  • Vulkan RADV leads — 13 wins/ties out of 15 possible.
  • Vulkan AMDVLK is a strong second, usually just behind RADV in TG.
  • ROCm 7 RC + ROCWMMA + hipBLASLt generally lags in TG but still posts competitive results for some models.

Placement counts (margin-aware, Flash Attention ON)

Prompt Processing (pp512)

| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| ROCm 7 RC + ROCWMMA + hipBLASLt | 15 | 2 | 1 |
| Vulkan AMDVLK | 4 | 5 | 1 |
| Vulkan RADV | 0 | 2 | 2 |

Token Generation (tg128)

| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| Vulkan RADV | 13 | 1 | 1 |
| Vulkan AMDVLK | 1 | 10 | 1 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 1 | 6 |

Flash Attention

  • ROCm 7 RC + ROCWMMA + hipBLASLt benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
  • Vulkan AMDVLK and Vulkan RADV show mixed changes — some models improve with FA, others slow down slightly.
  • FA should be enabled or disabled per model/backend based on measured performance.
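One way to make the per-model decision is to compare the FA ON and FA OFF means against their combined error. The sketch below assumes the two ± errors are combined by root-sum-square, which is a choice made for this illustration and not necessarily what the project's scripts do; the numbers in the usage lines are made up.

```python
import math


def prefer_flash_attention(mean_off, err_off, mean_on, err_on):
    """Return "on", "off", or "either" for one model/backend/test.

    Assumption: a difference counts only if it exceeds the
    root-sum-square of the two error margins.
    """
    margin = math.hypot(err_off, err_on)
    diff = mean_on - mean_off
    if diff > margin:
        return "on"
    if diff < -margin:
        return "off"
    return "either"  # difference is within measurement noise


# Illustrative numbers only, not measured results.
print(prefer_flash_attention(100.0, 2.0, 110.0, 2.0))
print(prefer_flash_attention(50.0, 1.0, 50.5, 1.0))
```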

Recommendations

  • Fastest prompt processing: ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
  • Fastest token generation: Vulkan RADV (Flash Attention ON)
  • Balanced performance: Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
  • BF16 models: ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
  • Maximum stability: Vulkan RADV

Winner calculation

A backend is counted as a winner if its mean throughput is within the best backend's pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.
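A minimal sketch of this rule follows. How the error is pooled is not specified here, so the code assumes a root-sum-square of the best backend's error and the candidate's error; the backend names and numbers are placeholders, not benchmark data.

```python
import math


def find_winners(results):
    """Return backends whose mean is within the pooled margin of the best.

    results maps backend name -> (mean tokens/sec, ± error) for a single
    model and test type. The pooling formula is an assumption.
    """
    best_backend = max(results, key=lambda name: results[name][0])
    best_mean, best_err = results[best_backend]
    winners = []
    for backend, (mean, err) in results.items():
        pooled = math.hypot(best_err, err)  # assumed pooled ± error
        if best_mean - mean <= pooled:
            winners.append(backend)
    return sorted(winners)


# Placeholder data for illustration only.
example = {
    "backend_a": (100.0, 3.0),
    "backend_b": (98.0, 2.0),
    "backend_c": (80.0, 1.0),
}
print(find_winners(example))
```

Here `backend_b` trails the best mean by 2.0 tokens/sec, which is inside the pooled margin of about 3.6, so it ties with `backend_a`. `backend_c` falls well outside the margin and is not counted.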