- Included rocm-7rc-rocwmma toolbox - Included updated results from benchmarks including rocm 7rc with ROMWMMA and hipBLASLt
4.2 KiB
AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)
Interactive results: https://kyuz0.github.io/amd-strix-halo-toolboxes/
- Filter by model name, size, and quantization
- Select backends with or without Flash Attention
- Compare pp512 and tg128 side-by-side
- Winners are computed using an error-aware tolerance rule — if two results overlap within their ± error margins, both are counted as winners.
Benchmark methodology
-
pp512 — prompt processing throughput (tokens/sec, prefill)
-
tg128 — token generation throughput (tokens/sec, interactive)
-
Each backend tested twice per model:
- Flash Attention OFF:
-fa 0 - Flash Attention ON:
-fa 1
- Flash Attention OFF:
-
Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
-
All runs were built from the same
llama.cppcommit for consistency.
Tested backends:
- Vulkan RADV
- Vulkan AMDVLK
- ROCm 6.4.2
- ROCm 6.4.2 + ROCWMMA
- ROCm 7.x (beta / RC)
- ROCm 7.x + ROCWMMA + hipBLASLt
Note on ROCm 7 hipBLASLt:
All ROCm 7 toolboxes ship with hipBLASLt enabled by default (ROCBLAS_USE_HIPBLASLT=1) because it improves performance and stability in most cases.
However, the benchmark script also includes runs with hipBLASLt disabled (-hblt0) so we can measure the impact directly.
Running benchmarks
Place .gguf models in models/ (for sharded models, include only the first shard: *-00001-of-*.gguf).
Run:
benchmark/run_benchmarks.sh
This will:
- Detect models
- Execute each backend twice (FA off / FA on)
- Save logs in
benchmark/results/
Generate results.json for analysis:
python benchmark/parse_results_to_json.py
Optional: print summary statistics:
python benchmark/summarize_results.py
Summary of current dataset (margin-aware, Flash Attention ON)
Prompt Processing (pp512)
- ROCm 7 RC + ROCWMMA + hipBLASLt dominates — 15 wins/ties out of 22 models.
- Vulkan AMDVLK is second most frequent winner (4 wins/ties) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit.
- Vulkan RADV rarely wins in PP but is highly stable.
Token Generation (tg128)
- Vulkan RADV leads — 13 wins/ties out of 15 possible.
- Vulkan AMDVLK is a strong second, usually just behind RADV in TG.
- ROCm 7 RC + ROCWMMA + hipBLASLt generally lags in TG but still posts competitive results for some models.
Placement counts (margin-aware, Flash Attention ON)
Prompt Processing (pp512)
| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| ROCm 7 RC + ROCWMMA + hipBLASLt | 15 | 2 | 1 |
| Vulkan AMDVLK | 4 | 5 | 1 |
| Vulkan RADV | 0 | 2 | 2 |
Token Generation (tg128)
| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| Vulkan RADV | 13 | 1 | 1 |
| Vulkan AMDVLK | 1 | 10 | 1 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 1 | 6 |
Flash Attention
- ROCm 7 RC + ROCWMMA + hipBLASLt benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
- Vulkan AMDVLK and Vulkan RADV show mixed changes — some models improve with FA, others slow down slightly.
- FA should be enabled or disabled per model/backend based on measured performance.
Recommendations
- Fastest prompt processing: ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
- Fastest token generation: Vulkan RADV (Flash Attention ON)
- Balanced performance: Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
- BF16 models: ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
- Maximum stability: Vulkan RADV
Winner calculation
A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.