Files

T

Donato Capitella a9618d881b - Corrected typo in WMMA (was spelt wrong as waam)

- Included rocm-7rc-rocwmma toolbox
- Included updated results from benchmarks including rocm 7rc with ROMWMMA and hipBLASLt

2025-08-10 13:21:06 +01:00

4.2 KiB

Raw Blame History

AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

Interactive results: https://kyuz0.github.io/amd-strix-halo-toolboxes/

Filter by model name, size, and quantization
Select backends with or without Flash Attention
Compare pp512 and tg128 side-by-side
Winners are computed using an error-aware tolerance rule — if two results overlap within their ± error margins, both are counted as winners.

Benchmark methodology

pp512 — prompt processing throughput (tokens/sec, prefill)
tg128 — token generation throughput (tokens/sec, interactive)
Each backend tested twice per model:
- Flash Attention OFF: -fa 0
- Flash Attention ON: -fa 1
Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
All runs were built from the same llama.cpp commit for consistency.

Tested backends:

Vulkan RADV
Vulkan AMDVLK
ROCm 6.4.2
ROCm 6.4.2 + ROCWMMA
ROCm 7.x (beta / RC)
ROCm 7.x + ROCWMMA + hipBLASLt

Note on ROCm 7 hipBLASLt: All ROCm 7 toolboxes ship with hipBLASLt enabled by default (ROCBLAS_USE_HIPBLASLT=1) because it improves performance and stability in most cases. However, the benchmark script also includes runs with hipBLASLt disabled (-hblt0) so we can measure the impact directly.

Running benchmarks

Place .gguf models in models/ (for sharded models, include only the first shard: *-00001-of-*.gguf).

Run:

benchmark/run_benchmarks.sh

This will:

Detect models
Execute each backend twice (FA off / FA on)
Save logs in benchmark/results/

Generate results.json for analysis:

python benchmark/parse_results_to_json.py

Optional: print summary statistics:

python benchmark/summarize_results.py

Summary of current dataset (margin-aware, Flash Attention ON)

Prompt Processing (pp512)

ROCm 7 RC + ROCWMMA + hipBLASLt dominates — 15 wins/ties out of 22 models.
Vulkan AMDVLK is second most frequent winner (4 wins/ties) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit.
Vulkan RADV rarely wins in PP but is highly stable.

Token Generation (tg128)

Vulkan RADV leads — 13 wins/ties out of 15 possible.
Vulkan AMDVLK is a strong second, usually just behind RADV in TG.
ROCm 7 RC + ROCWMMA + hipBLASLt generally lags in TG but still posts competitive results for some models.

Placement counts (margin-aware, Flash Attention ON)

Prompt Processing (pp512)

Backend	1st	2nd	3rd
ROCm 7 RC + ROCWMMA + hipBLASLt	15	2	1
Vulkan AMDVLK	4	5	1
Vulkan RADV	0	2	2

Token Generation (tg128)

Backend	1st	2nd	3rd
Vulkan RADV	13	1	1
Vulkan AMDVLK	1	10	1
ROCm 7 RC + ROCWMMA + hipBLASLt	1	1	6

Flash Attention

ROCm 7 RC + ROCWMMA + hipBLASLt benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
Vulkan AMDVLK and Vulkan RADV show mixed changes — some models improve with FA, others slow down slightly.
FA should be enabled or disabled per model/backend based on measured performance.

Recommendations

Fastest prompt processing: ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
Fastest token generation: Vulkan RADV (Flash Attention ON)
Balanced performance: Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
BF16 models: ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
Maximum stability: Vulkan RADV

Winner calculation

A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.

4.2 KiB Raw Blame History Unescape Escape