AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)
Live results: https://kyuz0.github.io/amd-strix-halo-toolboxes/
The results page lets you filter by model name, size, and quantization; select backends with or without Flash Attention (FA); and compare pp512 and tg128 side by side. Winners are computed with an error-aware tolerance rule.
Benchmark methodology
- pp512: prompt processing throughput for a 512-token prompt (tokens/sec)
- tg128: text generation throughput for 128 generated tokens (tokens/sec)
- Each backend is tested twice:
  - FA off: -fa 0
  - FA on: -fa 1
- Winners are determined per model using the pooled ± error from both results; multiple winners are possible.
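As a rough illustration of the two runs per backend, the sketch below builds one llama-bench command line per FA setting. The binary name `llama-bench` on the PATH, the helper name `bench_commands`, and the example model path are assumptions for illustration; the repository's own script may invoke the benchmark differently.

```python
from typing import List


def bench_commands(model_path: str, binary: str = "llama-bench") -> List[List[str]]:
    """Return one argv list per Flash Attention setting (off first, then on).

    `binary` is an assumed name/location for the llama.cpp benchmark tool.
    -p 512 and -n 128 request the pp512 and tg128 tests explicitly.
    """
    commands = []
    for fa in ("0", "1"):  # -fa 0 = FA off, -fa 1 = FA on
        commands.append(
            [binary, "-m", model_path, "-p", "512", "-n", "128", "-fa", fa]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical model path, printed rather than executed.
    for argv in bench_commands("models/example.gguf"):
        print(" ".join(argv))
```

Returning argv lists instead of shell strings keeps the commands safe to pass to `subprocess.run` without quoting issues.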
Tested backends:
- Vulkan RADV
- Vulkan AMDVLK
- ROCm 6.4.2
- ROCm 6.4.2 + rocWMMA
- ROCm 7.x (beta / rc)
All runs built from the same llama.cpp commit.
Running benchmarks
Place .gguf models in models/ (for sharded models, include only the first shard: *-00001-of-*.gguf).
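The first-shard pattern above could be applied when enumerating models roughly as follows. This is a minimal sketch, not the repository's actual detection logic; the function name `find_models` and the regular expression are assumptions based only on the `*-00001-of-*.gguf` naming shown above.

```python
import re
from pathlib import Path
from typing import List

# Matches the shard suffix of split GGUF files, e.g. "-00002-of-00005.gguf".
SHARD_RE = re.compile(r"-(\d+)-of-(\d+)\.gguf$")


def find_models(models_dir: str = "models") -> List[Path]:
    """List .gguf files, keeping unsharded models and first shards only."""
    selected = []
    for path in sorted(Path(models_dir).glob("*.gguf")):
        match = SHARD_RE.search(path.name)
        # Keep files with no shard suffix, or whose shard index is 1.
        if match is None or int(match.group(1)) == 1:
            selected.append(path)
    return selected


if __name__ == "__main__":
    for model in find_models():
        print(model)
```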
Run:
benchmark/run_benchmarks.sh
This will:
- Detect models
- Execute each backend twice (FA off / FA on)
- Save logs in benchmark/results/
Generate results.json for analysis:
python benchmark/parse_results_to_json.py
Optional: print summary statistics:
python benchmark/summarize_results.py
Summary of current dataset
pp512 (prompt processing)
- Vulkan AMDVLK leads in average throughput and most frequent wins.
  - Winner count: AMDVLK (FA on) – 11 models; AMDVLK (FA off) – 3 models.
  - Average t/s: AMDVLK (FA off) – 422.46; AMDVLK (FA on) – 388.68.
- Vulkan RADV is competitive and shows wins on multiple models.
  - Winner count: RADV (FA on) – 3 models.
  - Average t/s: RADV (FA on) – 279.95; RADV (FA off) – 273.54.
- ROCm 6.4.2 + rocWMMA is strong in some cases.
  - Winner count: 2 models (FA on).
  - Average t/s: rocWMMA (FA on) – 335.44.
- ROCm 7.x variants trail in pp512 averages.
Conclusion: AMDVLK is generally fastest for prompt processing. RADV is close on certain models and is less prone to instability. ROCm+rocWMMA can match or exceed AMDVLK in select cases but is inconsistent.
tg128 (text generation)
- Vulkan RADV shows the most frequent wins.
  - Winner count: RADV (FA off) – 6 models; RADV (FA on) – 5 models.
  - Average t/s: RADV (FA off) – 23.73; RADV (FA on) – 23.45.
- Vulkan AMDVLK wins in some cases but is less dominant than in pp512.
  - Winner count: AMDVLK (FA off) – 4 models.
  - Average t/s: AMDVLK (FA off) – 25.91; AMDVLK (FA on) – 23.85.
- ROCm 6.4.2 + rocWMMA achieves the highest average t/s.
  - Average t/s: rocWMMA (FA on) – 32.51; rocWMMA (FA off) – 31.96.
- ROCm 7.x and ROCm 6.4.2 also appear among winners in several models.
Conclusion: RADV is the most consistent for text generation wins. ROCm+rocWMMA delivers the highest averages but with potential stability issues. AMDVLK is competitive but not consistently ahead.
Flash Attention (FA)
FA effects vary:
- In pp512 averages, AMDVLK performs better without FA (422.46 vs 388.68 t/s), even though AMDVLK with FA on wins more individual models (11 vs 3).
- In tg128, the effect depends on backend and model.
FA should be treated as a per-model tuning parameter rather than enabled or disabled globally.
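Treating FA as a per-model setting can be as simple as keeping whichever flag value measured faster for each model on a given backend. The sketch below assumes a hypothetical input shape (model name mapped to throughput per `-fa` value) and ignores the error margin for brevity; the model names and numbers in the usage example are made up.

```python
from typing import Dict


def pick_fa_per_model(tps: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each model to the -fa value ("0" or "1") with the higher t/s.

    `tps` maps model name -> {"0": t/s with -fa 0, "1": t/s with -fa 1}.
    """
    return {model: max(by_fa, key=by_fa.get) for model, by_fa in tps.items()}


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the benchmark dataset.
    example = {
        "model-a": {"0": 10.0, "1": 12.0},
        "model-b": {"0": 30.0, "1": 25.0},
    }
    print(pick_fa_per_model(example))
```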
Recommendations
- Stability priority: Vulkan RADV.
- Maximum pp512 throughput: Vulkan AMDVLK, validate per model.
- High tg128 averages: ROCm 6.4.2 + rocWMMA, test stability.
- FA setting: Evaluate per model/backend using side-by-side comparison.
Winner calculation
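A backend counts as a winner for a given model and test type if its mean throughput falls within the pooled ± error margin of the best backend's mean. Because several backends can fall inside that margin, a model can have more than one winner.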
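The rule can be sketched as below. How the two ± errors are pooled is not spelled out here, so the root-sum-of-squares combination is an assumption, as are the function name `find_winners` and the input shape; the numbers in the usage example are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def find_winners(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return every backend whose mean is within the pooled error of the best.

    `results` maps a backend label to (mean t/s, ± error) for one model and
    one test type (pp512 or tg128).
    """
    best_label = max(results, key=lambda label: results[label][0])
    best_mean, best_err = results[best_label]
    winners = []
    for label, (mean, err) in results.items():
        # Assumption: pool the two errors as a root sum of squares.
        pooled = math.sqrt(best_err ** 2 + err ** 2)
        if best_mean - mean <= pooled:
            winners.append(label)
    return winners


if __name__ == "__main__":
    # Illustrative values only, not taken from the benchmark dataset.
    example = {
        "backend-a": (100.0, 1.0),
        "backend-b": (99.0, 1.0),
        "backend-c": (90.0, 1.0),
    }
    print(find_winners(example))
```

With these illustrative inputs, the two closest backends are both reported, which is how a single model can end up with multiple winners.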