# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

**Live results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)

Filter by model name, size, and quantization; select backends with or without **Flash Attention (FA)**; compare pp512 and tg128 side by side; winners are computed with an error-aware tolerance rule.

---

## Benchmark methodology

* **pp512**: prompt processing throughput for a 512-token prompt (tokens/sec)
* **tg128**: text generation throughput over 128 generated tokens (tokens/sec)
* Each backend is tested twice:
  * FA off: `-fa 0`
  * FA on: `-fa 1`
* Winners are determined per model using the pooled ± error from both results; multiple winners are possible.

Tested backends:

* Vulkan RADV
* Vulkan AMDVLK
* ROCm 6.4.2
* ROCm 6.4.2 + rocWMMA
* ROCm 7.x (beta / rc)

All runs were built from the same llama.cpp commit.

---

## Running benchmarks

Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).

Run:

```bash
benchmark/run_benchmarks.sh
```

This will:

* Detect models
* Execute each backend twice (FA off / FA on)
* Save logs in `benchmark/results/`

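
The detection and scheduling steps above can be pictured with a short Python sketch. It is not the contents of `run_benchmarks.sh`: the function names are hypothetical, and it assumes shards follow the `-00001-of-0000N.gguf` naming implied by the glob above. It only shows how non-first shards could be skipped and how the backend × FA run matrix could be enumerated.

```python
import re
from pathlib import Path

# Assumed shard suffix, e.g. "-00002-of-00005.gguf".
SHARD_RE = re.compile(r"-(\d+)-of-(\d+)\.gguf$")


def is_benchmarkable(name: str) -> bool:
    """True for single-file models and for the first shard of a sharded model."""
    if not name.endswith(".gguf"):
        return False
    match = SHARD_RE.search(name)
    return match is None or int(match.group(1)) == 1


def discover_models(models_dir: str = "models") -> list:
    """Return sorted model paths from models_dir (empty if the directory is missing)."""
    return sorted(p for p in Path(models_dir).glob("*.gguf") if is_benchmarkable(p.name))


def plan_runs(models, backends):
    """Yield (backend, model, fa) for every run: each backend with FA off, then FA on."""
    for model in models:
        for backend in backends:
            for fa in (0, 1):  # corresponds to `-fa 0` and `-fa 1`
                yield backend, model, fa
```

With five backends, each model therefore contributes ten runs.
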

Generate `results.json` for analysis:

```bash
python benchmark/parse_results_to_json.py
```
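
The parsing script itself is not reproduced here. As a rough illustration of the kind of extraction involved, the sketch below assumes the logs contain llama-bench's Markdown table rows, where the last two cells hold the test name and a `mean ± error` throughput value. The regular expression, the function name, and the returned dictionary keys are all hypothetical and do not describe the real `results.json` schema.

```python
import re

# Assumed row tail: "| pp512 | <mean> ± <err> |" (test name, then throughput in t/s).
ROW_RE = re.compile(r"\|\s*(pp\d+|tg\d+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|")


def parse_row(line: str):
    """Return {'test', 'mean', 'err'} for a result row, or None for any other line."""
    match = ROW_RE.search(line)
    if match is None:
        return None
    return {
        "test": match.group(1),
        "mean": float(match.group(2)),
        "err": float(match.group(3)),
    }
```

Header rows, separator rows, and unrelated log output return `None`, so a caller can filter a whole log file line by line.
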

Optional: print summary statistics:

```bash
python benchmark/summarize_results.py
```

---

## Summary of current dataset

### pp512 (prompt processing)

* **Vulkan AMDVLK** leads in both average throughput and number of wins.
  * Winner count: AMDVLK (FA on) – 11 models; AMDVLK (FA off) – 3 models.
  * Average t/s: AMDVLK (FA off) – 422.46; AMDVLK (FA on) – 388.68.
* **Vulkan RADV** wins on several models, although its averages are lower.
  * Winner count: RADV (FA on) – 3 models.
  * Average t/s: RADV (FA on) – 279.95; RADV (FA off) – 273.54.
* **ROCm 6.4.2 + rocWMMA** wins on a few models.
  * Winner count: 2 models (FA on).
  * Average t/s: rocWMMA (FA on) – 335.44.
* ROCm 7.x variants trail in pp512 averages.

**Conclusion:** AMDVLK is generally fastest for prompt processing. RADV is close on certain models and is less prone to instability. ROCm+rocWMMA can match or exceed AMDVLK in select cases but is inconsistent.

---

### tg128 (text generation)

* **Vulkan RADV** wins most often.
  * Winner count: RADV (FA off) – 6 models; RADV (FA on) – 5 models.
  * Average t/s: RADV (FA off) – 23.73; RADV (FA on) – 23.45.
* **Vulkan AMDVLK** wins in some cases but is less dominant than in pp512.
  * Winner count: AMDVLK (FA off) – 4 models.
  * Average t/s: AMDVLK (FA off) – 25.91; AMDVLK (FA on) – 23.85.
* **ROCm 6.4.2 + rocWMMA** achieves the highest average t/s.
  * Average t/s: rocWMMA (FA on) – 32.51; rocWMMA (FA off) – 31.96.
* ROCm 7.x and ROCm 6.4.2 also appear among the winners on several models.

**Conclusion:** RADV wins text generation most consistently. ROCm+rocWMMA delivers the highest averages but with potential stability issues. AMDVLK is competitive but not consistently ahead.

---

## Flash Attention (FA)

FA effects vary:

* In pp512 averages, AMDVLK performs better without FA (422.46 vs. 388.68 t/s), even though it wins more models with FA on (11 vs. 3).
* In tg128, the effect depends on backend and model.

FA should be treated as a per-model tuning parameter rather than enabled or disabled globally.

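
One way to make that per-model decision concrete is to compute the relative change from enabling FA and keep whichever setting is faster. The sketch below is only an illustration: the input layout (a dictionary keyed by model and backend, holding one throughput value per FA setting) is hypothetical and is not the schema of `results.json`. Applied to the AMDVLK pp512 averages quoted in the summary, it reports a drop of roughly 8% with FA on.

```python
def fa_effect_percent(off_tps: float, on_tps: float) -> float:
    """Relative throughput change from enabling FA, in percent (negative = slower)."""
    return (on_tps - off_tps) / off_tps * 100.0


def choose_fa(results: dict) -> dict:
    """Pick the faster FA setting for each (model, backend) key.

    `results` maps a key to {0: tps_with_fa_off, 1: tps_with_fa_on}.
    This layout is an assumption made for the example.
    """
    choices = {}
    for key, tps in results.items():
        best_fa = 1 if tps[1] > tps[0] else 0
        choices[key] = (best_fa, fa_effect_percent(tps[0], tps[1]))
    return choices


# AMDVLK pp512 averages from the summary, used as a single illustrative entry.
print(choose_fa({("average", "Vulkan AMDVLK"): {0: 422.46, 1: 388.68}}))
```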

---

## Recommendations

* **Stability priority:** Vulkan RADV.
* **Maximum pp512 throughput:** Vulkan AMDVLK; validate per model.
* **High tg128 averages:** ROCm 6.4.2 + rocWMMA; test stability.
* **FA setting:** Evaluate per model/backend using side-by-side comparison.

---

## Winner calculation

A backend counts as a winner for a given model and test type if its mean throughput falls within the pooled ± error margin of the best backend's mean. Several backends can therefore share a win.
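
The rule can be sketched in a few lines of Python. The exact pooling formula is not stated here, so this example assumes the two error values are combined in quadrature, which is a common choice for independent measurements; the function names and the input layout are likewise hypothetical.

```python
import math


def pooled_error(err_a: float, err_b: float) -> float:
    """Assumption: combine two independent ± errors in quadrature."""
    return math.hypot(err_a, err_b)


def winners(results: dict) -> list:
    """Return every backend whose mean is within the pooled error of the best mean.

    `results` maps backend name -> (mean_tps, err_tps) for one model and one test.
    """
    best_name = max(results, key=lambda name: results[name][0])
    best_mean, best_err = results[best_name]
    tied = []
    for name, (mean, err) in results.items():
        if best_mean - mean <= pooled_error(best_err, err):
            tied.append(name)
    return sorted(tied)
```

The best backend always qualifies, since its gap to itself is zero, and any backend whose shortfall is no larger than the combined error is listed alongside it.
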