Files
amd-strix-halo-toolboxes/docs/benchmarks.md
T
Donato Capitella a9618d881b - Corrected typo in WMMA (was spelt wrong as waam)
- Included rocm-7rc-rocwmma toolbox
- Included updated results from benchmarks including rocm 7rc with ROMWMMA and hipBLASLt
2025-08-10 13:21:06 +01:00

125 lines
4.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)
**Interactive results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
* Filter by model name, size, and quantization
* Select backends with or without **Flash Attention**
* Compare pp512 and tg128 side-by-side
* Winners are computed using an **error-aware tolerance rule** — if two results overlap within their ± error margins, both are counted as winners.
---
## Benchmark methodology
* **pp512** — prompt processing throughput (tokens/sec, prefill)
* **tg128** — token generation throughput (tokens/sec, interactive)
* Each backend tested twice per model:
* **Flash Attention OFF:** `-fa 0`
* **Flash Attention ON:** `-fa 1`
* Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
* All runs were built from the same `llama.cpp` commit for consistency.
**Tested backends:**
* Vulkan RADV
* Vulkan AMDVLK
* ROCm 6.4.2
* ROCm 6.4.2 + ROCWMMA
* ROCm 7.x (beta / RC)
* ROCm 7.x + ROCWMMA + hipBLASLt
**Note on ROCm 7 hipBLASLt:**
All ROCm 7 toolboxes ship with **hipBLASLt enabled by default** (`ROCBLAS_USE_HIPBLASLT=1`) because it improves performance and stability in most cases.
However, the benchmark script also includes runs with **hipBLASLt disabled** (`-hblt0`) so we can measure the impact directly.
---
## Running benchmarks
Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).
Run:
```bash
benchmark/run_benchmarks.sh
```
This will:
* Detect models
* Execute each backend twice (FA off / FA on)
* Save logs in `benchmark/results/`
Generate `results.json` for analysis:
```bash
python benchmark/parse_results_to_json.py
```
Optional: print summary statistics:
```bash
python benchmark/summarize_results.py
```
---
## Summary of current dataset (margin-aware, Flash Attention ON)
### Prompt Processing (pp512)
* **ROCm 7 RC + ROCWMMA + hipBLASLt** dominates — **15 wins/ties** out of 22 models.
* **Vulkan AMDVLK** is second most frequent winner (**4 wins/ties**) but cant load certain architectures due to the ≤ 2 GiB single-buffer limit.
* **Vulkan RADV** rarely wins in PP but is highly stable.
### Token Generation (tg128)
* **Vulkan RADV** leads — **13 wins/ties** out of 15 possible.
* **Vulkan AMDVLK** is a strong second, usually just behind RADV in TG.
* **ROCm 7 RC + ROCWMMA + hipBLASLt** generally lags in TG but still posts competitive results for some models.
---
### Placement counts (margin-aware, Flash Attention ON)
**Prompt Processing (pp512)**
| Backend | 1st | 2nd | 3rd |
| ------------------------------- | -----: | --: | --: |
| ROCm 7 RC + ROCWMMA + hipBLASLt | **15** | 2 | 1 |
| Vulkan AMDVLK | 4 | 5 | 1 |
| Vulkan RADV | 0 | 2 | 2 |
**Token Generation (tg128)**
| Backend | 1st | 2nd | 3rd |
| ------------------------------- | -----: | --: | --: |
| Vulkan RADV | **13** | 1 | 1 |
| Vulkan AMDVLK | 1 | 10 | 1 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 1 | 6 |
---
## Flash Attention
* **ROCm 7 RC + ROCWMMA + hipBLASLt** benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
* **Vulkan AMDVLK** and **Vulkan RADV** show mixed changes — some models improve with FA, others slow down slightly.
* FA should be enabled or disabled **per model/backend** based on measured performance.
---
## Recommendations
* **Fastest prompt processing:** ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
* **Fastest token generation:** Vulkan RADV (Flash Attention ON)
* **Balanced performance:** Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
* **BF16 models:** ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
* **Maximum stability:** Vulkan RADV
---
## Winner calculation
A backend is counted as a winner if its mean throughput is within the best backends pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.