Updated benchmarks, removed old toolboxes and results

2025-08-17 12:32:08 +01:00
parent 62e5080102
commit b71a37647f
130 changed files with 733 additions and 14425 deletions
@@ -1,124 +1,158 @@
 # AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

-**Interactive results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
+**Interactive results:** https://kyuz0.github.io/amd-strix-halo-toolboxes/

-* Filter by model name, size, and quantization
-* Select backends with or without **Flash Attention**
-* Compare pp512 and tg128 side-by-side
-* Winners are computed using an **error-aware tolerance rule** — if two results overlap within their ± error margins, both are counted as winners.
+## Table of Contents
+- [Benchmark methodology](#benchmark-methodology)
+- [Summary of current dataset (Flash Attention ON)](#summary-of-current-dataset-flash-attention-on)
+  - [Placement counts](#placement-counts)
+  - [Pairwise head-to-head wins](#pairwise-head-to-head-wins)
+  - [Average ranks](#average-ranks)
+- [Analyses by feature](#analyses-by-feature)
+  - [Impact of Flash Attention](#impact-of-flash-attention)
+  - [Impact of ROCWMMA](#impact-of-rocwmma)
+  - [Impact of hipBLASLt](#impact-of-hipblaslt)
+  - [Vulkan: AMDVLK vs RADV](#vulkan-amdvlk-vs-radv)
+- [Recommendations](#recommendations)
+- [Winner calculation](#winner-calculation)

 ---

 ## Benchmark methodology

-* **pp512** — prompt processing throughput (tokens/sec, prefill)
-* **tg128** — token generation throughput (tokens/sec, interactive)
-* Each backend tested twice per model:
+- **pp512** — prompt processing throughput (tokens/sec, prefill)
+- **tg128** — token generation throughput (tokens/sec, interactive)
+- Each backend is tested twice per model: with Flash Attention off (`-fa 0`) and on (`-fa 1`)
+- Winners per model/test are **margin-aware**; multiple winners are possible when the mean ± σ ranges overlap
+- All backends were built from the same `llama.cpp` commit for consistency

-  * **Flash Attention OFF:** `-fa 0`
-  * **Flash Attention ON:**  `-fa 1`
-* Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
-* All runs were built from the same `llama.cpp` commit for consistency.
+**Backends in this dataset:** ROCm 7 RC + ROCWMMA + hipBLASLt, ROCm 7 RC (hipBLASLt), ROCm 7 RC (hipBLASLt OFF), ROCm 7 RC + ROCWMMA (hipBLASLt OFF), ROCm 6.4.3 (hipBLASLt), ROCm 6.4.3 (hipBLASLt OFF), ROCm 6.4.3 + ROCWMMA (hipBLASLt), ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF), Vulkan AMDVLK, Vulkan RADV

-**Tested backends:**
-
-* Vulkan RADV
-* Vulkan AMDVLK
-* ROCm 6.4.2
-* ROCm 6.4.2 + ROCWMMA
-* ROCm 7.x (beta / RC)
-* ROCm 7.x + ROCWMMA + hipBLASLt
-
-**Note on ROCm 7 hipBLASLt:**
-All ROCm 7 toolboxes ship with **hipBLASLt enabled by default** (`ROCBLAS_USE_HIPBLASLT=1`) because it improves performance and stability in most cases.
-However, the benchmark script also includes runs with **hipBLASLt disabled** (`-hblt0`) so we can measure the impact directly.
+**ROCm 7 hipBLASLt policy:** Toolboxes ship with **hipBLASLt enabled** by default (`ROCBLAS_USE_HIPBLASLT=1`). The benchmark script also runs **hipBLASLt OFF** variants (`-hblt0`) to measure its effect.

 ---

-## Running benchmarks
-
-Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).
-
-Run:
-
-```bash
-benchmark/run_benchmarks.sh
-```
-
-This will:
-
-* Detect models
-* Execute each backend twice (FA off / FA on)
-* Save logs in `benchmark/results/`
-
-Generate `results.json` for analysis:
-
-```bash
-python benchmark/parse_results_to_json.py
-```
-
-Optional: print summary statistics:
-
-```bash
-python benchmark/summarize_results.py
-```
-
---
-
-## Summary of current dataset (margin-aware, Flash Attention ON)
-
-### Prompt Processing (pp512)
-
-* **ROCm 7 RC + ROCWMMA + hipBLASLt** dominates — **15 wins/ties** out of 22 models.
-* **Vulkan AMDVLK** is second most frequent winner (**4 wins/ties**) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit.
-* **Vulkan RADV** rarely wins in PP but is highly stable.
-
-### Token Generation (tg128)
-
-* **Vulkan RADV** leads — **13 wins/ties** out of 15 possible.
-* **Vulkan AMDVLK** is a strong second, usually just behind RADV in TG.
-* **ROCm 7 RC + ROCWMMA + hipBLASLt** generally lags in TG but still posts competitive results for some models.
-
---
-
-### Placement counts (margin-aware, Flash Attention ON)
+## Summary of current dataset (Flash Attention ON)

+### Placement counts
 **Prompt Processing (pp512)**
-
-| Backend                         |    1st | 2nd | 3rd |
-| ------------------------------- | -----: | --: | --: |
-| ROCm 7 RC + ROCWMMA + hipBLASLt | **15** |   2 |   1 |
-| Vulkan AMDVLK                   |      4 |   5 |   1 |
-| Vulkan RADV                     |      0 |   2 |   2 |
+| Backend | 1st | 2nd | 3rd |
+| --- | ---: | ---: | ---: |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 9 | 5 | 0 |
+| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 3 | 3 | 8 |
+| Vulkan AMDVLK | 3 | 0 | 2 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 8 | 4 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 0 | 0 | 1 |
+| Vulkan RADV | 0 | 0 | 1 |

 **Token Generation (tg128)**
+| Backend | 1st | 2nd | 3rd |
+| --- | ---: | ---: | ---: |
+| Vulkan RADV | 13 | 0 | 0 |
+| ROCm 6.4.3 (hipBLASLt) | 3 | 0 | 1 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1 | 4 | 3 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 1 | 2 | 4 |
+| ROCm 6.4.3 (hipBLASLt OFF) | 1 | 1 | 1 |
+| ROCm 7 RC (hipBLASLt OFF) | 1 | 1 | 1 |
+| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 1 | 1 | 1 |
+| ROCm 7 RC (hipBLASLt) | 1 | 0 | 4 |
+| Vulkan AMDVLK | 0 | 10 | 0 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | 0 | 1 | 2 |

-| Backend                         |    1st | 2nd | 3rd |
-| ------------------------------- | -----: | --: | --: |
-| Vulkan RADV                     | **13** |   1 |   1 |
-| Vulkan AMDVLK                   |      1 |  10 |   1 |
-| ROCm 7 RC + ROCWMMA + hipBLASLt |      1 |   1 |   6 |
+### Pairwise head-to-head wins
+For each model+quant on which both backends completed the run, this counts which backend was faster; equal results count as ties.
+| Comparison | Test | A wins | B wins | Ties | Total |
+| --- | --- | ---: | ---: | ---: | ---: |
+| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | pp512 | 11 | 4 | 0 | 15 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | tg128 | 4 | 10 | 1 | 15 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | pp512 | 14 | 2 | 0 | 16 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | tg128 | 3 | 13 | 0 | 16 |
+| Vulkan AMDVLK vs Vulkan RADV | pp512 | 13 | 2 | 0 | 15 |
+| Vulkan AMDVLK vs Vulkan RADV | tg128 | 2 | 13 | 0 | 15 |
+
+### Average ranks
+**Prompt Processing (pp512)**
+| Backend | Avg Rank (↓ is better) |
+| --- | ---: |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1.36 |
+| Vulkan AMDVLK | 1.8 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.23 |
+| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.36 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 3.0 |
+| Vulkan RADV | 3.0 |
+
+**Token Generation (tg128)**
+| Backend | Avg Rank (↓ is better) |
+| --- | ---: |
+| Vulkan RADV | 1.0 |
+| ROCm 6.4.3 (hipBLASLt) | 1.5 |
+| Vulkan AMDVLK | 2.0 |
+| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.0 |
+| ROCm 7 RC (hipBLASLt OFF) | 2.0 |
+| ROCm 6.4.3 (hipBLASLt OFF) | 2.0 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 2.25 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 2.43 |
+| ROCm 7 RC (hipBLASLt) | 2.6 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.67 |

 ---

-## Flash Attention
+## Analyses by feature

-* **ROCm 7 RC + ROCWMMA + hipBLASLt** benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
-* **Vulkan AMDVLK** and **Vulkan RADV** show mixed changes — some models improve with FA, others slow down slightly.
-* FA should be enabled or disabled **per model/backend** based on measured performance.
+### Impact of Flash Attention
+Median % change in throughput with **Flash Attention ON relative to OFF**, paired by model+quant, per backend:
+| Backend | pp512 Δ% (median, min..max, n) | tg128 Δ% (median, min..max, n) |
+| --- | --- | --- |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | 8.4% (3.6..65.6), n=14 | -1.1% (-8.2..-0.3), n=14 |
+| ROCm 7 RC (hipBLASLt) | -20.2% (-27.8..6.5), n=10 | -1.4% (-8.5..3.0), n=10 |
+| ROCm 7 RC (hipBLASLt OFF) | -20.4% (-28.2..-16.1), n=9 | -1.9% (-8.6..0.1), n=9 |
+| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 5.8% (1.3..24.1), n=16 | -1.1% (-7.4..15.1), n=16 |
+| ROCm 6.4.3 (hipBLASLt) | -19.5% (-25.7..-11.9), n=12 | -1.2% (-6.9..0.8), n=12 |
+| ROCm 6.4.3 (hipBLASLt OFF) | -10.3% (-22.3..3.6), n=9 | -1.6% (-11.1..0.0), n=9 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 10.9% (3.9..25.7), n=15 | -0.4% (-7.5..3.0), n=15 |
+| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 6.4% (1.8..12.3), n=10 | -0.6% (-6.5..2.3), n=10 |
+| Vulkan AMDVLK | 1.1% (-45.4..20.2), n=15 | -1.5% (-28.6..0.1), n=15 |
+| Vulkan RADV | 3.4% (-2.6..12.5), n=16 | 0.0% (-5.8..2.4), n=16 |
+
+### Impact of ROCWMMA
+| Context | Test | Compared Envs | Pairs | Median Δ% |
+| --- | --- | --- | ---: | ---: |
+| ROCm 7 RC (hipBLASLt) | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | 16.3% |
+| ROCm 7 RC (hipBLASLt) | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | -0.7% |
+| ROCm 7 RC (hipBLASLt OFF) | pp512 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | 14.6% |
+| ROCm 7 RC (hipBLASLt OFF) | tg128 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.7% |
+| ROCm 6.4.3 (hipBLASLt) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | 17.4% |
+| ROCm 6.4.3 (hipBLASLt) | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | -0.3% |
+| ROCm 6.4.3 (hipBLASLt OFF) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 10.2% |
+| ROCm 6.4.3 (hipBLASLt OFF) | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.3% |
+
+### Impact of hipBLASLt
+| Context | Test | Compared Envs | Pairs | Median Δ% |
+| --- | --- | --- | ---: | ---: |
+| ROCm 7 RC (no ROCWMMA) | pp512 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.2% |
+| ROCm 7 RC (no ROCWMMA) | tg128 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.1% |
+| ROCm 7 RC + ROCWMMA | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 1.4% |
+| ROCm 7 RC + ROCWMMA | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 0.0% |
+| ROCm 6.4.3 (no ROCWMMA) | pp512 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 155.5% |
+| ROCm 6.4.3 (no ROCWMMA) | tg128 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.0% |
+| ROCm 6.4.3 + ROCWMMA | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | 116.9% |
+| ROCm 6.4.3 + ROCWMMA | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | 0.0% |
+
+### Vulkan: AMDVLK vs RADV
+Head-to-head wins with Flash Attention ON (same counts as the pairwise table above):
+| Test | AMDVLK wins | RADV wins | Ties | Total |
+| --- | ---: | ---: | ---: | ---: |
+| pp512 | 13 | 2 | 0 | 15 |
+| tg128 | 2 | 13 | 0 | 15 |

 ---

 ## Recommendations
-
-* **Fastest prompt processing:** ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
-* **Fastest token generation:** Vulkan RADV (Flash Attention ON)
-* **Balanced performance:** Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
-* **BF16 models:** ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
-* **Maximum stability:** Vulkan RADV
+- **Fastest prompt processing:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (most 1st-place pp512 finishes with Flash Attention ON).
+- **Fastest token generation:** Vulkan RADV (most 1st-place tg128 finishes with Flash Attention ON).
+- **Balanced choice:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (consistently near the top across PP/TG).

 ---

 ## Winner calculation
-
-A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.
+A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This treats results within measurement noise as ties instead of false losses.
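
The winner rule in the new README text can be illustrated with a minimal Python sketch. It is not the repository's actual script: the function name `winners`, the input layout, and the backend names are hypothetical, and the pooling formula (root-sum-of-squares of the two error values) is an assumption, since the README does not say how the ± errors are pooled.

```python
from math import hypot


def winners(results):
    """Return the backends counted as winners for one model/test.

    results maps a backend name to (mean_tokens_per_sec, error).
    A backend wins if its mean is within the pooled error margin
    of the fastest backend's mean.
    """
    best_name = max(results, key=lambda name: results[name][0])
    best_mean, best_err = results[best_name]

    tied = []
    for name, (mean, err) in results.items():
        # Assumed pooling: combine both errors as a root-sum-of-squares.
        margin = hypot(best_err, err)
        if best_mean - mean <= margin:
            tied.append(name)
    return sorted(tied)


# Hypothetical numbers, not taken from the benchmark dataset.
example = {
    "backend_a": (100.0, 2.0),
    "backend_b": (98.0, 1.5),
    "backend_c": (90.0, 1.0),
}
print(winners(example))
```

With these made-up inputs, `backend_b` is 2.0 tokens/sec behind the best mean but inside the pooled margin of 2.5, so it shares first place, while `backend_c` falls well outside its margin. This is how one model can contribute a 1st place to more than one backend in the placement tables.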