amd-strix-halo-toolboxes/docs/benchmarks.md

# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

**Interactive results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)

* Filter by model name, size, and quantization
* Select backends with or without **Flash Attention**
* Compare pp512 and tg128 side-by-side
* Winners are computed using an **error-aware tolerance rule** — if two results overlap within their ± error margins, both are counted as winners.

---

## Benchmark methodology

* **pp512** — prompt processing throughput (tokens/sec, prefill)
* **tg128** — token generation throughput (tokens/sec, interactive)
* Each backend tested twice per model:

  * **Flash Attention OFF:** `-fa 0`
  * **Flash Attention ON:**  `-fa 1`
* Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
* All runs were built from the same `llama.cpp` commit for consistency.

**Tested backends:**

* Vulkan RADV
* Vulkan AMDVLK
* ROCm 6.4.2
* ROCm 6.4.2 + ROCWMMA
* ROCm 7.x (beta / RC)
* ROCm 7.x + ROCWMMA + hipBLASLt

**Note on ROCm 7 hipBLASLt:**
All ROCm 7 toolboxes ship with **hipBLASLt enabled by default** (`ROCBLAS_USE_HIPBLASLT=1`) because it improves performance and stability in most cases.
However, the benchmark script also includes runs with **hipBLASLt disabled** (`-hblt0`) so we can measure the impact directly.

---

## Running benchmarks

Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).

Run:

```bash
benchmark/run_benchmarks.sh
```

This will:

* Detect models
* Execute each backend twice (FA off / FA on)
* Save logs in `benchmark/results/`

Generate `results.json` for analysis:

```bash
python benchmark/parse_results_to_json.py
```

Optional: print summary statistics:

```bash
python benchmark/summarize_results.py
```

---

## Summary of current dataset (margin-aware, Flash Attention ON)

### Prompt Processing (pp512)

* **ROCm 7 RC + ROCWMMA + hipBLASLt** dominates — **15 wins/ties** out of 22 models.
* **Vulkan AMDVLK** is second most frequent winner (**4 wins/ties**) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit.
* **Vulkan RADV** rarely wins in PP but is highly stable.

### Token Generation (tg128)

* **Vulkan RADV** leads — **13 wins/ties** out of 15 possible.
* **Vulkan AMDVLK** is a strong second, usually just behind RADV in TG.
* **ROCm 7 RC + ROCWMMA + hipBLASLt** generally lags in TG but still posts competitive results for some models.

---

### Placement counts (margin-aware, Flash Attention ON)

**Prompt Processing (pp512)**

| Backend                         |    1st | 2nd | 3rd |
| ------------------------------- | -----: | --: | --: |
| ROCm 7 RC + ROCWMMA + hipBLASLt | **15** |   2 |   1 |
| Vulkan AMDVLK                   |      4 |   5 |   1 |
| Vulkan RADV                     |      0 |   2 |   2 |

**Token Generation (tg128)**

| Backend                         |    1st | 2nd | 3rd |
| ------------------------------- | -----: | --: | --: |
| Vulkan RADV                     | **13** |   1 |   1 |
| Vulkan AMDVLK                   |      1 |  10 |   1 |
| ROCm 7 RC + ROCWMMA + hipBLASLt |      1 |   1 |   6 |

---

## Flash Attention

* **ROCm 7 RC + ROCWMMA + hipBLASLt** benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
* **Vulkan AMDVLK** and **Vulkan RADV** show mixed changes — some models improve with FA, others slow down slightly.
* FA should be enabled or disabled **per model/backend** based on measured performance.

---

## Recommendations

* **Fastest prompt processing:** ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
* **Fastest token generation:** Vulkan RADV (Flash Attention ON)
* **Balanced performance:** Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
* **BF16 models:** ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
* **Maximum stability:** Vulkan RADV

---

## Winner calculation

A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.