- Corrected typo in WMMA (previously misspelt as "waam")
- Included rocm-7rc-rocwmma toolbox
- Included updated benchmark results, including ROCm 7 RC with rocWMMA and hipBLASLt
2025-08-10 13:21:06 +01:00
parent 19fc866a9d
commit a9618d881b
619 changed files with 16448 additions and 4651 deletions
@@ -1,39 +1,44 @@
 # AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

-**Live results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
+**Interactive results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)

- Filter by model name, size, and quantization
- Select backends with or without **Flash Attention (FA)**
- Compare pp512 and tg128 side-by-side
- Winners are computed with an error-aware tolerance rule.
+* Filter by model name, size, and quantization
+* Select backends with or without **Flash Attention**
+* Compare pp512 and tg128 side-by-side
+* Winners are computed using an **error-aware tolerance rule** — if two results overlap within their ± error margins, both are counted as winners.

 ---

 ## Benchmark methodology

-* **pp512** — prompt processing throughput (tokens/sec)
-* **tg128** — text generation throughput (tokens/sec)
-* Each backend tested twice:
+* **pp512** — prompt processing throughput (tokens/sec, prefill)
+* **tg128** — token generation throughput (tokens/sec, interactive)
+* Each backend tested twice per model:

-  * FA off: `-fa 0`
-  * FA on:  `-fa 1`
-* Winners determined per model using pooled ± error from both results; multiple winners are possible.
+  * **Flash Attention OFF:** `-fa 0`
+  * **Flash Attention ON:**  `-fa 1`
+* Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible.
+* All runs were built from the same `llama.cpp` commit for consistency.

-Tested backends:
+**Tested backends:**

 * Vulkan RADV
 * Vulkan AMDVLK
 * ROCm 6.4.2
-* ROCm 6.4.2 + rocWMMA
-* ROCm 7.x (beta / rc)
+* ROCm 6.4.2 + ROCWMMA
+* ROCm 7.x (beta / RC)
+* ROCm 7.x + ROCWMMA + hipBLASLt

-All runs built from the same llama.cpp commit.
+**Note on ROCm 7 hipBLASLt:**
+All ROCm 7 toolboxes ship with **hipBLASLt enabled by default** (`ROCBLAS_USE_HIPBLASLT=1`) because it improves performance and stability in most cases.
+However, the benchmark script also includes runs with **hipBLASLt disabled** (`-hblt0`) so we can measure the impact directly.

 ---

 ## Running benchmarks

 Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).
+
 Run:

 ```bash
@@ -60,66 +65,60 @@ python benchmark/summarize_results.py

 ---

-## Summary of current dataset
+## Summary of current dataset (margin-aware, Flash Attention ON)

-### pp512 (prompt processing)
+### Prompt Processing (pp512)

-* **Vulkan AMDVLK** leads in average throughput and most frequent wins.
+* **ROCm 7 RC + ROCWMMA + hipBLASLt** dominates — **15 wins/ties** out of 22 models.
+* **Vulkan AMDVLK** is the second most frequent winner (**4 wins/ties**) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit.
+* **Vulkan RADV** rarely wins in PP but is highly stable.

-  * Winner count: AMDVLK (FA on) – 11 models; AMDVLK (FA off) – 3 models.
-  * Average t/s: AMDVLK (FA off) – 422.46; AMDVLK (FA on) – 388.68.
-* **Vulkan RADV** is competitive and shows wins on multiple models.
+### Token Generation (tg128)

-  * Winner count: RADV (FA on) – 3 models.
-  * Average t/s: RADV (FA on) – 279.95; RADV (FA off) – 273.54.
-* **ROCm 6.4.2 + rocWMMA** is strong in some cases.
-
-  * Winner count: 2 models (FA on).
-  * Average t/s: rocWMMA (FA on) – 335.44.
-* ROCm 7.x variants trail in pp512 averages.
-
-**Conclusion:** AMDVLK is generally fastest for prompt processing. RADV is close on certain models and is less prone to instability. ROCm+rocWMMA can match or exceed in select cases but is inconsistent.
+* **Vulkan RADV** leads — **13 wins/ties** out of 15 possible.
+* **Vulkan AMDVLK** is a strong second, usually just behind RADV in TG.
+* **ROCm 7 RC + ROCWMMA + hipBLASLt** generally lags in TG but still posts competitive results for some models.

 ---

-### tg128 (text generation)
+### Placement counts (margin-aware, Flash Attention ON)

-* **Vulkan RADV** shows the most frequent wins.
+**Prompt Processing (pp512)**

-  * Winner count: RADV (FA off) – 6 models; RADV (FA on) – 5 models.
-  * Average t/s: RADV (FA off) – 23.73; RADV (FA on) – 23.45.
-* **Vulkan AMDVLK** wins in some cases but is less dominant than in pp512.
+| Backend                         |    1st | 2nd | 3rd |
+| ------------------------------- | -----: | --: | --: |
+| ROCm 7 RC + ROCWMMA + hipBLASLt | **15** |   2 |   1 |
+| Vulkan AMDVLK                   |      4 |   5 |   1 |
+| Vulkan RADV                     |      0 |   2 |   2 |

-  * Winner count: AMDVLK (FA off) – 4 models.
-  * Average t/s: AMDVLK (FA off) – 25.91; AMDVLK (FA on) – 23.85.
-* **ROCm 6.4.2 + rocWMMA** achieves the highest average t/s.
+**Token Generation (tg128)**

-  * Average t/s: rocWMMA (FA on) – 32.51; rocWMMA (FA off) – 31.96.
-* ROCm 7.x and ROCm 6.4.2 also appear among winners in several models.
-
-**Conclusion:** RADV is the most consistent for text generation wins. ROCm+rocWMMA delivers the highest averages but with potential stability issues. AMDVLK is competitive but not consistently ahead.
+| Backend                         |    1st | 2nd | 3rd |
+| ------------------------------- | -----: | --: | --: |
+| Vulkan RADV                     | **13** |   1 |   1 |
+| Vulkan AMDVLK                   |      1 |  10 |   1 |
+| ROCm 7 RC + ROCWMMA + hipBLASLt |      1 |   1 |   6 |

 ---

-## Flash Attention (FA)
+## Flash Attention

-FA effects vary:
-
-* In pp512 averages, AMDVLK performs better without FA.
-* In tg128, the effect depends on backend and model.
-  FA should be treated as a per-model tuning parameter rather than enabled or disabled globally.
+* **ROCm 7 RC + ROCWMMA + hipBLASLt** benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded.
+* **Vulkan AMDVLK** and **Vulkan RADV** show mixed changes — some models improve with FA, others slow down slightly.
+* FA should be enabled or disabled **per model/backend** based on measured performance.

 ---

 ## Recommendations

-* **Stability priority:** Vulkan RADV.
-* **Maximum pp512 throughput:** Vulkan AMDVLK, validate per model.
-* **High tg128 averages:** ROCm 6.4.2 + rocWMMA, test stability.
-* **FA setting:** Evaluate per model/backend using side-by-side comparison.
+* **Fastest prompt processing:** ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON)
+* **Fastest token generation:** Vulkan RADV (Flash Attention ON)
+* **Balanced performance:** Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit)
+* **BF16 models:** ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON)
+* **Maximum stability:** Vulkan RADV

 ---

 ## Winner calculation

-A backend is a winner if its mean throughput is within the best backend’s pooled ± error margin for that model and test type.
+A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses.
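
The rule above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes the "pooled" margin is the root-sum-square of the two ± errors (the actual script may pool differently), and `pooledError`, `findWinners`, and the sample numbers are hypothetical.

```javascript
// Sketch of an error-aware winner rule.
// Assumption: the pooled margin is sqrt(errA^2 + errB^2).
function pooledError(errA, errB) {
  return Math.hypot(errA, errB);
}

// results: [{ backend, mean, err }] for one model and one test type.
function findWinners(results) {
  if (results.length === 0) return [];
  // The best backend is the one with the highest mean throughput.
  const best = results.reduce((a, b) => (b.mean > a.mean ? b : a));
  // Anything within the pooled margin of the best counts as a tie.
  return results
    .filter(r => best.mean - r.mean <= pooledError(best.err, r.err))
    .map(r => r.backend);
}

// Hypothetical tg128 numbers, for illustration only.
const tg128 = [
  { backend: 'vulkan_radv', mean: 24.0, err: 0.3 },
  { backend: 'vulkan_amdvlk', mean: 23.8, err: 0.2 },
  { backend: 'rocm7_rc-rocwmma', mean: 21.5, err: 0.1 },
];
console.log(findWinners(tg128)); // [ 'vulkan_radv', 'vulkan_amdvlk' ]
```

With these made-up inputs the 0.2 t/s gap between the two Vulkan backends is smaller than the pooled margin (about 0.36), so both are counted, while the ROCm result falls well outside it.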
@@ -4,7 +4,7 @@
 <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1" />
-    <title>Strix Halo — Model ↔ Backend Comparator</title>
+    <title>AMD Ryzen AI MAX+ 395 "Strix Halo" — llama.cpp Backend Performance Comparison</title>
    <style>
        :root {
            --bg: #ffffff;
@@ -357,7 +357,7 @@

 <body>
    <header>
-        <h1>Strix Halo — llama.cpp Backend Comparator</h1>
+        <h1>AMD Ryzen AI MAX+ 395 "Strix Halo" — llama.cpp Backend Performance Comparison</h1>
        <p class="muted">
            Compare model throughput across backends (pp512 & tg128).
            Repo: <a href="https://github.com/kyuz0/amd-strix-halo-toolboxes" target="_blank"
@@ -400,8 +400,8 @@
            <label for="faMode">Flash Attention</label>
            <select id="faMode">
                <option value="off">FA off</option>
-                <option value="on">FA on</option> <!-- default ON -->
-                <option value="both" selected>Both</option>
+                <option value="on" selected>FA on</option> <!-- default ON -->
+                <option value="both">Both</option>
            </select>
        </div>
    </div>
@@ -485,12 +485,14 @@
            const envs = [...new Set(allRuns.map(r => r.env))].sort();
            function envBox(env) {
                const roc = env.includes('rocwmma');
+                const hipBLASTt_off = env.includes('hblt0');
                const id = `env_${env.replace(/[^a-z0-9_-]/gi, '_')}`;
                return `
      <div class="colbox">
        <label for="${id}">
-          <input id="${id}" type="checkbox" data-env="${env}" ${/vulkan_amdvlk|vulkan_radv|rocm6_4_2/.test(env) ? 'checked' : ''}>
-          <span><strong>${env}</strong>${roc ? '<span class="badge roc">rocWMMA</span>' : ''}</span>
+         <input id="${id}" type="checkbox" data-env="${env}"
+            ${/(vulkan_amdvlk|vulkan_radv|rocm6_4_2|rocm7_rc-rocwmma)(?![-\w])/.test(env.trim()) ? 'checked' : ''}>
+          <span><strong>${env}</strong>${roc ? '<span class="badge roc">rocWMMA</span>' : ''}${hipBLASTt_off ? '<span class="badge roc">hipBLASLt OFF</span>' : ''}</span>
        </label>
      </div>`;
            }
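
The negative lookahead `(?![-\w])` in the default-checked test is easy to misread, so here is a small standalone check of what it selects. The pattern is copied from the hunk above; `isDefaultChecked` and the sample env names are hypothetical, modelled on the substrings the page looks for.

```javascript
// Same pattern as in envBox(): a listed name only counts when it is not
// followed by "-" or a word character, so suffixed variants stay unchecked.
const DEFAULT_ENVS = /(vulkan_amdvlk|vulkan_radv|rocm6_4_2|rocm7_rc-rocwmma)(?![-\w])/;

function isDefaultChecked(env) {
  return DEFAULT_ENVS.test(env.trim());
}

// Hypothetical env names, for illustration only.
const samples = [
  'vulkan_radv',
  'rocm6_4_2',
  'rocm6_4_2-rocwmma',
  'rocm7_rc-rocwmma',
  'rocm7_rc-rocwmma-hblt0',
];
for (const env of samples) {
  console.log(env.padEnd(24), isDefaultChecked(env));
}
```

One consequence worth noting: a name such as `rocm6_4_2-rocwmma`, which the earlier pattern without the lookahead would have matched, is no longer checked by default, and neither is an `-hblt0` variant.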
@@ -501,41 +503,67 @@
            quants.forEach(q => { const o = document.createElement('option'); o.value = (q === 'UNKNOWN' ? '' : q); o.textContent = q; quantSel.appendChild(o); });

            // --- Dual range slider setup ---
-            const sizes = Object.values(byModel).map(m => m.sizeB).filter(v => typeof v === 'number').sort((a, b) => a - b);
-            const MIN_B = sizes[0] ?? 0;
-            const MAX_B = sizes[sizes.length - 1] ?? 100;
-            [sizeLo, sizeHi].forEach(inp => { inp.min = MIN_B; inp.max = MAX_B; inp.step = 1; });
+            const sizes = Object.values(byModel)
+                .map(m => m.sizeB)
+                .filter(v => typeof v === 'number')
+                .sort((a, b) => a - b);
+
+            // force clean integer min/max
+            const MIN_B = Math.floor(sizes[0] ?? 0);
+            const MAX_B = Math.ceil(sizes[sizes.length - 1] ?? 100);
+
+            [sizeLo, sizeHi].forEach(inp => {
+                inp.min = MIN_B;
+                inp.max = MAX_B;
+                inp.step = 1; // integers only
+            });
+
            sizeLo.value = MIN_B;
            sizeHi.value = MAX_B;
+
            const filters = { sizeLo: MIN_B, sizeHi: MAX_B };

-            function fmtB(n) { return `${Number(n).toFixed(0)}B`; }
+            function fmtB(n) {
+                return `${Number(n).toFixed(0)}B`;
+            }
+
            function clampRange() {
                if (+sizeLo.value > +sizeHi.value) {
-                    const active = document.activeElement === sizeLo ? sizeLo : sizeHi;
-                    if (active === sizeLo) sizeHi.value = sizeLo.value;
-                    else sizeLo.value = sizeHi.value;
+                    if (document.activeElement === sizeLo) {
+                        sizeHi.value = sizeLo.value;
+                    } else {
+                        sizeLo.value = sizeHi.value;
+                    }
                }
            }
+
            function paintTrack() {
                const a = (+sizeLo.value - MIN_B) / (MAX_B - MIN_B) * 100;
                const b = (+sizeHi.value - MIN_B) / (MAX_B - MIN_B) * 100;
-                sizeTrack.style.background =
-                    `linear-gradient(to right,
-                #e5e5e5 ${a}%,
-                var(--accent) ${a}%,
-                var(--accent) ${b}%,
-                #e5e5e5 ${b}%)`;
+                sizeTrack.style.background = `
+        linear-gradient(to right,
+            #e5e5e5 ${a}%,
+            var(--accent) ${a}%,
+            var(--accent) ${b}%,
+            #e5e5e5 ${b}%)`;
            }
+
            function updateSizeUI(pushRender = true) {
                clampRange();
+
+                // Fix rounding: snap to min/max if slider is at extremes
+                if (Math.abs(sizeLo.value - MIN_B) < 0.0001) sizeLo.value = MIN_B;
+                if (Math.abs(sizeHi.value - MAX_B) < 0.0001) sizeHi.value = MAX_B;
+
                sizeLoVal.textContent = fmtB(sizeLo.value);
                sizeHiVal.textContent = fmtB(sizeHi.value);
                filters.sizeLo = +sizeLo.value;
                filters.sizeHi = +sizeHi.value;
+
                paintTrack();
                if (pushRender) render();
            }
+
            sizeLo.addEventListener('input', () => updateSizeUI(true));
            sizeHi.addEventListener('input', () => updateSizeUI(true));
            updateSizeUI(false);
@@ -544,9 +572,11 @@
            function matchesFilters(name) {
                const q = (search.value || '').toLowerCase();
                if (q && !name.toLowerCase().includes(q)) return false;
+
                const info = byModel[name];
                if (quantSel.value && (info.quant || '').toUpperCase() !== quantSel.value.toUpperCase()) return false;
-                if (info.sizeB != null && (info.sizeB < filters.sizeLo || info.sizeB > filters.sizeHi)) return false;
+
+                if (info.sizeB != null && (info.sizeB < filters.sizeLo - 1e-9 || info.sizeB > filters.sizeHi + 1e-9)) return false;
                return true;
            }
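
The slider and filter changes above work together: the bounds are widened to whole numbers and the comparison gets a small tolerance. A DOM-free sketch of that combination, assuming fractional sizes in billions of parameters; `sliderBounds`, `inRange`, and the sample sizes are hypothetical names and values, not part of the page.

```javascript
// Integer slider bounds, as in the "force clean integer min/max" change.
function sliderBounds(sizes) {
  const sorted = [...sizes].sort((a, b) => a - b);
  return {
    min: Math.floor(sorted[0] ?? 0),
    max: Math.ceil(sorted[sorted.length - 1] ?? 100),
  };
}

// Same comparison as matchesFilters(), including the 1e-9 tolerance.
function inRange(sizeB, lo, hi) {
  if (sizeB == null) return true; // unknown sizes are never filtered out
  return !(sizeB < lo - 1e-9 || sizeB > hi + 1e-9);
}

// Hypothetical model sizes, for illustration only.
const sizesB = [0.6, 8.03, 30.5];
const { min, max } = sliderBounds(sizesB);
console.log(min, max); // 0 31
console.log(sizesB.every(s => inRange(s, min, max))); // true
```

Because the bounds are rounded outward, the smallest and largest models remain inside the range when both handles sit at their extremes.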