Adding new benchmarks

2025-08-09 11:25:44 +01:00
parent 8972ef01ff
commit bc9483b75d
5 changed files with 312 additions and 395 deletions
@@ -149,31 +149,39 @@ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B
 `HF_HUB_ENABLE_HF_TRANSFER=1` uses a Rust-based package that enables faster downloads (install it from [PyPI](https://pypi.org/project/hf-transfer/)).
 ## 3. Performance Benchmarks (Key Results)
-Below are some results from real runs on Strix Halo hardware of `llama-bench`. For full tables and model-by-model breakdowns (including both prompt processing and token generation speeds), see [docs/benchmarks.md](docs/benchmarks.md).
+---
-| Model | Vulkan (AMDVLK) | Vulkan (RADV) | ROCm 6.4.2 | ROCm 7.0 Beta | ROCm 7.0 RC | 🏆 Best PP | 🏆 Best TG |
+## 🔍 Key Findings from Benchmarks
 |---|---|---|---|---|---|---|---|
 | **Gemma3 12B Q8_0** | 683 pp / 13.8 tg | 509 pp / 13.7 tg | 223 pp / 13.8 tg | 223 pp / 13.8 tg | 223 pp / 13.8 tg | 🏆 **AMDVLK** | 🏆 **AMDVLK** |
 | **Gemma3 27B BF16** | ⚠️ Load Error | 135 pp / 4.0 tg | 89 pp / 4.0 tg | 82 pp / 4.0 tg | 83 pp / 4.0 tg | 🏆 **RADV** | 🏆 **ROCm6.4.2** |
 | **Llama-4-Scout 17B Q8_0** | 239 pp / 12.2 tg | 146 pp / 12.3 tg | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **AMDVLK** | 🏆 **RADV** |
 | **Llama-4-Scout 17B Q4_K XL** | 209 pp / 20.1 tg | 133 pp / 20.0 tg | 133 pp / 17.3 tg | 134 pp / 17.4 tg | ⚠️ Runtime Error | 🏆 **AMDVLK** | 🏆 **AMDVLK** |
 | **Qwen3 30B BF16** | 91 pp / 8.0 tg | 71 pp / 7.3 tg | 158 pp / 22.9 tg | 151 pp / 23.8 tg | 155 pp / 23.1 tg | 🏆 **ROCm6.4.2** | 🏆 **ROCm7 Beta** |
 | **Qwen3-235B Q3_K XL** | 100 pp / 15.7 tg | 58 pp / 16.3 tg | 69 pp / 13.5 tg | ⚠️ GPU Hang | 75 pp / 13.6 tg | 🏆 **AMDVLK** | 🏆 **RADV** |
 | **GLM-4.5-Air-UD-Q4_K_XL** | 200 pp / 22.8 tg | 128 pp / 22.9 tg | ⚠️ Runtime Error | ⚠️ GPU Hang | 129 pp / 19.6 tg | 🏆 **AMDVLK** | 🏆 **RADV** |
 | **GLM-4.5-Air-UD-Q6_K_XL** | 221 pp / 16.5 tg | 127 pp / 16.8 tg | 125 pp / 15.3 tg | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **AMDVLK** | 🏆 **RADV** |
 | **gpt-oss-120b-mxfp4** | 486 pp / 48.1 tg | 239 pp / 48.9 tg | 353 pp / 43.6 tg | ⚠️ GPU Hang | 351 pp / 44.6 tg | 🏆 **AMDVLK** | 🏆 **RADV** |
 | **gpt-oss-20b-mxfp4** | 1206 pp / 68.9 tg | 647 pp / 69.8 tg | 581 pp / 64.3 tg | 584 pp / 64.4 tg | 584 pp / 64.4 tg | 🏆 **AMDVLK** | 🏆 **RADV** |
-* **pp = tokens/sec, prompt processing (pre-fill, max speed)**
+Representative LLMs were tested on **AMD Ryzen AI Max “Strix Halo”** across all supported backends, using identical model builds in [Llama.cpp](https://github.com/ggerganov/llama.cpp).
 * **tg = tokens/sec, generation (interactive, single token at a time)**
 * 🏆 denotes the winner
-**Takeaways:**
+PP = prompt processing (tokens/sec prefill), TG = token generation (tokens/sec interactive).
-* **Vulkan AMDVLK** is the fastest, when it works. There's currently an issue with memory allocation that causes some models to fail loading ([GitHub Issue 15054](https://github.com/ggml-org/llama.cpp/issues/15054)). 
+| Model | Vulkan (AMDVLK) | Vulkan (RADV) | ROCm 6.4.2 | ROCm 6.4.2 + ROCWMMA | ROCm 7.0 Beta | ROCm 7.0 RC | 🏆 Best PP | 🏆 Best TG |
-* **Vulkan RADV** is the most stable and compatible (recommended for most usage).
+|---|---|---|---|---|---|---|---|---|
-* **ROCm** is typically only superior on BF16 models, otherwise less stable and may crash or hang.
+| **Gemma3 12B Q8_0** | 677 pp / 14.0 tg | 503 pp / 13.8 tg | 223 pp / 13.8 tg | 223 pp / 13.9 tg | 223 pp / 13.9 tg | 222 pp / 13.9 tg | 🏆 **AMDVLK** | — |
 | **Gemma3 27B BF16** | — | 136 pp / 4.0 tg | 84 pp / 4.0 tg | 93 pp / 4.0 tg | 92 pp / 4.0 tg | 56 pp / 3.1 tg | 🏆 **RADV** | — |
 | **Llama-4-Scout 17B Q8_0** | 258 pp / 12.2 tg | 169 pp / 12.3 tg | 135 pp / 11.6 tg | — | — | — | 🏆 **AMDVLK** | — |
 | **Llama-4-Scout 17B Q4_K XL** | 218 pp / 20.0 tg | 152 pp / 20.0 tg | 138 pp / 17.4 tg | — | 139 pp / 17.6 tg | 124 pp / 17.6 tg | 🏆 **AMDVLK** | — |
 | **Qwen3 30B BF16** | 107 pp / 8.0 tg | 86 pp / 7.4 tg | 158 pp / 23.9 tg | 158 pp / 24.5 tg | 153 pp / 24.5 tg | 152 pp / 24.6 tg | 🏆 **ROCm6.4.2+ROCWMMA** | — |
 | **Qwen3-235B Q3_K XL** | 114 pp / 16.0 tg | 65 pp / 16.6 tg | 74 pp / 13.7 tg | — | — | — | 🏆 **AMDVLK** | — |
 | **GLM-4.5-Air-Q4_K_XL** | 201 pp / 22.8 tg | 128 pp / 22.9 tg | 130 pp / 19.4 tg | — | — | 130 pp / 19.8 tg | 🏆 **AMDVLK** | — |
 | **GLM-4.5-Air-Q6_K_XL** | 223 pp / 16.5 tg | 127 pp / 16.8 tg | 125 pp / 15.3 tg | 114 pp / 15.5 tg | 121 pp / 15.5 tg | 124 pp / 15.5 tg | 🏆 **AMDVLK** | — |
 | **gpt-oss-120b-mxfp4** | 487 pp / 48.1 tg | 240 pp / 49.0 tg | 353 pp / 44.1 tg | 354 pp / 45.0 tg | 355 pp / 45.0 tg | 353 pp / 45.1 tg | 🏆 **AMDVLK** | — |
 | **gpt-oss-20b-mxfp4** | 1205 pp / 68.8 tg | 649 pp / 69.9 tg | 583 pp / 64.5 tg | 581 pp / 64.5 tg | 584 pp / 64.4 tg | 582 pp / 64.5 tg | 🏆 **AMDVLK** | — |
 **Observations:**
 * **AMDVLK (Vulkan)** delivers the highest prompt processing speeds for most models, but it cannot allocate single buffers larger than 2 GiB, so some models fail to load.
 * **RADV (Vulkan)** is the most stable and compatible backend; typically slower than AMDVLK in PP but often competitive in TG.
 * **ROCm 6.4.2 + ROCWMMA** excels in BF16 workloads and can outperform Vulkan in certain cases, though ROCm stability issues remain.
 * ROCm 7.0 Beta/RC show similar performance to 6.4.2 without consistent gains.
 📄 Full per-model analysis: [docs/benchmarks.md](docs/benchmarks.md)
 🌐 Interactive exploration: [Live Benchmark Viewer](https://your-live-results-url)
 ## 4. Memory Planning & VRAM Estimator
@@ -1,108 +1,118 @@
 #!/usr/bin/env python3
-import re, glob, os, argparse
+import json
 from pathlib import Path
-PP_RE = re.compile(r"\|[^|]*\|[^|]*\|[^|]*\|[^|]*\|[^|]*\|\s*pp512\s*\|\s*([\d.]+)\s*±\s*([\d.]+)")
+# --- Config ---
-TG_RE = re.compile(r"\|[^|]*\|[^|]*\|[^|]*\|[^|]*\|[^|]*\|\s*tg128\s*\|\s*([\d.]+)\s*±\s*([\d.]+)")
+RESULTS_JSON = Path("../docs/results.json")
-LOAD_ERR = re.compile(r"failed to load model|Device memory allocation.*failed", re.IGNORECASE)
+
-HANG_ERR = re.compile(r"GPU Hang|HW Exception", re.IGNORECASE)
+ENV_ORDER = [
-GEN_ERR  = re.compile(r"error:|exit \d+", re.IGNORECASE)
+    "vulkan_amdvlk",
    "vulkan_radv",
    "rocm6_4_2",
    "rocm6_4_2-rocwmma",
    "rocm7_beta",
    "rocm7_rc"
 ]
 ENV_ORDER = ["vulkan_amdvlk","vulkan_radv","rocm6_4_2","rocm7_beta","rocm7_rc"]
 COL_NAMES = {
-    "vulkan_amdvlk":"Vulkan (AMDVLK)",
+    "vulkan_amdvlk": "Vulkan (AMDVLK)",
-    "vulkan_radv":"Vulkan (RADV)",
+    "vulkan_radv": "Vulkan (RADV)",
-    "rocm6_4_2":"ROCm 6.4.2",
+    "rocm6_4_2": "ROCm 6.4.2",
-    "rocm7_beta":"ROCm 7.0 Beta",
+    "rocm6_4_2-rocwmma": "ROCm 6.4.2 + ROCWMMA",
-    "rocm7_rc":"ROCm 7.0 RC",
+    "rocm7_beta": "ROCm 7.0 Beta",
    "rocm7_rc": "ROCm 7.0 RC"
 }
-WINNER = {
+
-    "vulkan_amdvlk":"AMDVLK",
+WINNER_LABELS = {
-    "vulkan_radv":"RADV",
+    "vulkan_amdvlk": "AMDVLK",
-    "rocm6_4_2":"ROCm6.4.2",
+    "vulkan_radv": "RADV",
-    "rocm7_beta":"ROCm7 Beta",
+    "rocm6_4_2": "ROCm6.4.2",
-    "rocm7_rc":"ROCm7 RC",
+    "rocm6_4_2-rocwmma": "ROCm6.4.2+ROCWMMA",
    "rocm7_beta": "ROCm7 Beta",
    "rocm7_rc": "ROCm7 RC"
 }
 DEFAULT_MODELS = [
-    ("Gemma3 12B Q8_0",                  "gemma-3-12b-it-UD-Q8_K_XL"),
+    ("Gemma3 12B Q8_0", "gemma-3-12b-it-UD-Q8_K_XL"),
-    ("Gemma3 27B BF16",                  "gemma-3-27b-it-BF16"),
+    ("Gemma3 27B BF16", "gemma-3-27b-it-BF16"),
-    ("Llama-4-Scout 17B Q8_0",           "Llama-4-Scout-17B-16E-Instruct-Q8_0"),
+    ("Llama-4-Scout 17B Q8_0", "Llama-4-Scout-17B-16E-Instruct-Q8_0"),
-    ("Llama-4-Scout 17B Q4_K XL",        "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL"),
+    ("Llama-4-Scout 17B Q4_K XL", "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL"),
-    ("Qwen3 30B BF16",                    "Qwen3-30B-A3B-BF16"),
+    ("Qwen3 30B BF16", "Qwen3-30B-A3B-BF16"),
-    ("Qwen3-235B Q3_K XL",               "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL"),
+    ("Qwen3-235B Q3_K XL", "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL"),
-    ("GLM-4.5-Air-UD-Q4_K_XL",           "GLM-4.5-Air-UD-Q4_K_XL"),
+    ("GLM-4.5-Air-Q4_K_XL", "GLM-4.5-Air-UD-Q4_K_XL"),
-    ("GLM-4.5-Air-UD-Q6_K_XL",           "GLM-4.5-Air-UD-Q6_K_XL"),
+    ("GLM-4.5-Air-Q6_K_XL", "GLM-4.5-Air-UD-Q6_K_XL"),
-    ("gpt-oss-120b-mxfp4",               "gpt-oss-120b-mxfp4"),
+    ("gpt-oss-120b-mxfp4", "gpt-oss-120b-mxfp4"),
-    ("gpt-oss-20b-mxfp4",                "gpt-oss-20b-mxfp4"),
+    ("gpt-oss-20b-mxfp4", "gpt-oss-20b-mxfp4"),
 ]
-CLEAN = lambda s: re.sub(r"-000\d+-of-000\d+", "", s)
+ERROR_LABELS = {
    "load": "⚠️ Load Error",
    "hang": "⚠️ GPU Hang",
    "runtime": "⚠️ Runtime Error"
 }
-def parse_logs():
+# --- Helpers ---
-    data = {}
+def load_results():
-    for p in glob.glob(os.path.join("results","*.log")):
+    data = json.loads(Path(RESULTS_JSON).read_text())
-        base = os.path.basename(p)[:-4]
+    return data["runs"]
        if "__" not in base:
            continue
        model_raw, env = base.split("__", 1)
        key = CLEAN(model_raw)
        t = open(p, errors="ignore").read()
        pp = PP_RE.search(t)
        tg = TG_RE.search(t)
        et = None
        if LOAD_ERR.search(t): et = "load"
        elif HANG_ERR.search(t): et = "hang"
        elif GEN_ERR.search(t) and not (pp and tg): et = "runtime"
        data.setdefault(key, {"pp512": {}, "tg128": {}})
        data[key]["pp512"][env] = {"mean": float(pp.group(1)) if (pp and et is None) else None,
                                   "error": et is not None, "etype": et}
        data[key]["tg128"][env] = {"mean": float(tg.group(1)) if (tg and et is None) else None,
                                   "error": et is not None, "etype": et}
    return data
-def best(env_data):
+def filter_runs(runs, model_prefix, env):
-    vals = {e:d["mean"] for e,d in env_data.items() if (not d["error"]) and d["mean"] is not None}
+    for r in runs:
-    return max(vals, key=vals.get) if vals else None
+        if r["model_clean"].startswith(model_prefix) and r["env"] == env:
-
+            return r
 def cell(pp, tg):
    if (pp is None) or (tg is None):
        return "—"
    if pp["error"] or tg["error"]:
        m = pp["etype"] or tg["etype"] or "runtime"
        return {"load":"⚠️ Load Error","hang":"⚠️ GPU Hang","runtime":"⚠️ Runtime Error"}.get(m, "⚠️ Error")
    return f"{int(round(pp['mean']))} pp / {tg['mean']:.1f} tg"
 def find_key(keys, prefix):
    for k in keys:
        if k.startswith(prefix):
            return k
    return None
 def format_cell(pp_run, tg_run):
    if not pp_run or not tg_run:
        return "—"
    if pp_run["error"] or tg_run["error"]:
        return ERROR_LABELS.get(pp_run["error_type"] or tg_run["error_type"], "⚠️ Error")
    if pp_run["tps_mean"] is None or tg_run["tps_mean"] is None:
        return "—"
    return f"{int(round(pp_run['tps_mean']))} pp / {tg_run['tps_mean']:.1f} tg"
 def find_winner(runs, model_prefix, bench_type):
    vals = {}
    for env in ENV_ORDER:
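        # Match on the test type too: filter_runs() returns the first run for the
        # model/env pair regardless of test, so runs of the other test can be missed.
        r = next((x for x in runs if x["model_clean"].startswith(model_prefix)
                  and x["env"] == env and x["test"] == bench_type), None)
        if r and not r["error"] and r["tps_mean"] is not None: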
            vals[env] = r["tps_mean"]
    if not vals:
        return None
    return max(vals, key=vals.get)
 # --- Main ---
 def main():
-    ap = argparse.ArgumentParser()
+    runs = load_results()
    ap.add_argument("models", nargs="*", help="Optional model prefixes to include")
    args = ap.parse_args()
    data = parse_logs()
    want = [(m,m) for m in args.models] if args.models else DEFAULT_MODELS
-    header = ["Model"] + [COL_NAMES[e] for e in ENV_ORDER] + ["🏆 Best PP","🏆 Best TG"]
+    header = ["Model"] + [COL_NAMES[e] for e in ENV_ORDER] + ["🏆 Best PP", "🏆 Best TG"]
    print("| " + " | ".join(header) + " |")
-    print("|" + "|".join(["---"]*len(header)) + "|")
+    print("|" + "|".join(["---"] * len(header)) + "|")
-    for disp, patt in want:
+    for disp_name, model_prefix in DEFAULT_MODELS:
-        key = find_key(data.keys(), patt)
+        row = [f"**{disp_name}**"]
        row = [f"**{disp}**"]
        if not key:
            row += ["—"]*len(ENV_ORDER) + ["—","—"]
            print("| " + " | ".join(row) + " |")
            continue
        ppd, tgd = data[key]["pp512"], data[key]["tg128"]
        for env in ENV_ORDER:
-            row.append(cell(ppd.get(env), tgd.get(env)))
+            pp_run = filter_runs(runs, model_prefix, env)
-        bpp, btg = best(ppd), best(tgd)
+            tg_run = filter_runs(runs, model_prefix, env)
-        row.append(f"🏆 **{WINNER[bpp]}**" if bpp else "—")
+            pp = None
-        row.append(f"🏆 **{WINNER[btg]}**" if btg else "—")
+            tg = None
            if pp_run and pp_run["test"] == "pp512":
                pp = pp_run
            if tg_run and tg_run["test"] == "tg128":
                tg = tg_run
            # match pp and tg runs by env
            pp_env_run = next((r for r in runs if r["model_clean"].startswith(model_prefix) and r["env"] == env and r["test"] == "pp512"), None)
            tg_env_run = next((r for r in runs if r["model_clean"].startswith(model_prefix) and r["env"] == env and r["test"] == "tg128"), None)
            row.append(format_cell(pp_env_run, tg_env_run))
        bpp = find_winner(runs, model_prefix, "pp512")
        btg = find_winner(runs, model_prefix, "tg128")
        row.append(f"🏆 **{WINNER_LABELS[bpp]}**" if bpp else "—")
        row.append(f"🏆 **{WINNER_LABELS[btg]}**" if btg else "—")
        print("| " + " | ".join(row) + " |")
    print("\nFull interactive results: [Live Benchmark Viewer](https://your-live-results-url)")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,77 @@
 #!/usr/bin/env python3
 import json
 from collections import defaultdict
 from statistics import mean
 # CONFIG
 TOLERANCE_MULTIPLIER = 1.0  # multiplier for std dev to count as "within best"
 def within_tolerance(best_mean, best_std, contender_mean, contender_std):
    # A contender counts as a winner when its mean is no more than
    # TOLERANCE_MULTIPLIER * best_std below best_mean (contender_std is currently unused).
    return contender_mean >= (best_mean - TOLERANCE_MULTIPLIER * best_std)
 # --- Load data ---
 with open("../docs/results.json", encoding="utf-8") as f:
    data = json.load(f)
 runs = data["runs"]
 # --- Group by benchmark type ---
 benchmarks = defaultdict(list)
 for r in runs:
    if r["error"]:
        continue
    if r["test"] in ("pp512", "tg128"):
        benchmarks[r["test"]].append(r)
 summary = {}
 for bench_type, results in benchmarks.items():
    winners_count = defaultdict(int)
    backend_perf = defaultdict(list)
    # Group results by model
    models = defaultdict(list)
    for r in results:
        models[r["model_clean"]].append(r)
    for model, entries in models.items():
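        # Find the best mean, ignoring runs that have no throughput value
        scored = [e for e in entries if e["tps_mean"] is not None]
        if not scored:
            continue
        best_entry = max(scored, key=lambda x: x["tps_mean"])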
        best_mean = best_entry["tps_mean"]
        best_std = best_entry["tps_std"] or 0
        # Find all within tolerance
        for e in entries:
            if e["tps_mean"] is None:
                continue
            if within_tolerance(best_mean, best_std, e["tps_mean"], e["tps_std"] or 0):
                label = f"{e['env']}{' (FA on)' if e['fa'] else ' (FA off)'}"
                winners_count[label] += 1
        # Collect performance data for average TPS
        for e in entries:
            label = f"{e['env']}{' (FA on)' if e['fa'] else ' (FA off)'}"
            if e["tps_mean"] is not None:
                backend_perf[label].append(e["tps_mean"])
    # Store summary
    summary[bench_type] = {
        "winners": dict(sorted(winners_count.items(), key=lambda x: -x[1])),
        "avg_perf": {k: round(mean(v), 2) for k, v in backend_perf.items()},
        "total_models": len(models),
    }
 # --- Print human-readable analysis ---
 for bench_type in ("pp512", "tg128"):
    if bench_type not in summary:
        continue
    print(f"\n=== {bench_type.upper()} ===")
    print(f"Models tested: {summary[bench_type]['total_models']}")
    print("Winner counts (within tolerance):")
    for backend, count in summary[bench_type]["winners"].items():
        print(f"  {backend}: {count} models")
    print("Average throughput (tokens/sec):")
    for backend, avg in sorted(summary[bench_type]["avg_perf"].items(), key=lambda x: -x[1]):
        print(f"  {backend}: {avg}")
@@ -1,147 +0,0 @@
 #!/usr/bin/env python3
 """
 Script to remove host-related entries from log files and delete host files.
 """
 import os
 import glob
 import shutil
 from pathlib import Path
 def remove_host_entries_from_log(log_file):
    """
    Remove all entries that start with '[host]' from the log file.
    Each entry is separated by empty lines.
    """
    if not os.path.exists(log_file):
        print(f"Log file {log_file} not found!")
        return False
    # Create backup
    backup_file = f"{log_file}.backup"
    shutil.copy2(log_file, backup_file)
    print(f"Created backup: {backup_file}")
    with open(log_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    filtered_lines = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # Check if this line starts a host entry
        if line.startswith('▶ [host]'):
            # Skip this entry by finding the next empty line or end of file
            i += 1
            while i < len(lines) and lines[i].strip() != '':
                i += 1
            # Skip the empty line too if we found one
            if i < len(lines) and lines[i].strip() == '':
                i += 1
        else:
            # Keep this line
            filtered_lines.append(lines[i])
            i += 1
    # Write the filtered content back
    with open(log_file, 'w', encoding='utf-8') as f:
        f.writelines(filtered_lines)
    print(f"Removed host entries from {log_file}")
    return True
 def remove_host_files():
    """Remove all files with 'host' in their filename."""
    host_files = glob.glob('*host*')
    if not host_files:
        print("No files with 'host' in filename found.")
        return
    print("Files to be removed:")
    for file in host_files:
        print(f"  - {file}")
    for file in host_files:
        try:
            os.remove(file)
            print(f"Removed: {file}")
        except OSError as e:
            print(f"Error removing {file}: {e}")
 def preview_host_entries(log_file):
    """Preview what host entries would be removed."""
    if not os.path.exists(log_file):
        print(f"Log file {log_file} not found!")
        return
    with open(log_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    print("Host entries that would be removed:")
    print("-" * 50)
    i = 0
    entry_count = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith('▶ [host]'):
            entry_count += 1
            print(f"Entry {entry_count}:")
            # Print this entry until we hit an empty line
            while i < len(lines) and lines[i].strip() != '':
                print(lines[i].rstrip())
                i += 1
            print()  # Add empty line after entry
        else:
            i += 1
    print(f"Total host entries found: {entry_count}")
 def main():
    log_file = "run_benchmarks.log"  # Change this to your actual log file name
    print("Host Entry and File Removal Script")
    print("=" * 40)
    # Preview what would be removed
    preview_host_entries(log_file)
    # Show files that would be removed
    host_files = glob.glob('*host*')
    if host_files:
        print(f"\nFiles with 'host' in filename ({len(host_files)} found):")
        for file in host_files:
            print(f"  - {file}")
    print("\nThis script will:")
    print(f"1. Remove host entries from log file: {log_file}")
    print("2. Remove all files with 'host' in the filename")
    response = input("\nContinue? (y/N): ").strip().lower()
    if response == 'y' or response == 'yes':
        # Remove host entries from log
        if remove_host_entries_from_log(log_file):
            print("✓ Host entries removed from log file")
        # Remove host files
        remove_host_files()
        print("✓ Host files removed")
        print("\nDone!")
    else:
        print("Aborted.")
 if __name__ == "__main__":
    main()
@@ -1,152 +1,121 @@
-# 1. Benchmark Results: Strix Halo Llama.cpp Toolboxes
+# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)
-This document presents comprehensive benchmarks of all supported Llama.cpp containers and backends, focusing on real GPU workloads and model loading times on the AMD Ryzen AI Max 395 "Strix Halo" iGPU.
+**Live results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
-
+Filter by model name, size, and quantization; select backends with or without **Flash Attention (FA)**; compare pp512 and tg128 side-by-side; winners are computed with an error-aware tolerance rule.
 ## 2. Benchmark Methodology
 Benchmarks cover both end-to-end performance (prompt processing and text generation) and model load times. Model load time benchmarks (llama-cli) are averaged over three runs per environment; inference benchmarks (llama-bench) use default tool settings.
 Backends tested:
 * **Vulkan RADV** (Mesa's open-source Vulkan driver)
 * **Vulkan AMDVLK** (AMD's official open-source Vulkan driver)
 * **ROCm 6.4.2** (AMD's compute stack)
 * **ROCm 7.0 beta** (AMD's compute stack)
 * **ROCm 7.0 rc** (AMD's compute stack)
 ### 2.1. Llama.cpp Inference Benchmarks
 #### 2.1.1. Script: `run_benchmarks.sh`
 This script runs each model through every container/backend using the `llama-bench` tool.
 ##### Command Used
 ```bash
 llama-bench -ngl 99 -mmp 0 -m /path/to/model.gguf
 ```
 * `-ngl 99` — Offload up to 99 layers to the GPU (in practice, every layer of the tested models)
 * `-mmp 0` — Disable mmap (required for ROCm to avoid extremely slow loads for models >64GB, and also improves speed for Vulkan drivers)
 * `-m` — Path to the GGUF model file
 Script location: `benchmark/run_benchmarks.sh`
 Benchmark logs: `benchmark/results/`
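As a rough illustration of how the command above could be assembled for each model, the sketch below builds the argument list in Python. The helper name `build_bench_cmd` is hypothetical and not part of the repository; only the flags documented above (`-ngl`, `-mmp`, `-m`) are used.

```python
def build_bench_cmd(model_path, ngl=99, mmap=False):
    """Return the llama-bench argument list for one model (hypothetical helper)."""
    return [
        "llama-bench",
        "-ngl", str(ngl),               # number of layers to offload to the GPU
        "-mmp", "1" if mmap else "0",   # 0 disables mmap, as in the command above
        "-m", str(model_path),          # path to the GGUF model file
    ]


if __name__ == "__main__":
    # Prints the same command line that is shown in this section.
    print(" ".join(build_bench_cmd("/path/to/model.gguf")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with model paths that contain spaces.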
 ##### Model Location
 All scripts expect models in the `models/` directory (absolute path is recommended). For sharded models, the first shard must be present and named according to the GGUF naming convention (`*-00001-of-00002.gguf`).
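The shard rule above can be sketched as a small filter. This is only an illustration: the function name `benchmarkable_models` and the exact regular expression are assumptions, based on the five-digit `-00001-of-00002` pattern mentioned in this section.

```python
import re

# Assumed shard suffix, e.g. "-00001-of-00002.gguf".
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def benchmarkable_models(names):
    """Keep unsharded .gguf files and only the first shard of sharded models."""
    keep = []
    for name in sorted(names):
        if not name.endswith(".gguf"):
            continue  # ignore anything that is not a GGUF file
        match = SHARD_RE.search(name)
        if match is None or int(match.group(1)) == 1:
            keep.append(name)
    return keep
```

A typical call would pass the file names found in `models/`, for example `benchmarkable_models(p.name for p in pathlib.Path("models").iterdir())`.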
 ### Prompt Processing (pp512) — tokens/second
 | Model | Vulkan (RADV) | Vulkan (AMDVLK) | ROCm 6.4.2 | ROCm 7.0 Beta | ROCm 7.0 RC | Winner |
 |---|---|---|---|---|---|---|
 | **gemma-3-12b-it-UD-Q8_K_XL** | 508.55 ± 0.90 | 683.07 ± 1.03 | 223.36 ± 0.23 | 222.95 ± 0.15 | 222.99 ± 0.24 | 🏆 **vulkan_amdvlk** (+34%) |
 | **gemma-3-27b-it-BF16** | 135.40 ± 0.29 | ⚠️ Load Error | 88.73 ± 0.50 | 82.31 ± 0.29 | 83.18 ± 0.41 | 🏆 **vulkan_radv** (+53%) |
 | **gemma-3-4b-it-Q3_K_S** | 1520.07 ± 5.39 | 1616.55 ± 4.61 | 729.02 ± 0.82 | 729.93 ± 1.29 | 728.63 ± 1.23 | 🏆 **vulkan_amdvlk** (+6%) |
 | **GLM-4.5-Air-UD-Q4_K_XL** | 128.00 ± 0.23 | 199.54 ± 0.38 | ⚠️ Runtime Error | ⚠️ GPU Hang | 129.20 ± 0.38 | 🏆 **vulkan_amdvlk** (+54%) |
 | **GLM-4.5-Air-UD-Q6_K_XL** | 126.86 ± 0.40 | 221.02 ± 0.58 | 124.86 ± 0.54 | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_amdvlk** (+74%) |
 | **gpt-oss-120b-F16** | 230.32 ± 0.72 | 449.22 ± 1.12 | ⚠️ GPU Hang | 357.68 ± 1.49 | 355.47 ± 0.55 | 🏆 **vulkan_amdvlk** (+26%) |
 | **gpt-oss-120b-mxfp4** | 239.16 ± 1.26 | 485.98 ± 2.23 | 352.53 ± 1.06 | ⚠️ GPU Hang | 351.08 ± 0.86 | 🏆 **vulkan_amdvlk** (+38%) |
 | **gpt-oss-20b-F32** | 318.82 ± 1.63 | 369.86 ± 1.57 | 323.64 ± 4.29 | 324.15 ± 3.76 | 324.27 ± 5.39 | 🏆 **vulkan_amdvlk** (+14%) |
 | **gpt-oss-20b-mxfp4** | 646.77 ± 4.63 | 1206.08 ± 8.80 | 580.67 ± 2.03 | 584.04 ± 2.48 | 584.15 ± 2.11 | 🏆 **vulkan_amdvlk** (+86%) |
 | **Kimi-Dev-72B-UD-Q8_K_XL** | 76.48 ± 0.23 | ⚠️ Load Error | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_radv** |
 | **Llama-3.3-70B-Instruct-UD-Q8_K_XL** | 79.71 ± 0.13 | 96.23 ± 0.16 | 33.17 ± 0.07 | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_amdvlk** (+21%) |
 | **Llama-4-Scout-17B-16E-Instruct-Q6_K** | 137.97 ± 0.99 | 243.19 ± 1.20 | 121.52 ± 0.98 | ⚠️ GPU Hang | 135.36 ± 0.39 | 🏆 **vulkan_amdvlk** (+76%) |
 | **Llama-4-Scout-17B-16E-Instruct-Q8_0** | 145.86 ± 2.44 | 238.93 ± 2.89 | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_amdvlk** (+64%) |
 | **Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL** | 133.49 ± 1.83 | 208.84 ± 1.35 | 132.66 ± 0.56 | 133.71 ± 0.64 | ⚠️ Runtime Error | 🏆 **vulkan_amdvlk** (+56%) |
 | **llama3.3-70.6B-Q4_K_M** | 79.12 ± 0.14 | 72.75 ± 0.03 | 33.89 ± 0.03 | 33.91 ± 0.04 | 33.82 ± 0.05 | 🏆 **vulkan_radv** (+9%) |
 | **Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL** | 58.40 ± 0.21 | 99.94 ± 0.91 | 69.48 ± 0.09 | ⚠️ GPU Hang | 74.69 ± 0.17 | 🏆 **vulkan_amdvlk** (+34%) |
 | **Qwen3-30B-A3B-BF16** | 71.16 ± 0.92 | 90.91 ± 0.35 | 157.74 ± 2.65 | 151.25 ± 3.33 | 154.95 ± 1.58 | 🏆 **rocm6_4_2** (+2%) |
 | **Qwen3-Coder-30B-A3B-Instruct-BF16** | 71.53 ± 1.06 | 90.38 ± 0.57 | 150.53 ± 1.83 | 147.31 ± 2.22 | 144.59 ± 3.08 | 🏆 **rocm6_4_2** (+2%) |
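The percentages in the Winner column appear to be the winner's margin over the second-fastest backend (for example, 683.07 / 508.55 ≈ 1.34, shown as +34%). The sketch below reproduces that reading; the function name `winner_with_margin` is hypothetical, and the rule itself is inferred from the table rather than taken from the scripts.

```python
def winner_with_margin(results):
    """results maps backend name -> mean tokens/sec, or None for a failed run.

    Returns (winner, margin_percent); margin_percent is None when there is
    no runner-up to compare against.
    """
    ok = sorted(
        ((tps, env) for env, tps in results.items() if tps is not None),
        reverse=True,
    )
    if not ok:
        return None, None
    best_tps, best_env = ok[0]
    if len(ok) == 1:
        return best_env, None  # only one backend succeeded
    runner_up_tps = ok[1][0]
    margin = round((best_tps / runner_up_tps - 1) * 100)
    return best_env, margin
```

Rows where only one backend succeeded, such as Kimi-Dev-72B, have no runner-up, which matches the missing percentage in the table.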
 ### Text Generation (tg128) — tokens/second
 | Model | Vulkan (RADV) | Vulkan (AMDVLK) | ROCm 6.4.2 | ROCm 7.0 Beta | ROCm 7.0 RC | Winner |
 |---|---|---|---|---|---|---|
 | **gemma-3-12b-it-UD-Q8_K_XL** | 13.65 ± 0.02 | 13.84 ± 0.02 | 13.81 ± 0.00 | 13.80 ± 0.00 | 13.81 ± 0.00 | 🏆 **vulkan_amdvlk** (+0%) |
 | **gemma-3-27b-it-BF16** | 3.98 ± 0.00 | ⚠️ Load Error | 4.02 ± 0.00 | 3.99 ± 0.01 | 3.99 ± 0.00 | 🏆 **rocm6_4_2** (+1%) |
 | **gemma-3-4b-it-Q3_K_S** | 85.93 ± 0.09 | 83.89 ± 0.22 | 76.04 ± 0.03 | 76.52 ± 0.03 | 75.59 ± 0.03 | 🏆 **vulkan_radv** (+2%) |
 | **GLM-4.5-Air-UD-Q4_K_XL** | 22.88 ± 0.02 | 22.75 ± 0.01 | ⚠️ Runtime Error | ⚠️ GPU Hang | 19.61 ± 0.00 | 🏆 **vulkan_radv** (+1%) |
 | **GLM-4.5-Air-UD-Q6_K_XL** | 16.76 ± 0.00 | 16.47 ± 0.01 | 15.27 ± 0.00 | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_radv** (+2%) |
 | **gpt-oss-120b-F16** | 33.06 ± 0.02 | 33.49 ± 0.05 | ⚠️ GPU Hang | 33.70 ± 0.01 | 33.65 ± 0.00 | 🏆 **rocm7_beta** (+0%) |
 | **gpt-oss-120b-mxfp4** | 48.93 ± 0.06 | 48.09 ± 0.04 | 43.56 ± 0.00 | ⚠️ GPU Hang | 44.63 ± 0.03 | 🏆 **vulkan_radv** (+2%) |
 | **gpt-oss-20b-F32** | 7.77 ± 0.01 | 8.59 ± 0.01 | 26.64 ± 0.06 | 26.90 ± 0.00 | 26.86 ± 0.00 | 🏆 **rocm7_beta** (+0%) |
 | **gpt-oss-20b-mxfp4** | 69.82 ± 0.03 | 68.90 ± 0.18 | 64.26 ± 0.01 | 64.37 ± 0.01 | 64.38 ± 0.01 | 🏆 **vulkan_radv** (+1%) |
 | **Kimi-Dev-72B-UD-Q8_K_XL** | 2.65 ± 0.00 | ⚠️ Load Error | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_radv** |
 | **Llama-3.3-70B-Instruct-UD-Q8_K_XL** | 2.72 ± 0.00 | 2.72 ± 0.00 | 2.72 ± 0.00 | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **rocm6_4_2** (+0%) |
 | **Llama-4-Scout-17B-16E-Instruct-Q6_K** | 15.07 ± 0.05 | 15.28 ± 0.03 | 14.28 ± 0.00 | ⚠️ GPU Hang | 14.29 ± 0.00 | 🏆 **vulkan_amdvlk** (+1%) |
 | **Llama-4-Scout-17B-16E-Instruct-Q8_0** | 12.27 ± 0.00 | 12.25 ± 0.01 | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 **vulkan_radv** (+0%) |
 | **Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL** | 19.99 ± 0.01 | 20.06 ± 0.01 | 17.29 ± 0.00 | 17.35 ± 0.00 | ⚠️ Runtime Error | 🏆 **vulkan_amdvlk** (+0%) |
 | **llama3.3-70.6B-Q4_K_M** | 4.97 ± 0.00 | 5.01 ± 0.00 | 4.59 ± 0.00 | 4.60 ± 0.00 | 4.52 ± 0.00 | 🏆 **vulkan_amdvlk** (+1%) |
 | **Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL** | 16.29 ± 0.01 | 15.72 ± 0.01 | 13.54 ± 0.01 | ⚠️ GPU Hang | 13.56 ± 0.00 | 🏆 **vulkan_radv** (+4%) |
 | **Qwen3-30B-A3B-BF16** | 7.33 ± 0.00 | 7.96 ± 0.03 | 22.88 ± 0.01 | 23.80 ± 0.09 | 23.08 ± 0.08 | 🏆 **rocm7_beta** (+3%) |
 | **Qwen3-Coder-30B-A3B-Instruct-BF16** | 7.34 ± 0.01 | 8.00 ± 0.03 | 22.13 ± 0.00 | 24.12 ± 0.06 | 23.48 ± 0.01 | 🏆 **rocm7_beta** (+3%) |
 ##### Error Legend
 * `⚠️ Load Error` — Model failed to load in this environment (usually OOM or driver error)
 * `⚠️ GPU Hang` — GPU hung during inference (may work outside stress test)
 * `⚠️ Runtime Error` — Miscellaneous runtime failure (check logs)
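A log can be mapped to one of these labels roughly as follows. The regular expressions are the ones used by the earlier log-parsing version of the table script in this commit; whether they cover every real failure message is an assumption, and the `classify` helper is a hypothetical name.

```python
import re

# Patterns taken from the previous log-parsing version of the table script.
LOAD_ERR = re.compile(r"failed to load model|Device memory allocation.*failed", re.IGNORECASE)
HANG_ERR = re.compile(r"GPU Hang|HW Exception", re.IGNORECASE)
GEN_ERR = re.compile(r"error:|exit \d+", re.IGNORECASE)

ERROR_LABELS = {
    "load": "⚠️ Load Error",
    "hang": "⚠️ GPU Hang",
    "runtime": "⚠️ Runtime Error",
}


def classify(log_text, has_results=False):
    """Return 'load', 'hang', 'runtime', or None for a clean log.

    has_results should be True when both pp512 and tg128 values were parsed;
    a generic error line is then treated as noise rather than a failure.
    """
    if LOAD_ERR.search(log_text):
        return "load"
    if HANG_ERR.search(log_text):
        return "hang"
    if GEN_ERR.search(log_text) and not has_results:
        return "runtime"
    return None
```

Load and hang patterns are checked first because a failed load or a hang usually also produces a generic error line.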
 ### 2.2. Model Loading Time Benchmarks
 #### 2.2.1. Script: `run_loadtime_benchmark.sh`
 This script benchmarks **model load + single-token inference** (using `llama-cli`) for every backend, using a minimal prompt. Three runs per combination are averaged.
 ##### Command Used
 ```bash
 llama-cli -ngl 999 -fa --no-mmap -no-cnv -n 1 -m /path/to/model.gguf -p "Hello"
 ```
 * `-ngl 999` — Offload up to 999 layers to the GPU (in practice, every layer)
 * `-fa` — Enable Flash Attention
 * `--no-mmap` — Disable mmap (ensures all RAM usage is counted)
 * `-no-cnv` — Disable conversation mode, so the run exits after generating instead of waiting for interactive input
 * `-n 1` — Generate only one token (measures load + first inference)
 * `-m` — Path to GGUF model
 * `-p` — Prompt text ("Hello")
 Script location: `benchmark/run_loadtime_benchmark.sh`
 Logs: `benchmark/loadtime_results/`
 #### 2.2.2. Results: Model Load + First Token (Seconds, Lower is Better)
 | Model | Vulkan (RADV) | Vulkan (AMDVLK) | ROCm 6.4.2 | ROCm 7.0 Beta | ROCm 7.0 RC | Fastest |
 |---|---|---|---|---|---|---|
 | **gemma-3-12b-it-UD-Q8_K_XL** | 4.29s | 3.96s | 6.69s | 3.43s | 3.86s | 🏆 **rocm7_beta** |
 | **gemma-3-27b-it-BF16-00001-of-00002** | 13.58s | ⚠️ Fail | 12.49s | 10.49s | 10.42s | 🏆 **rocm7_rc** |
 | **Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002** | 30.59s | ⚠️ Fail | 35.30s | 30.02s | 26.36s | 🏆 **rocm7_rc** |
 | **Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002** | 30.38s | 30.60s | 31.00s | 32.80s | 32.91s | 🏆 **vulkan_radv** |
 | **Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002** | 32.81s | 35.54s | 31.79s | 28.22s | 28.43s | 🏆 **rocm7_beta** |
 | **Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003** | 41.63s | 47.97s | 40.74s | 36.40s | 35.74s | 🏆 **rocm7_rc** |
 | **Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002** | 20.05s | 16.75s | 15.78s | ⚠️ Fail | 19.36s | 🏆 **rocm6_4_2** |
 | **llama3.3-70.6B-Q4_K_M** | 8.82s | 9.18s | 9.89s | 9.34s | 14.60s | 🏆 **vulkan_radv** |
 | **Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003** | 40.72s | 44.88s | 39.06s | 35.39s | 33.46s | 🏆 **rocm7_rc** |
 | **Qwen3-30B-A3B-BF16-00001-of-00002** | 14.76s | 12.94s | 22.17s | 15.93s | 22.67s | 🏆 **vulkan_amdvlk** |
 | **Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002** | 14.02s | 12.94s | 17.78s | 14.39s | 16.16s | 🏆 **vulkan_amdvlk** |
 ##### Error Legend
 * `⚠️ Fail` — Model failed to load (OOM or crash). May succeed if not under stress/test conditions.
 ---
-## 3. Interpreting the Results & Caveats
+## Benchmark methodology
-* **Vulkan AMDVLK** generally gives the best performance for small/medium models, but ROCm 7.x improves as model size increases.
+* **pp512** — prompt processing throughput (tokens/sec)
-* **Vulkan RADV** is highly reliable and competitive on large models (esp. if AMDVLK fails to load).
+* **tg128** — text generation throughput (tokens/sec)
-* **ROCm** (especially 7.0 RC) delivers the fastest load times for the largest models.
+* Each backend tested twice:
 * Many models that fail under `llama-bench` (e.g., due to GPU hangs or OOM) can sometimes still run interactively (especially outside a stress-test context).
-## 4. How to Reproduce These Benchmarks
+  * FA off: `-fa 0`
  * FA on:  `-fa 1`
 * Winners are determined per model using the pooled ± error of both results; multiple winners are possible.
-* Place all GGUF models in your `models/` directory.
+Tested backends:
 * Use the scripts from the `benchmark/` folder:
-  * `run_benchmarks.sh` for inference throughput
+* Vulkan RADV
-  * `run_loadtime_benchmark.sh` for loading times
+* Vulkan AMDVLK
-* Output logs and tables will be written in `benchmark/results/` and `benchmark/loadtime_results/`.
+* ROCm 6.4.2
 * ROCm 6.4.2 + rocWMMA
 * ROCm 7.x (beta / rc)
 All runs built from the same llama.cpp commit.
 ---
 ## Running benchmarks
 Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`).
 Run:
 ```bash
 benchmark/run_benchmarks.sh
 ```
 This will:
 * Detect models
 * Execute each backend twice (FA off / FA on)
 * Save logs in `benchmark/results/`
 Generate `results.json` for analysis:
 ```bash
 python benchmark/parse_results_to_json.py
 ```
 Optional: print summary statistics:
 ```bash
 python benchmark/summarize_results.py
 ```
 ---
 ## Summary of current dataset
 ### pp512 (prompt processing)
 * **Vulkan AMDVLK** leads in average throughput and most frequent wins.
  * Winner count: AMDVLK (FA on) – 11 models; AMDVLK (FA off) – 3 models.
  * Average t/s: AMDVLK (FA off) – 422.46; AMDVLK (FA on) – 388.68.
 * **Vulkan RADV** is competitive and shows wins on multiple models.
  * Winner count: RADV (FA on) – 3 models.
  * Average t/s: RADV (FA on) – 279.95; RADV (FA off) – 273.54.
 * **ROCm 6.4.2 + rocWMMA** is strong in some cases.
  * Winner count: 2 models (FA on).
  * Average t/s: rocWMMA (FA on) – 335.44.
 * ROCm 7.x variants trail in pp512 averages.
 **Conclusion:** AMDVLK is generally fastest for prompt processing. RADV is close on certain models and is less prone to instability. ROCm+rocWMMA can match or exceed AMDVLK in select cases but is inconsistent.
 ---
 ### tg128 (text generation)
 * **Vulkan RADV** shows the most frequent wins.
  * Winner count: RADV (FA off) – 6 models; RADV (FA on) – 5 models.
  * Average t/s: RADV (FA off) – 23.73; RADV (FA on) – 23.45.
 * **Vulkan AMDVLK** wins in some cases but is less dominant than in pp512.
  * Winner count: AMDVLK (FA off) – 4 models.
  * Average t/s: AMDVLK (FA off) – 25.91; AMDVLK (FA on) – 23.85.
 * **ROCm 6.4.2 + rocWMMA** achieves the highest average t/s.
  * Average t/s: rocWMMA (FA on) – 32.51; rocWMMA (FA off) – 31.96.
 * ROCm 7.x and ROCm 6.4.2 also appear among winners in several models.
 **Conclusion:** RADV is the most consistent for text generation wins. ROCm+rocWMMA delivers the highest averages but with potential stability issues. AMDVLK is competitive but not consistently ahead.
 ---
 ## Flash Attention (FA)
 FA effects vary:
 * In pp512 averages, AMDVLK performs better without FA.
 * In tg128, the effect depends on backend and model.
  FA should be treated as a per-model tuning parameter rather than enabled or disabled globally.
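One way to evaluate FA per model is to pair the FA-off and FA-on runs from `results.json` and look at the relative change. The sketch below assumes the record fields used by the scripts in this repository (`model_clean`, `env`, `fa`, `test`, `tps_mean`, `error`); the function name `fa_deltas` is hypothetical.

```python
from collections import defaultdict


def fa_deltas(runs, test="tg128"):
    """Return {(model, env): percent change in tokens/sec from FA off to FA on}."""
    pairs = defaultdict(dict)
    for r in runs:
        if r["error"] or r["test"] != test or r["tps_mean"] is None:
            continue  # skip failed runs and other test types
        pairs[(r["model_clean"], r["env"])][bool(r["fa"])] = r["tps_mean"]

    deltas = {}
    for key, by_fa in pairs.items():
        # Only report combinations where both settings produced a result.
        if True in by_fa and False in by_fa:
            deltas[key] = round((by_fa[True] / by_fa[False] - 1) * 100, 1)
    return deltas
```

A positive value means FA helped for that model and backend, a negative value means it hurt. Combinations with only one successful run are left out rather than guessed.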
 ---
 ## Recommendations
 * **Stability priority:** Vulkan RADV.
 * **Maximum pp512 throughput:** Vulkan AMDVLK; validate per model.
 * **High tg128 averages:** ROCm 6.4.2 + rocWMMA; test stability first.
 * **FA setting:** Evaluate per model/backend using side-by-side comparison.
 ---
 ## Winner calculation
 A backend is a winner if its mean throughput is within the best backend’s pooled ± error margin for that model and test type.
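A minimal sketch of this rule follows, assuming that "pooled" means the root-sum-of-squares of the two standard deviations. That interpretation is an assumption; `summarize_results.py` in this commit uses only the best backend's standard deviation. The function name `pooled_winners` and the multiplier `k` are hypothetical.

```python
import math


def pooled_winners(entries, k=1.0):
    """entries: list of (backend, mean_tps, std_tps); returns all winning backends.

    A backend wins when its mean is within k * pooled error of the best mean,
    where pooled error = sqrt(best_std**2 + contender_std**2) (assumed definition).
    """
    scored = [(b, m, s or 0.0) for b, m, s in entries if m is not None]
    if not scored:
        return []
    _, best_mean, best_std = max(scored, key=lambda e: e[1])

    winners = []
    for backend, mean_tps, std in scored:
        pooled = math.sqrt(best_std ** 2 + std ** 2)
        if best_mean - mean_tps <= k * pooled:
            winners.append(backend)
    return winners
```

With this rule the best backend always wins, and any backend whose gap to the best is smaller than the combined measurement noise shares the win.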