diff --git a/.github/workflows/build_and_publish.yml b/.github/workflows/build_and_publish.yml index 8973d36..ab45acc 100644 --- a/.github/workflows/build_and_publish.yml +++ b/.github/workflows/build_and_publish.yml @@ -28,7 +28,7 @@ jobs: IN='${{ inputs.backends }}' if [[ "$IN" == "all" || -z "$IN" ]]; then - JSON='["rocm-6.4.2","rocm-6.4.2-rocwmma","rocm-6.4.3","rocm-6.4.3-rocwmma","rocm-7rc","rocm-7rc-rocwmma","vulkan-amdvlk","vulkan-radv"]' + JSON='["rocm-6.4.3","rocm-6.4.3-rocwmma","rocm-7rc","rocm-7rc-rocwmma","vulkan-amdvlk","vulkan-radv"]' else # Remove spaces and build JSON array from comma list IN_CLEAN=$(echo "$IN" | tr -d '[:space:]') diff --git a/README.md b/README.md index 115e0a0..10704ba 100644 --- a/README.md +++ b/README.md @@ -47,18 +47,16 @@ You can check the containers on DockerHub: https://hub.docker.com/r/kyuz0/amd-st | -------------------- | ------------------------ | --------------- | | `vulkan-amdvlk` | Vulkan (AMDVLK) | Fastest backend—AMD open-source driver. ≤2 GiB single buffer allocation limit, some large models won't load. | | `vulkan-radv` | Vulkan (Mesa RADV) | Most stable and compatible. Recommended for most users and all models. | -| `rocm-6.4.2` | ROCm 6.4.2 (HIP) | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. | -| `rocm-6.4.2-rocwmma` | ROCm 6.4.2 (HIP) + ROCWMMA | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. | | `rocm-6.4.3` | ROCm 6.4.3 (HIP) + hipBLASLt* | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. | | `rocm-6.4.3-rocwmma` | ROCm 6.4.3 (HIP) + ROCWMMA + hipBLASLt* | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. | -| `rocm-7rc` | ROCm 7.0 RC (HIP) + hipBLASLt* | Release candidate for ROCm 7.0. Same behavior as beta. | +| `rocm-7rc` | ROCm 7.0 RC (HIP) + hipBLASLt* | Release candidate for ROCm 7.0. 
| | `rocm-7rc-rocwmma` | ROCm 7.0 RC (HIP) + ROCWMMA + hipBLASLt* | Release candidate for ROCm 7.0, with hipBLASLt and ROCWMMA for improved flash attention on RDNA3+/CDNA | \* All these toolboxes now export `ROCBLAS_USE_HIPBLASLT=1` as this currently results in better perfromance and stability in *MOST* cases. > These containers are **automatically** rebuilt whenever the Llama.cpp master branch is updated, ensuring you get the latest bug fixes and new model support. The easiest way to update to the newest versions is by running the `refresh-toolboxes.sh` [script below](#211-toolbox-refresh-script-automatic-updates). -> *Each container is based on Fedora Rawhide and is built for maximum compatibility and performance on Strix Halo.* +> *rocm-6.4.2* and *rocm-7beta* coontainers have been retired in favour of *rocm-6.4.3* and *rocm_7rc*. --- @@ -80,8 +78,8 @@ To use Llama.cpp with hardware acceleration inside a toolbox container, you must * **For ROCm:** You must expose both `/dev/dri` and `/dev/kfd`, and add the user to extra groups for compute access. ```sh - toolbox create llama-rocm-6.4.2 \ - --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \ + toolbox create llama-rocm-6.4.3-rocwmma \ + --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.3-rocwmma \ -- --device /dev/dri --device /dev/kfd \ --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined ``` @@ -114,7 +112,7 @@ This will: You can also refresh just one or more toolboxes: ```bash -./refreshtoolboxes.sh llama-vulkan-amdvlk llama-rocm-6.4.2 +./refreshtoolboxes.sh llama-vulkan-radv llama-rocm-6.4.3-rocwmma ``` ### 2.2 Running models inside the toolboxes @@ -150,39 +148,38 @@ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B ## 3. Performance Benchmarks (Key Results) -Benchmarks were run on **AMD Ryzen AI Max “Strix Halo”** across all supported backends, testing both **prompt processing (PP)** and **token generation (TG)** throughput. 
-Reported values were analysed using error margins (mean ± σ). Backends whose ranges overlapped were treated as statistical ties rather than hard wins. - 🌐 Interactive exploration of the latest benchmark runs: [Interactie Benchmark Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) +Benchmarks were analysed with **error-aware ties** (mean ± σ). If two backends overlap within margins, they are treated as a tie. All placement counts below use **Flash Attention ON**. -| Workload Focus | 🏆 Recommended Backend/Config | Win + Tie Count¹ | Typical Runner-Up | Stability Notes | -| ------------------------------------------------- | ----------------------------------- | ---------------: | ---------------------------------- | ------------------------------------------------------------------------------------- | -| **Prompt processing** (pp512, Flash Attention ON) | **ROCm 7 RC + ROCWMMA + hipBLASLt** | 15 | Vulkan AMDVLK (4) | 0% errors in tests | -| **Token generation** (tg128, Flash Attention ON) | **Vulkan RADV** | 13 | Vulkan AMDVLK (1) | 0% errors in tests | -| **Balanced workloads** | **Vulkan AMDVLK** | — | RADV / ROCm 7 RC+ROCWMMA+hipBLASLt | Fast PP & decent TG; \~5.6 % load failure rate due to ≤ 2 GiB single-allocation limit | -| **BF16 models** | **ROCm 7 RC + ROCWMMA + hipBLASLt** | — | ROCm 6.4.2 + ROCWMMA | Best PP & TG among ROCm backends; stable with Flash Attention ON | +**Prompt Processing (pp512)** +| Backend | 1st | 2nd | 3rd | +| --- | ---: | ---: | ---: | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 9 | 5 | 0 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 3 | 3 | 8 | +| Vulkan AMDVLK | 3 | 0 | 2 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 8 | 4 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 0 | 0 | 1 | +| Vulkan RADV | 0 | 0 | 1 | -¹ Counts show number of times the backend placed 1st (alone or tied) across tested models/quantisations. 
+**Token Generation (tg128)** +| Backend | 1st | 2nd | 3rd | +| --- | ---: | ---: | ---: | +| Vulkan RADV | 13 | 0 | 0 | +| ROCm 6.4.3 (hipBLASLt) | 3 | 0 | 1 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1 | 4 | 3 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 1 | 2 | 4 | +| ROCm 6.4.3 (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC (hipBLASLt) | 1 | 0 | 4 | +| Vulkan AMDVLK | 0 | 10 | 0 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 0 | 1 | 2 | - -### Key take-aways - -* **ROCm 7 RC + ROCWMMA + hipBLASLt + Flash Attention ON** - * Fastest prompt processing in the vast majority of tests (15/22 wins or ties). - * Best ROCm option for BF16 models. - * Zero recorded errors with Flash Attention ON. - -* **Vulkan RADV** - * Best token generation throughput (13/15 wins or ties). - * Most stable and broadly compatible backend overall. - -* **Vulkan AMDVLK** - - * Competitive in both PP and TG; benefits from margin-aware tie handling. - * Limited by ≤ 2 GiB single buffer allocation, which can block some model architectures. - * Other ROCm variants (beta, hblt0, 6.4.2 w/o ROCWMMA) - * Inconsistent performance and/or higher error rates; best suited for experimental use. +### Summary & Recommendations +- **Fastest prompt processing:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (most 1st-place finishes). +- **Fastest token generation:** Vulkan RADV (most 1st-place finishes). +- **Balanced choice:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (consistently near the top across PP/TG). 
📄 Full per-model analysis: [docs/benchmarks.md](docs/benchmarks.md) diff --git a/benchmark/generate_markdown_results.py b/benchmark/generate_markdown_results.py new file mode 100644 index 0000000..8651b79 --- /dev/null +++ b/benchmark/generate_markdown_results.py @@ -0,0 +1,571 @@ +#!/usr/bin/env python3 +""" +gen_benchmarks_md.py — Generate Markdown for README + detailed benchmarks from results.json + +Defaults: +- Input JSON: ../docs/results.json +- Outputs: ./README_benchmarks_section.md and ./benchmarks_generated.md +""" + +from __future__ import annotations +import json +import argparse +import statistics as stats +from pathlib import Path +from collections import defaultdict +from typing import Dict, List, Tuple, Optional + +# === ENV LABELS === +ENV_LABEL: Dict[str, str] = { + # ROCm 7 RC + "rocm7_rc-rocwmma": "ROCm 7 RC + ROCWMMA + hipBLASLt", + "rocm7_rc": "ROCm 7 RC (hipBLASLt)", + "rocm7_rc-hblt0": "ROCm 7 RC (hipBLASLt OFF)", + "rocm7_rc-rocwmma-hblt0": "ROCm 7 RC + ROCWMMA (hipBLASLt OFF)", + + # ROCm 6.4.3 + "rocm6_4_3": "ROCm 6.4.3 (hipBLASLt)", + "rocm6_4_3-hblt0": "ROCm 6.4.3 (hipBLASLt OFF)", + "rocm6_4_3-rocwmma": "ROCm 6.4.3 + ROCWMMA (hipBLASLt)", + "rocm6_4_3-rocwmma-hblt0": "ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF)", + + # Vulkan + "vulkan_amdvlk": "Vulkan AMDVLK", + "vulkan_radv": "Vulkan RADV", +} + +TESTS = ["pp512", "tg128"] + +def md_row(values: List[str]) -> str: + return "| " + " | ".join(values) + " |" + + +def load_results(path: Path) -> Dict: + data = json.loads(path.read_text()) + assert "runs" in data and isinstance(data["runs"], list), "results.json must have a top-level 'runs' list" + return data + + +def envs_present(runs: List[Dict], only_env: Optional[List[str]], include_all_envs: bool) -> List[str]: + present = {r.get("env") for r in runs if r.get("env")} + if only_env: + present = present.intersection(set(only_env)) + if include_all_envs: + # Include even if not present (might appear 0 rows in tables) + envs = [e for e in 
ENV_LABEL.keys() if (not only_env or e in only_env)] + else: + envs = [e for e in ENV_LABEL.keys() if e in present and (not only_env or e in only_env)] + return envs + + +def fa_to_filter(fa: str) -> Optional[bool]: + fa = fa.lower().strip() + if fa == "on": + return True + if fa == "off": + return False + if fa == "any": + return None + raise ValueError("--fa must be on/off/any") + + +def margin_aware_placements( + runs: List[Dict], + envs: List[str], + test_filter: str, + fa_filter: Optional[bool] +) -> Tuple[Dict[str, Dict[str, int]], int]: + """ + Returns (placements, sample_count) + placements[env] -> {"first": n, "second": n, "third": n} + sample_count = number of model+quant comparisons considered + """ + placements = defaultdict(lambda: {"first": 0, "second": 0, "third": 0}) + # group by (model, quant) + grouped = defaultdict(list) + for r in runs: + if r.get("error"): + continue + if r.get("test") != test_filter: + continue + if fa_filter is not None and r.get("fa") != fa_filter: + continue + if r.get("env") not in envs: + continue + key = (r.get("model_clean"), r.get("quant")) + grouped[key].append(r) + + samples = 0 + for key, entries in grouped.items(): + # collate by env + env_groups = defaultdict(list) + for e in entries: + env_groups[e["env"]].append(e) + env_list = [e for e in envs if e in env_groups] # keep requested order + if len(env_list) < 2: + continue + + # summarize median mean ± median err per env + summary = {} + for env in env_list: + means = [x["tps_mean"] for x in env_groups[env] if x.get("tps_mean") is not None] + errs = [x.get("tps_err", 0.0) or 0.0 for x in env_groups[env]] + if not means: + continue + m = stats.median(means) + e = stats.median(errs) if errs else 0.0 + summary[env] = (m - e, m + e, m) + if len(summary) < 2: + continue + + samples += 1 + + # rank with overlap -> ties share rank + remaining = [env for env, _ in sorted(summary.items(), key=lambda kv: kv[1][2], reverse=True)] + assigned = {} + current_rank = 1 + while 
remaining and current_rank <= 3: + env0 = remaining[0] + low0, high0, _ = summary[env0] + tied = [env0] + for env in remaining[1:]: + low, high, _ = summary[env] + if not (low > high0 or high < low0): # overlap -> tie + tied.append(env) + for env in tied: + assigned[env] = current_rank + remaining = [e for e in remaining if e not in tied] + current_rank += 1 + + for env, rk in assigned.items(): + if rk == 1: + placements[env]["first"] += 1 + elif rk == 2: + placements[env]["second"] += 1 + elif rk == 3: + placements[env]["third"] += 1 + + return placements, samples + + +def pairwise_win_counts(runs: List[Dict], envA: str, envB: str, test: str, fa_filter: Optional[bool]) -> Tuple[int, int, int, int]: + A = {} + B = {} + for r in runs: + if r.get("error") or r.get("test") != test: + continue + if fa_filter is not None and r.get("fa") != fa_filter: + continue + key = (r.get("model_clean"), r.get("quant")) + if r.get("env") == envA: + A[key] = r["tps_mean"] + elif r.get("env") == envB: + B[key] = r["tps_mean"] + winsA = winsB = ties = 0 + for k in (set(A) & set(B)): + if A[k] > B[k]: + winsA += 1 + elif B[k] > A[k]: + winsB += 1 + else: + ties += 1 + total = winsA + winsB + ties + return winsA, winsB, ties, total + + +def average_ranks(place_dict: Dict[str, Dict[str, int]]) -> Dict[str, Optional[float]]: + avg = {} + for env, c in place_dict.items(): + total = c.get("first", 0) + c.get("second", 0) + c.get("third", 0) + if total == 0: + avg[env] = None + else: + avg[env] = round((1 * c.get("first", 0) + 2 * c.get("second", 0) + 3 * c.get("third", 0)) / total, 2) + return avg + + +def flash_attention_effect(runs: List[Dict], envs: List[str]) -> Dict[str, Dict[str, Dict[str, float]]]: + """ + Returns: effects[env][test] = {n_pairs, median_pct, min, max} + Based on paired model+quant runs (ON vs OFF). 
+ """ + model_pairs = defaultdict(lambda: defaultdict(dict)) # (env,test)->(model,quant)->{fa: tps} + for r in runs: + if r.get("error") or r.get("tps_mean") is None: + continue + if r.get("test") not in TESTS: + continue + if r.get("env") not in envs: + continue + model_key = (r.get("model_clean"), r.get("quant")) + model_pairs[(r["env"], r["test"])][model_key][r.get("fa")] = r["tps_mean"] + + summary = defaultdict(dict) + for (env, test), d in model_pairs.items(): + deltas = [] + for mk, vals in d.items(): + if True in vals and False in vals and vals[False] > 0: + deltas.append((vals[True] - vals[False]) / vals[False] * 100.0) + if deltas: + summary[env][test] = { + "n_pairs": len(deltas), + "median_pct": round(stats.median(deltas), 1), + "min": round(min(deltas), 1), + "max": round(max(deltas), 1), + } + return summary + + +def rocwmma_effect(runs: List[Dict], pairs_to_compare: List[Tuple[str, str, str]], tests: List[str]) -> List[Tuple[str, str, str, str, int, float]]: + """ + Compare ROCWMMA ON vs OFF with same hipBLASLt state. + Returns rows of (context_label, test, env_on, env_off, n_pairs, median_delta_pct) + where delta_pct = median(ON/OFF - 1)*100 over common model+quant. 
+ """ + rows = [] + for env_on, env_off, label in pairs_to_compare: + for test in tests: + data_on = defaultdict(list) + data_off = defaultdict(list) + for r in runs: + if r.get("error") or r.get("test") != test: + continue + if r.get("env") == env_on: + data_on[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"]) + elif r.get("env") == env_off: + data_off[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"]) + common = sorted(set(data_on) & set(data_off)) + if not common: + continue + ratios = [] + for k in common: + aon = stats.median(data_on[k]) + aoff = stats.median(data_off[k]) + if aoff > 0: + ratios.append(aon / aoff - 1.0) + if ratios: + rows.append((label, test, env_on, env_off, len(ratios), round(100 * stats.median(ratios), 1))) + return rows + + +def hipblaslt_effect(runs: List[Dict], pairs_to_compare: List[Tuple[str, str, str]], tests: List[str]) -> List[Tuple[str, str, str, str, int, float]]: + """ + Compare hipBLASLt ON vs OFF with same ROCWMMA state. + Returns rows of (context_label, test, env_on, env_off, n_pairs, median_delta_pct) + where delta_pct = median(ON/OFF - 1)*100 over common model+quant. 
+ """ + rows = [] + for env_on, env_off, label in pairs_to_compare: + for test in tests: + data_on = defaultdict(list) + data_off = defaultdict(list) + for r in runs: + if r.get("error") or r.get("test") != test: + continue + if r.get("env") == env_on: + data_on[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"]) + elif r.get("env") == env_off: + data_off[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"]) + common = sorted(set(data_on) & set(data_off)) + if not common: + continue + ratios = [] + for k in common: + aon = stats.median(data_on[k]) + aoff = stats.median(data_off[k]) + if aoff > 0: + ratios.append(aon / aoff - 1.0) + if ratios: + rows.append((label, test, env_on, env_off, len(ratios), round(100 * stats.median(ratios), 1))) + return rows + + +def amdvlk_vs_radv(runs: List[Dict], fa_filter: Optional[bool]) -> List[Tuple[str, int, int, int, int]]: + rows = [] + for test in TESTS: + wa, wr, ties, total = pairwise_win_counts(runs, "vulkan_amdvlk", "vulkan_radv", test, fa_filter) + rows.append((test, wa, wr, ties, total)) + return rows + + +def winners(place_dict: Dict[str, Dict[str, int]], slot="first") -> Tuple[List[str], int]: + max_count = max((c.get(slot, 0) for c in place_dict.values()), default=0) + win_list = [env for env, c in place_dict.items() if c.get(slot, 0) == max_count and max_count > 0] + return win_list, max_count + + +def human_list(envs: List[str]) -> str: + return ", ".join(ENV_LABEL.get(e, e) for e in envs) if envs else "—" + + +def build_readme_section( + envs: List[str], + pp_place: Dict[str, Dict[str, int]], + tg_place: Dict[str, Dict[str, int]], + fa_filter: Optional[bool] +) -> str: + # Winners + pp_wins, _ = winners(pp_place, "first") + tg_wins, _ = winners(tg_place, "first") + + lines: List[str] = [] + lines.append("## 3. 
Performance Benchmarks (Key Results)") + lines.append("") + lines.append("🌐 Interactive exploration of the latest benchmark runs: [Interactie Benchmark Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)") + lines.append("") + lines.append("Benchmarks were analysed with **error-aware ties** (mean ± σ). If two backends overlap within margins, they are treated as a tie. All placement counts below use **Flash Attention ON**.") + lines.append("") + + # Placement tables + def place_table(title: str, place_dict: Dict[str, Dict[str, int]]): + lines.append(f"**{title}**") + lines.append(md_row(["Backend", "1st", "2nd", "3rd"])) + lines.append(md_row(["---", "---:", "---:", "---:"])) + order = sorted(place_dict.items(), key=lambda kv: (-kv[1].get("first", 0), -kv[1].get("second", 0), kv[0])) + for env, c in order: + lines.append(md_row([ENV_LABEL.get(env, env), str(c.get("first", 0)), str(c.get("second", 0)), str(c.get("third", 0))])) + lines.append("") + + place_table("Prompt Processing (pp512)", pp_place) + place_table("Token Generation (tg128)", tg_place) + + # Data-driven recommendations + def total_score(c: Dict[str, int]) -> int: + # weight 1st more than 2nd + return c.get("first", 0) * 2 + c.get("second", 0) + + best_bal_score = -1 + balanced: List[str] = [] + for env in envs: + score = total_score(pp_place.get(env, {})) + total_score(tg_place.get(env, {})) + if score > best_bal_score: + best_bal_score = score + balanced = [env] + elif score == best_bal_score: + balanced.append(env) + + lines.append("### Summary & Recommendations") + lines.append(f"- **Fastest prompt processing:** {human_list(pp_wins)} (most 1st-place finishes).") + lines.append(f"- **Fastest token generation:** {human_list(tg_wins)} (most 1st-place finishes).") + lines.append(f"- **Balanced choice:** {human_list(balanced)} (consistently near the top across PP/TG).") + lines.append("") + lines.append("> **Note (ROCm 7):** Toolboxes enable **hipBLASLt** by default. 
The benchmark suite also runs **hipBLASLt OFF** variants to show its impact.") + return "\n".join(lines) + + +def build_benchmarks_doc( + runs: List[Dict], + envs: List[str], + pp_place: Dict[str, Dict[str, int]], + tg_place: Dict[str, Dict[str, int]], + fa_filter: Optional[bool], +) -> str: + lines: List[str] = [] + lines.append("# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)") + lines.append("") + lines.append("**Interactive results:** https://kyuz0.github.io/amd-strix-halo-toolboxes/") + lines.append("") + lines.append("## Table of Contents") + lines.append("- [Benchmark methodology](#benchmark-methodology)") + lines.append("- [Summary of current dataset (Flash Attention ON)](#summary-of-current-dataset-flash-attention-on)") + lines.append(" - [Placement counts](#placement-counts)") + lines.append(" - [Pairwise head-to-head wins](#pairwise-head-to-head-wins)") + lines.append(" - [Average ranks](#average-ranks)") + lines.append("- [Analyses by feature](#analyses-by-feature)") + lines.append(" - [Impact of Flash Attention](#impact-of-flash-attention)") + lines.append(" - [Impact of ROCWMMA](#impact-of-rocwmma)") + lines.append(" - [Impact of hipBLASLt](#impact-of-hipblaslt)") + lines.append(" - [Vulkan: AMDVLK vs RADV](#vulkan-amdvlk-vs-radv)") + lines.append("- [Recommendations](#recommendations)") + lines.append("- [Winner calculation](#winner-calculation)") + lines.append("") + lines.append("---") + lines.append("") + lines.append("## Benchmark methodology") + lines.append("") + lines.append("- **pp512** — prompt processing throughput (tokens/sec, prefill)") + lines.append("- **tg128** — token generation throughput (tokens/sec, interactive)") + lines.append("- Each backend tested twice per model: `-fa 0` and `-fa 1`") + lines.append("- Winners per model/test are **margin-aware**; multiple winners are possible when mean±σ overlap") + lines.append("- Built from the same llama.cpp commit for consistency") + lines.append("") + lines.append("**Backends in this 
dataset:** " + ", ".join(ENV_LABEL.get(e, e) for e in envs)) + lines.append("") + lines.append("**ROCm 7 hipBLASLt policy:** Toolboxes ship with **hipBLASLt enabled** by default (`ROCBLAS_USE_HIPBLASLT=1`). The benchmark script also runs **hipBLASLt OFF** variants (`-hblt0`) to measure its effect.") + lines.append("") + lines.append("---") + lines.append("") + lines.append("## Summary of current dataset (Flash Attention ON)") + lines.append("") + # Placement counts + lines.append("### Placement counts") + def place_block(title: str, place_dict: Dict[str, Dict[str, int]]): + lines.append(f"**{title}**") + lines.append(md_row(["Backend", "1st", "2nd", "3rd"])) + lines.append(md_row(["---", "---:", "---:", "---:"])) + order = sorted(place_dict.items(), key=lambda kv: (-kv[1].get("first", 0), -kv[1].get("second", 0), kv[0])) + for env, c in order: + lines.append(md_row([ENV_LABEL.get(env, env), str(c.get("first", 0)), str(c.get("second", 0)), str(c.get("third", 0))])) + lines.append("") + place_block("Prompt Processing (pp512)", pp_place) + place_block("Token Generation (tg128)", tg_place) + + # Pairwise wins + lines.append("### Pairwise head-to-head wins") + lines.append("For any model+quant where both backends succeeded, this counts who was faster (ties when equal).") + lines.append(md_row(["Comparison", "Test", "A wins", "B wins", "Ties", "Total"])) + lines.append(md_row(["---", "---", "---:", "---:", "---:", "---:"])) + pairs = [ + ("ROCm 7 RC + ROCWMMA + hipBLASLt", "Vulkan AMDVLK", "rocm7_rc-rocwmma", "vulkan_amdvlk"), + ("ROCm 7 RC + ROCWMMA + hipBLASLt", "Vulkan RADV", "rocm7_rc-rocwmma", "vulkan_radv"), + ("Vulkan AMDVLK", "Vulkan RADV", "vulkan_amdvlk", "vulkan_radv"), + ] + for labelA, labelB, envA, envB in pairs: + for test in TESTS: + a, b, t, total = pairwise_win_counts(runs, envA, envB, test, fa_filter) + lines.append(md_row([f"{labelA} vs {labelB}", test, str(a), str(b), str(t), str(total)])) + lines.append("") + + # Average ranks + lines.append("### 
Average ranks") + avg_pp = average_ranks(pp_place) + avg_tg = average_ranks(tg_place) + lines.append("**Prompt Processing (pp512)**") + lines.append(md_row(["Backend", "Avg Rank (↓ is better)"])) + lines.append(md_row(["---", "---:"])) + for env, val in sorted(avg_pp.items(), key=lambda kv: (kv[1] is None, kv[1] or 99)): + lines.append(md_row([ENV_LABEL.get(env, env), str(val) if val is not None else "—"])) + lines.append("") + lines.append("**Token Generation (tg128)**") + lines.append(md_row(["Backend", "Avg Rank (↓ is better)"])) + lines.append(md_row(["---", "---:"])) + for env, val in sorted(avg_tg.items(), key=lambda kv: (kv[1] is None, kv[1] or 99)): + lines.append(md_row([ENV_LABEL.get(env, env), str(val) if val is not None else "—"])) + lines.append("") + lines.append("---") + lines.append("") + lines.append("## Analyses by feature") + lines.append("") + + # Flash Attention effect + lines.append("### Impact of Flash Attention") + fa_eff = flash_attention_effect(runs, envs) + lines.append("Median % change when **Flash Attention ON vs OFF**, paired by model+quant, per backend:") + lines.append(md_row(["Backend", "pp512 Δ% (median, min..max, n)", "tg128 Δ% (median, min..max, n)"])) + lines.append(md_row(["---", "---", "---"])) + def fmt_eff(row: Optional[Dict[str, float]]) -> str: + return f"{row['median_pct']}% ({row['min']}..{row['max']}), n={row['n_pairs']}" if row else "—" + for env in envs: + row_pp = fa_eff.get(env, {}).get("pp512") + row_tg = fa_eff.get(env, {}).get("tg128") + lines.append(md_row([ENV_LABEL.get(env, env), fmt_eff(row_pp), fmt_eff(row_tg)])) + lines.append("") + + # ROCWMMA effect — check both ROCm 7 and 6.4.3 families if present + lines.append("### Impact of ROCWMMA") + rocwmma_pairs = [] + if "rocm7_rc-rocwmma" in envs and "rocm7_rc" in envs: + rocwmma_pairs.append(("rocm7_rc-rocwmma", "rocm7_rc", "ROCm 7 RC (hipBLASLt)")) + if "rocm7_rc-rocwmma-hblt0" in envs and "rocm7_rc-hblt0" in envs: + 
rocwmma_pairs.append(("rocm7_rc-rocwmma-hblt0", "rocm7_rc-hblt0", "ROCm 7 RC (hipBLASLt OFF)")) + if "rocm6_4_3-rocwmma" in envs and "rocm6_4_3" in envs: + rocwmma_pairs.append(("rocm6_4_3-rocwmma", "rocm6_4_3", "ROCm 6.4.3 (hipBLASLt)")) + if "rocm6_4_3-rocwmma-hblt0" in envs and "rocm6_4_3-hblt0" in envs: + rocwmma_pairs.append(("rocm6_4_3-rocwmma-hblt0", "rocm6_4_3-hblt0", "ROCm 6.4.3 (hipBLASLt OFF)")) + + rocwmma_rows = rocwmma_effect(runs, rocwmma_pairs, TESTS) + lines.append(md_row(["Context", "Test", "Compared Envs", "Pairs", "Median Δ%"])) + lines.append(md_row(["---", "---", "---", "---:", "---:"])) + for label, test, env_on, env_off, n, delta in rocwmma_rows: + lines.append(md_row([label, test, f"{ENV_LABEL.get(env_on, env_on)} vs {ENV_LABEL.get(env_off, env_off)}", str(n), f"{delta}%"])) + lines.append("") + + # hipBLASLt effect — for both ROCm 7 and 6.4.3 families + lines.append("### Impact of hipBLASLt") + hip_pairs = [] + if "rocm7_rc" in envs and "rocm7_rc-hblt0" in envs: + hip_pairs.append(("rocm7_rc", "rocm7_rc-hblt0", "ROCm 7 RC (no ROCWMMA)")) + if "rocm7_rc-rocwmma" in envs and "rocm7_rc-rocwmma-hblt0" in envs: + hip_pairs.append(("rocm7_rc-rocwmma", "rocm7_rc-rocwmma-hblt0", "ROCm 7 RC + ROCWMMA")) + if "rocm6_4_3" in envs and "rocm6_4_3-hblt0" in envs: + hip_pairs.append(("rocm6_4_3", "rocm6_4_3-hblt0", "ROCm 6.4.3 (no ROCWMMA)")) + if "rocm6_4_3-rocwmma" in envs and "rocm6_4_3-rocwmma-hblt0" in envs: + hip_pairs.append(("rocm6_4_3-rocwmma", "rocm6_4_3-rocwmma-hblt0", "ROCm 6.4.3 + ROCWMMA")) + + hip_rows = hipblaslt_effect(runs, hip_pairs, TESTS) + lines.append(md_row(["Context", "Test", "Compared Envs", "Pairs", "Median Δ%"])) + lines.append(md_row(["---", "---", "---", "---:", "---:"])) + for label, test, env_on, env_off, n, delta in hip_rows: + lines.append(md_row([label, test, f"{ENV_LABEL.get(env_on, env_on)} vs {ENV_LABEL.get(env_off, env_off)}", str(n), f"{delta}%"])) + lines.append("") + + # AMDVLK vs RADV + lines.append("### Vulkan: 
AMDVLK vs RADV") + lines.append("Head-to-head wins with selected Flash Attention filter:") + lines.append(md_row(["Test", "AMDVLK wins", "RADV wins", "Ties", "Total"])) + lines.append(md_row(["---", "---:", "---:", "---:", "---:"])) + for test, wa, wr, t, total in amdvlk_vs_radv(runs, fa_filter): + lines.append(md_row([test, str(wa), str(wr), str(t), str(total)])) + lines.append("") + lines.append("---") + lines.append("") + lines.append("## Recommendations") + pp_wins, _ = winners(pp_place, "first") + tg_wins, _ = winners(tg_place, "first") + lines.append(f"- **Fastest prompt processing:** {human_list(pp_wins)} (most 1st-place finishes with selected Flash Attention filter).") + lines.append(f"- **Fastest token generation:** {human_list(tg_wins)} (most 1st-place finishes with selected Flash Attention filter).") + # Balanced: highest (2*first + second) across PP+TG + def score(c: Dict[str, int]) -> int: + return c.get("first", 0) * 2 + c.get("second", 0) + best_bal = -1 + balanced: List[str] = [] + for env in envs: + s = score(pp_place.get(env, {})) + score(tg_place.get(env, {})) + if s > best_bal: + best_bal = s + balanced = [env] + elif s == best_bal: + balanced.append(env) + lines.append(f"- **Balanced choice:** {human_list(balanced)} (consistently near the top across PP/TG).") + lines.append("") + lines.append("---") + lines.append("") + lines.append("## Winner calculation") + lines.append("A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. 
This treats results within measurement noise as ties instead of false losses.")
+    return "\n".join(lines)
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--file", type=Path, default=Path("../docs/results.json"),
+                    help="Path to results.json (default: ../docs/results.json)")
+    ap.add_argument("--out-readme", type=Path, default=Path("./README_benchmarks_section.md"),
+                    help="Path to write README section Markdown (default: ./README_benchmarks_section.md)")
+    ap.add_argument("--out-bench", type=Path, default=Path("./benchmarks_generated.md"),
+                    help="Path to write detailed benchmarks Markdown (default: ./benchmarks_generated.md)")
+    ap.add_argument("--fa", choices=["on", "off", "any"], default="on",
+                    help="Flash Attention filter (default: on)")
+    ap.add_argument("--include-all-envs", action="store_true",
+                    help="Include envs even if not present in results.json")
+    ap.add_argument("--only-env", action="append",
+                    help="Restrict analysis to specific env keys (repeatable)")
+    args = ap.parse_args()
+
+    data = load_results(args.file)
+    runs: List[Dict] = data["runs"]
+    fa_filter = fa_to_filter(args.fa)
+    envs = envs_present(runs, args.only_env, args.include_all_envs)
+
+    pp_place, _ = margin_aware_placements(runs, envs, "pp512", fa_filter)
+    tg_place, _ = margin_aware_placements(runs, envs, "tg128", fa_filter)
+
+    readme_md = build_readme_section(envs, pp_place, tg_place, fa_filter)
+    args.out_readme.write_text(readme_md)
+
+    bench_md = build_benchmarks_doc(runs, envs, pp_place, tg_place, fa_filter)
+    args.out_bench.write_text(bench_md)
+
+    print(f"Wrote:\n - {args.out_readme}\n - {args.out_bench}")
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/benchmark/generate_readme_summary.py b/benchmark/generate_readme_summary.py
deleted file mode 100644
index 8a87d7b..0000000
--- a/benchmark/generate_readme_summary.py
+++ /dev/null
@@ -1,175 +0,0 @@
-#!/usr/bin/env python3
-import json, re
-from collections import defaultdict
-from pathlib import Path
-
-RESULTS_FILE = "../docs/results.json"
-
-# Column order + labels
-ENV_ORDER = [
-    "vulkan_amdvlk",
-    "vulkan_radv",
-    "rocm6_4_2",
-    "rocm6_4_2-rocwmma",
-    "rocm7_beta",
-    "rocm7_rc",
-]
-COL_NAMES = {
-    "vulkan_amdvlk": "Vulkan (AMDVLK)",
-    "vulkan_radv": "Vulkan (RADV)",
-    "rocm6_4_2": "ROCm 6.4.2",
-    "rocm6_4_2-rocwmma": "ROCm 6.4.2 + ROCWMMA",
-    "rocm7_beta": "ROCm 7.0 Beta",
-    "rocm7_rc": "ROCm 7.0 RC",
-}
-WINNER_NAMES = {
-    "vulkan_amdvlk": "AMDVLK",
-    "vulkan_radv": "RADV",
-    "rocm6_4_2": "ROCm6.4.2",
-    "rocm6_4_2-rocwmma": "ROCm6.4.2+ROCWMMA",
-    "rocm7_beta": "ROCm7 Beta",
-    "rocm7_rc": "ROCm7 RC",
-}
-ERROR_LABEL = {
-    "load": "⚠️ Load Error",
-    "hang": "⚠️ GPU Hang",
-    "runtime": "⚠️ Runtime Error",
-}
-
-DEFAULT_MODELS = [
-    ("Gemma3 12B Q8_0", "gemma-3-12b"),
-    ("Gemma3 27B BF16", "gemma-3-27b"),
-    ("Llama-4-Scout 17B Q8_0", "llama-4-scout-17b-16e-instruct-q8_0"),
-    ("Llama-4-Scout 17B Q4_K XL", "llama-4-scout-17b-16e-instruct-q4_k_xl"),
-    ("Qwen3 30B BF16", "qwen3-30b-a3b-bf16"),
-    ("Qwen3-235B Q3_K XL", "qwen3-235b-a22b"),
-    ("GLM-4.5-Air-Q4_K_XL", "glm-4.5-air-q4_k_xl"),
-    ("GLM-4.5-Air-Q6_K_XL", "glm-4.5-air-q6_k_xl"),
-    ("gpt-oss-120b-mxfp4", "gpt-oss-120b-mxfp4"),
-    ("gpt-oss-20b-mxfp4", "gpt-oss-20b-mxfp4"),
-]
-
-SHARD_RE = re.compile(r"-000\d+-of-000\d+", re.IGNORECASE)
-def norm_model(s: str) -> str:
-    s = (s or "").lower().replace("_", "-")
-    s = SHARD_RE.sub("", s)
-    s = s.replace("-ud", "")
-    return s
-
-raw = json.loads(Path(RESULTS_FILE).read_text(encoding="utf-8"))
-runs = raw["runs"]
-
-buckets = defaultdict(list)
-error_only = defaultdict(list)
-all_models = set()
-
-for r in runs:
-    env = r.get("env")
-    if env not in ENV_ORDER:
-        continue
-    mkey = norm_model(r.get("model_clean") or r.get("model") or "")
-    all_models.add(mkey)
-    test = r.get("test")
-    if test in ("pp512", "tg128"):
-        buckets[(mkey, env, test)].append(r)
-    else:
-        if r.get("error"):
-            error_only[(mkey, env)].append(r.get("error_type") or "runtime")
-
-def pick_best(rows):
-    best, best_val, fallback = None, -1, None
-    for r in rows:
-        if r.get("error"):
-            fallback = r
-            continue
-        v = r.get("tps_mean")
-        if isinstance(v, (int, float)) and v > best_val:
-            best_val, best = v, r
-    return best or fallback
-
-chosen = defaultdict(lambda: defaultdict(dict))
-for (mkey, env, test), rows in buckets.items():
-    chosen_row = pick_best(rows)
-    chosen[mkey][env][test] = chosen_row
-
-for (mkey, env), etypes in error_only.items():
-    if etypes:
-        if "load" in etypes:
-            chosen[mkey][env]["error_only"] = "load"
-        elif "hang" in etypes:
-            chosen[mkey][env]["error_only"] = "hang"
-        else:
-            chosen[mkey][env]["error_only"] = "runtime"
-
-def fa_tag(row):
-    if not row or row.get("error"):
-        return ""
-    fa = row.get("fa")
-    if fa is None:
-        return ""
-    return " (FA on)" if fa else " (FA off)"
-
-def format_cell(entry_dict):
-    pp = entry_dict.get("pp512")
-    tg = entry_dict.get("tg128")
-    for row in (pp, tg):
-        if row and row.get("error"):
-            return ERROR_LABEL.get(row.get("error_type") or "runtime", "⚠️ Error")
-    if not pp and not tg:
-        et = entry_dict.get("error_only")
-        if et:
-            return ERROR_LABEL.get(et, "⚠️ Error")
-        return "—"
-    def fmt(v):
-        return f"{int(round(v))}" if isinstance(v, (int, float)) else "—"
-    ppv = pp.get("tps_mean") if pp else None
-    tgv = tg.get("tps_mean") if tg else None
-    pp_suffix = fa_tag(pp)
-    tg_suffix = fa_tag(tg)
-    if isinstance(tgv, (int, float)):
-        return f"{fmt(ppv)} pp{pp_suffix} / {tgv:.1f} tg{tg_suffix}"
-    else:
-        return f"{fmt(ppv)} pp{pp_suffix} / — tg"
-
-def best_env_for(mkey, test):
-    best_env, best_val, best_row = None, -1, None
-    for env in ENV_ORDER:
-        row = chosen[mkey].get(env, {}).get(test)
-        if not row or row.get("error"):
-            continue
-        v = row.get("tps_mean")
-        if isinstance(v, (int, float)) and v > best_val:
-            best_env, best_val, best_row = env, v, row
-    return best_env, (best_row.get("fa") if best_row else None)
-
-def win_label(env, fa):
-    if not env:
-        return "—"
-    base = WINNER_NAMES[env]
-    if fa is None:
-        return f"🏆 **{base}**"
-    return f"🏆 **{base}** ({'FA on' if fa else 'FA off'})"
-
-def find_model_key(fuzzy):
-    needle = norm_model(fuzzy)
-    for k in all_models:
-        if needle in k:
-            return k
-    return None
-
-# Header now has Best PP & Best TG right after Model
-header = ["Model", "🏆 Best PP", "🏆 Best TG"] + [COL_NAMES[e] for e in ENV_ORDER]
-print("| " + " | ".join(header) + " |")
-print("|" + "|".join(["---"] * len(header)) + "|")
-
-for disp, fuzzy in DEFAULT_MODELS:
-    mkey = find_model_key(fuzzy)
-    if not mkey:
-        print("| " + " | ".join([f"**{disp}**", "—", "—"] + ["—"]*len(ENV_ORDER)) + " |")
-        continue
-    bpp_env, bpp_fa = best_env_for(mkey, "pp512")
-    btg_env, btg_fa = best_env_for(mkey, "tg128")
-    row = [f"**{disp}**", win_label(bpp_env, bpp_fa), win_label(btg_env, btg_fa)]
-    for env in ENV_ORDER:
-        row.append(format_cell(chosen[mkey].get(env, {})))
-    print("| " + " | ".join(row) + " |")
diff --git a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log
deleted file mode 100644
index f09c0a1..0000000
--- a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log
+++ /dev/null
@@ -1,172 +0,0 @@
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 1 ROCm devices:
-  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
-build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
-main: llama backend init
-main: load the model and apply lora adapter, if any
-llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
-llama_model_loader: additional 1 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen2 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B -llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 72B -llama_model_loader: - kv 6: general.license str = mit -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B -llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft... 
-llama_model_loader: - kv 13: qwen2.block_count u32 = 80 -llama_model_loader: - kv 14: qwen2.context_length u32 = 131072 -llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192 -llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568 -llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64 -llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
-llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt -llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 963 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 401 tensors -llama_model_loader: - type f16: 107 tensors -llama_model_loader: - type q8_0: 455 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 78.21 GiB (9.24 BPW) -load: special tokens cache size = 22 -load: token to piece cache size = 0.9310 MB -print_info: arch = qwen2 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 29568 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = -1 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: 
model params = 72.71 B -print_info: general.name = Kimi-Dev-72B -print_info: vocab type = BPE -print_info: n_vocab = 152064 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 77715.11 MiB -load_tensors: ROCm_Host model buffer size = 2376.00 MiB -................................................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 313.00 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2887 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1808727616 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello0 - -llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 31746.03 tokens per second) -llama_perf_context_print: load time = 31744.47 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 463.93 ms / 1 runs ( 463.93 ms per token, 2.16 tokens per second) -llama_perf_context_print: total time = 470.35 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 36.639378936s - Run #3 status: 0 - → Avg over 3 runs: 35.301s diff --git a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log deleted file mode 100644 index 0006a09..0000000 --- a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,172 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
-ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen2 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B -llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 72B -llama_model_loader: - kv 6: general.license str = mit -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B -llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft... 
-llama_model_loader: - kv 13: qwen2.block_count u32 = 80 -llama_model_loader: - kv 14: qwen2.context_length u32 = 131072 -llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192 -llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568 -llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64 -llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
-llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt -llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 963 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 401 tensors -llama_model_loader: - type f16: 107 tensors -llama_model_loader: - type q8_0: 455 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 78.21 GiB (9.24 BPW) -load: special tokens cache size = 22 -load: token to piece cache size = 0.9310 MB -print_info: arch = qwen2 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 29568 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = -1 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: 
model params = 72.71 B -print_info: general.name = Kimi-Dev-72B -print_info: vocab type = BPE -print_info: n_vocab = 152064 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 77715.11 MiB -load_tensors: ROCm_Host model buffer size = 2376.00 MiB -................................................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 313.00 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2887 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3691857665 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello0 - -llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 27027.03 tokens per second) -llama_perf_context_print: load time = 30932.72 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 559.63 ms / 1 runs ( 559.63 ms per token, 1.79 tokens per second) -llama_perf_context_print: total time = 566.03 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 32.156014765s - Run #3 status: 0 - → Avg over 3 runs: 30.024s diff --git a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log deleted file mode 100644 index cd42a6f..0000000 --- a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,172 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
-ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen2 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B -llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 72B -llama_model_loader: - kv 6: general.license str = mit -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B -llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft... 
-llama_model_loader: - kv 13: qwen2.block_count u32 = 80 -llama_model_loader: - kv 14: qwen2.context_length u32 = 131072 -llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192 -llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568 -llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64 -llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
-llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt -llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 963 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 401 tensors -llama_model_loader: - type f16: 107 tensors -llama_model_loader: - type q8_0: 455 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 78.21 GiB (9.24 BPW) -load: special tokens cache size = 22 -load: token to piece cache size = 0.9310 MB -print_info: arch = qwen2 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 29568 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = -1 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: 
model params = 72.71 B -print_info: general.name = Kimi-Dev-72B -print_info: vocab type = BPE -print_info: n_vocab = 152064 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 77715.11 MiB -load_tensors: ROCm_Host model buffer size = 2376.00 MiB -................................................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 313.00 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2887 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3133611532 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello0 - -llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second) -llama_perf_context_print: load time = 25127.98 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 383.37 ms / 1 runs ( 383.37 ms per token, 2.61 tokens per second) -llama_perf_context_print: total time = 389.90 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 26.238043008s - Run #3 status: 0 - → Avg over 3 runs: 26.362s diff --git a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index ffaaa4a..0000000 --- a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,123 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics 
(AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen2 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B -llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 72B -llama_model_loader: - kv 6: general.license str = mit -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B -llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft... 
-llama_model_loader: - kv 13: qwen2.block_count u32 = 80 -llama_model_loader: - kv 14: qwen2.context_length u32 = 131072 -llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192 -llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568 -llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64 -llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
-llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt -llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 963 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 401 tensors -llama_model_loader: - type f16: 107 tensors -llama_model_loader: - type q8_0: 455 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 78.21 GiB (9.24 BPW) -load: special tokens cache size = 22 -load: token to piece cache size = 0.9310 MB -print_info: arch = qwen2 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 29568 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = -1 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: 
model params = 72.71 B -print_info: general.name = Kimi-Dev-72B -print_info: vocab type = BPE -print_info: n_vocab = 152064 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -ggml_vulkan: Device memory allocation of size 2491416576 failed. 
-ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory -alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2491416576 -llama_model_load: error loading model: unable to allocate Vulkan0 buffer -llama_model_load_from_file_impl: failed to load model -common_init_from_params: failed to load model '/home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf' -main: error: unable to load model - Elapsed #3: .334893088s - Run #3 status: 1 - ✖ run #3 failed - → No successful runs diff --git a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log deleted file mode 100644 index dd58c7a..0000000 --- a/benchmark/loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,170 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = qwen2 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B -llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 72B -llama_model_loader: - kv 6: general.license str = mit -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B -llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft... -llama_model_loader: - kv 13: qwen2.block_count u32 = 80 -llama_model_loader: - kv 14: qwen2.context_length u32 = 131072 -llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192 -llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568 -llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64 -llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
-llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt -llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 963 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 401 tensors -llama_model_loader: - type f16: 107 tensors -llama_model_loader: - type q8_0: 455 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 78.21 GiB (9.24 BPW) -load: special tokens cache size = 22 -load: token to piece cache size = 0.9310 MB -print_info: arch = qwen2 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 29568 -print_info: n_expert = 0 -print_info: n_expert_used = 0 
-print_info: causal attn = 1 -print_info: pooling type = -1 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 72.71 B -print_info: general.name = Kimi-Dev-72B -print_info: vocab type = BPE -print_info: n_vocab = 152064 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: Vulkan0 model buffer size = 77715.09 MiB -load_tensors: Vulkan_Host model buffer size = 2376.00 MiB -................................................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 313.00 MiB -llama_context: Vulkan_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2887 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4071074447 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello beğen - -llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second) -llama_perf_context_print: load time = 29902.30 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 392.32 ms / 1 runs ( 392.32 ms per token, 2.55 tokens per second) -llama_perf_context_print: total time = 399.50 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 30.654893638s - Run #3 status: 0 - → Avg over 3 runs: 30.591s diff --git a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index fa41d28..0000000 --- a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,163 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: 
found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: llama.block_count u32 = 80 -llama_model_loader: - kv 9: llama.context_length u32 = 131072 -llama_model_loader: - kv 10: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 12: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 18: llama.vocab_size u32 = 
128256 -llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u... -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins... 
-llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 724 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q8_0: 455 tensors -llama_model_loader: - type bf16: 107 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 75.65 GiB (9.21 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama-3.3-70B-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: 
EOM token = 128008 '<|eom_id|>' -print_info: PAD token = 128004 '<|finetune_right_pad_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 75456.53 MiB -load_tensors: ROCm_Host model buffer size = 2004.00 MiB -................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias 
= -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 192699360 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 63829.79 tokens per second) -llama_perf_context_print: load time = 24487.91 ms -llama_perf_context_print: prompt eval time = 368.54 ms / 2 tokens ( 184.27 ms per token, 5.43 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 383.50 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 28.922457711s - Run #3 status: 0 - → Avg over 3 runs: 30.998s diff --git a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log deleted file mode 100644 index 611a7c5..0000000 --- 
a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,163 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: llama.block_count u32 = 80 -llama_model_loader: - kv 9: llama.context_length u32 = 131072 -llama_model_loader: - kv 10: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 12: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 15: 
llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u... -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins... 
-llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 724 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q8_0: 455 tensors -llama_model_loader: - type bf16: 107 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 75.65 GiB (9.21 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama-3.3-70B-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: 
EOM token = 128008 '<|eom_id|>' -print_info: PAD token = 128004 '<|finetune_right_pad_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 75456.53 MiB -load_tensors: ROCm_Host model buffer size = 2004.00 MiB -................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias 
= -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3478849877 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello H - -llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 53571.43 tokens per second) -llama_perf_context_print: load time = 32005.62 ms -llama_perf_context_print: prompt eval time = 456.36 ms / 2 tokens ( 228.18 ms per token, 4.38 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 471.29 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 33.222127697s - Run #3 status: 0 - → Avg over 3 runs: 32.796s diff --git a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log deleted file mode 100644 index f6dd5ab..0000000 --- 
a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,163 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: llama.block_count u32 = 80 -llama_model_loader: - kv 9: llama.context_length u32 = 131072 -llama_model_loader: - kv 10: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 12: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 15: 
llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u... -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins... 
-llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 724 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q8_0: 455 tensors -llama_model_loader: - type bf16: 107 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 75.65 GiB (9.21 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama-3.3-70B-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: 
EOM token = 128008 '<|eom_id|>' -print_info: PAD token = 128004 '<|finetune_right_pad_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: ROCm0 model buffer size = 75456.53 MiB -load_tensors: ROCm_Host model buffer size = 2004.00 MiB -................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias 
= -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4130863841 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello: - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 44117.65 tokens per second) -llama_perf_context_print: load time = 32184.35 ms -llama_perf_context_print: prompt eval time = 697.57 ms / 2 tokens ( 348.79 ms per token, 2.87 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 712.61 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 33.659541277s - Run #3 status: 0 - → Avg over 3 runs: 32.911s diff --git a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index 3a9005c..0000000 --- 
a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,161 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: llama.block_count u32 = 80 -llama_model_loader: - kv 9: llama.context_length u32 = 131072 -llama_model_loader: - kv 10: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 12: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 15: 
llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u... -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins... 
-llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 724 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q8_0: 455 tensors -llama_model_loader: - type bf16: 107 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 75.65 GiB (9.21 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama-3.3-70B-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: 
EOM token = 128008 '<|eom_id|>' -print_info: PAD token = 128004 '<|finetune_right_pad_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: Vulkan0 model buffer size = 75456.53 MiB -load_tensors: Vulkan_Host model buffer size = 2004.00 MiB -................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 266.50 MiB -llama_context: Vulkan_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added 
<|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 327404797 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 50847.46 tokens per second) -llama_perf_context_print: load time = 26953.87 ms -llama_perf_context_print: prompt eval time = 387.45 ms / 2 tokens ( 193.72 ms per token, 5.16 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 404.05 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 28.173844492s - Run #3 status: 0 - → Avg over 3 runs: 30.604s diff --git a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log deleted file mode 100644 index c33e52c..0000000 --- 
a/benchmark/loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,161 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: llama.block_count u32 = 80 -llama_model_loader: - kv 9: llama.context_length u32 = 131072 -llama_model_loader: - kv 10: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 12: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000 
-llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 16: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 17: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004 -llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: general.file_type u32 = 7 -llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u... -llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins... 
-llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560 -llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689 -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.tensors.count i32 = 724 -llama_model_loader: - kv 38: split.count u16 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q8_0: 455 tensors -llama_model_loader: - type bf16: 107 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 75.65 GiB (9.21 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama-3.3-70B-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: 
EOM token = 128008 '<|eom_id|>' -print_info: PAD token = 128004 '<|finetune_right_pad_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: Vulkan0 model buffer size = 75456.53 MiB -load_tensors: Vulkan_Host model buffer size = 2004.00 MiB -................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 266.50 MiB -llama_context: Vulkan_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added 
<|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 2154218339 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello’s - -llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 51724.14 tokens per second) -llama_perf_context_print: load time = 29443.29 ms -llama_perf_context_print: prompt eval time = 376.13 ms / 2 tokens ( 188.07 ms per token, 5.32 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 392.17 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 30.227365941s - Run #3 status: 0 - → Avg over 3 runs: 30.376s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 87d6d92..0000000 --- 
a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... 
-llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 18 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q6_K: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q6_K -print_info: file size = 82.35 GiB (6.56 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: 
f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 809.29 MiB -load_tensors: ROCm0 model buffer size = 83513.68 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1642319140 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 42857.14 tokens per second) -llama_perf_context_print: load time = 26639.60 ms -llama_perf_context_print: prompt eval time = 107.52 ms / 2 tokens ( 53.76 ms per token, 18.60 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 127.12 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 30.905590182s - Run #3 status: 0 - → Avg over 3 runs: 31.792s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta.log deleted file mode 100644 index b3a421c..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... 
-llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 18 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q6_K: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q6_K -print_info: file size = 82.35 GiB (6.56 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: 
f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 809.29 MiB -load_tensors: ROCm0 model buffer size = 83513.68 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1329865451 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello1 - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 44776.12 tokens per second) -llama_perf_context_print: load time = 27337.52 ms -llama_perf_context_print: prompt eval time = 135.84 ms / 2 tokens ( 67.92 ms per token, 14.72 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 155.35 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 28.220065203s - Run #3 status: 0 - → Avg over 3 runs: 28.221s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc.log deleted file mode 100644 index 84d5fa3..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 18 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q6_K: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q6_K -print_info: file size = 82.35 GiB (6.56 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 809.29 MiB -load_tensors: ROCm0 model buffer size = 83513.68 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3194189125 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello: - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second) -llama_perf_context_print: load time = 26424.61 ms -llama_perf_context_print: prompt eval time = 106.73 ms / 2 tokens ( 53.37 ms per token, 18.74 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 126.53 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 27.353142250s - Run #3 status: 0 - → Avg over 3 runs: 28.435s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index da4b832..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,177 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: 
-ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... 
-llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 18 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q6_K: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q6_K -print_info: file size = 82.35 GiB (6.56 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: 
f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 83513.68 MiB -load_tensors: CPU model buffer size = 809.29 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4111748233 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello: - -llama_perf_sampler_print: sampling time = 0.15 ms / 3 runs ( 0.05 ms per token, 20134.23 tokens per second) -llama_perf_context_print: load time = 31375.27 ms -llama_perf_context_print: prompt eval time = 267.76 ms / 2 tokens ( 133.88 ms per token, 7.47 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 295.92 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 33.122388042s - Run #3 status: 0 - → Avg over 3 runs: 35.541s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv.log deleted file mode 100644 index acb490f..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,177 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) 
(radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 18 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q6_K: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q6_K -print_info: file size = 82.35 GiB (6.56 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 83513.68 MiB -load_tensors: CPU model buffer size = 809.29 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.02 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1422642604 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello1 - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second) -llama_perf_context_print: load time = 32072.23 ms -llama_perf_context_print: prompt eval time = 296.78 ms / 2 tokens ( 148.39 ms per token, 6.74 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 324.57 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 32.859879045s - Run #3 status: 0 - → Avg over 3 runs: 32.810s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index eaf30ee..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: 
found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 7 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 3 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q8_0: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 106.65 GiB (8.50 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 108165.12 MiB -load_tensors: ROCm_Host model buffer size = 1048.22 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 434.62 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 2885096603 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello. 
- -llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 46875.00 tokens per second) -llama_perf_context_print: load time = 36882.65 ms -llama_perf_context_print: prompt eval time = 127.76 ms / 2 tokens ( 63.88 ms per token, 15.65 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 158.41 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 41.426125320s - Run #3 status: 0 - → Avg over 3 runs: 40.739s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta.log deleted file mode 100644 index 3675c18..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 7 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 3 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q8_0: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 106.65 GiB (8.50 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 108165.12 MiB -load_tensors: ROCm_Host model buffer size = 1048.22 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 434.62 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1149431120 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello: - -llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 48387.10 tokens per second) -llama_perf_context_print: load time = 35959.68 ms -llama_perf_context_print: prompt eval time = 127.62 ms / 2 tokens ( 63.81 ms per token, 15.67 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 157.80 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 36.919182117s - Run #3 status: 0 - → Avg over 3 runs: 36.400s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc.log deleted file mode 100644 index 4673a8a..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 7 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 3 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q8_0: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 106.65 GiB (8.50 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 108165.12 MiB -load_tensors: ROCm_Host model buffer size = 1048.22 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 434.62 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 1 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 406280533 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello The - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 45454.55 tokens per second) -llama_perf_context_print: load time = 34222.03 ms -llama_perf_context_print: prompt eval time = 136.79 ms / 2 tokens ( 68.40 ms per token, 14.62 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 156.58 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 35.217307205s - Run #3 status: 0 - → Avg over 3 runs: 35.742s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk.log deleted file mode 100644 index ec3aa5e..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk.log +++ /dev/null @@ -1,177 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: 
-ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... 
-llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 7 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 3 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q8_0: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 106.65 GiB (8.50 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: 
f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 108165.12 MiB -load_tensors: Vulkan_Host model buffer size = 1048.22 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3690416473 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second) -llama_perf_context_print: load time = 41237.01 ms -llama_perf_context_print: prompt eval time = 233.96 ms / 2 tokens ( 116.98 ms per token, 8.55 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 261.97 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 45.548750208s - Run #3 status: 0 - → Avg over 3 runs: 47.967s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv.log deleted file mode 100644 index 48132f0..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv.log +++ /dev/null @@ -1,177 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) 
(radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 7 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 3 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q8_0: 482 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 106.65 GiB (8.50 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: 
rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 108165.12 MiB -load_tensors: Vulkan_Host model buffer size = 1048.22 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.02 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4068031204 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second) -llama_perf_context_print: load time = 41299.30 ms -llama_perf_context_print: prompt eval time = 252.99 ms / 2 tokens ( 126.49 ms per token, 7.91 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 280.67 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 42.081911936s - Run #3 status: 0 - → Avg over 3 runs: 41.626s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 73fb564..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,181 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
-ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 15 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q4_K: 421 tensors -llama_model_loader: - type q5_K: 43 tensors -llama_model_loader: - type q6_K: 18 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 57.73 GiB (4.60 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 
500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 554.94 MiB -load_tensors: ROCm0 model buffer size = 58558.57 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4182963810 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello The - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second) -llama_perf_context_print: load time = 9663.18 ms -llama_perf_context_print: prompt eval time = 90.98 ms / 2 tokens ( 45.49 ms per token, 21.98 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 110.40 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 13.853856771s - Run #3 status: 0 - → Avg over 3 runs: 15.776s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log deleted file mode 100644 index 1fb554b..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,162 +0,0 @@ -ggml_cuda_init: 
GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... 
-llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 15 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q4_K: 421 tensors -llama_model_loader: - type q5_K: 43 tensors -llama_model_loader: - type q6_K: 18 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 57.73 GiB (4.60 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 
-print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 554.94 MiB -load_tensors: ROCm0 model buffer size = 58558.57 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -HW Exception by GPU node-1 (Agent handle: 0x48fa1f0) reason :GPU Hang - Elapsed #3: 22.180402418s - Run #3 status: 134 - ✖ run #3 failed - → No successful runs diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log deleted file mode 100644 index 9ffcb33..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,174 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 15 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q4_K: 421 tensors -llama_model_loader: - type q5_K: 43 tensors -llama_model_loader: - type q6_K: 18 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 57.73 GiB (4.60 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 
500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: CPU model buffer size = 554.94 MiB -load_tensors: ROCm0 model buffer size = 58558.57 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 442.62 MiB -llama_context: ROCm_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 722371466 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello Elapsed #3: 22.602610057s - Run #3 status: 134 - ✖ run #3 failed - → Avg over 2 runs: 19.365s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index 5fdc5b4..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free 
-llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... 
-llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 15 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q4_K: 421 tensors -llama_model_loader: - type q5_K: 43 tensors -llama_model_loader: - type q6_K: 18 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 57.73 GiB (4.60 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 
500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 58558.57 MiB -load_tensors: CPU model buffer size = 554.94 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.01 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 83044290 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.16 ms / 3 runs ( 0.05 ms per token, 18518.52 tokens per second) -llama_perf_context_print: load time = 13560.35 ms -llama_perf_context_print: prompt eval time = 257.61 ms / 2 tokens ( 128.81 ms per token, 7.76 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 285.54 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 14.548378284s - Run #3 status: 0 - → Avg over 3 runs: 16.752s diff --git a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log deleted file mode 100644 index 403f25b..0000000 --- a/benchmark/loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,179 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics 
(RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama4 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 3: general.finetune str = 16E-Instruct -llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 17B -llama_model_loader: - kv 7: general.license str = other -llama_model_loader: - kv 8: general.license.name str = llama4 -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... 
-llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ... -llama_model_loader: - kv 16: llama4.block_count u32 = 48 -llama_model_loader: - kv 17: llama4.context_length u32 = 10485760 -llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120 -llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384 -llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40 -llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 24: llama4.expert_count u32 = 16 -llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1 -llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128 -llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128 -llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048 -llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128 -llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1 -llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192 -llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4 -llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ... -llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000 -llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008 -llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018 -llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... -llama_model_loader: - kv 42: general.quantization_version u32 = 2 -llama_model_loader: - kv 43: general.file_type u32 = 15 -llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i... -llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B... -llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528 -llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729 -llama_model_loader: - kv 48: split.no u16 = 0 -llama_model_loader: - kv 49: split.tensors.count i32 = 628 -llama_model_loader: - kv 50: split.count u16 = 2 -llama_model_loader: - type f32: 146 tensors -llama_model_loader: - type q4_K: 421 tensors -llama_model_loader: - type q5_K: 43 tensors -llama_model_loader: - type q6_K: 18 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 57.73 GiB (4.60 BPW) -load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect -load: special tokens cache size = 1135 -load: token to piece cache size = 1.3873 MB -print_info: arch = llama4 -print_info: vocab_only = 0 -print_info: n_ctx_train = 10485760 -print_info: n_embd = 5120 -print_info: n_layer = 48 -print_info: n_head = 40 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 8192 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 5 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 
-print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 16384 -print_info: n_expert = 16 -print_info: n_expert_used = 1 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 10485760 -print_info: rope_finetuned = unknown -print_info: model type = 17Bx16E (Scout) -print_info: model params = 107.77 B -print_info: general.name = Llama-4-Scout-17B-16E-Instruct -print_info: vocab type = BPE -print_info: n_vocab = 202048 -print_info: n_merges = 439802 -print_info: BOS token = 200000 '<|begin_of_text|>' -print_info: EOS token = 200008 '<|eot|>' -print_info: PAD token = 200018 '<|finetune_right_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 200002 '<|fim_prefix|>' -print_info: FIM SUF token = 200004 '<|fim_suffix|>' -print_info: FIM MID token = 200003 '<|fim_middle|>' -print_info: EOG token = 200001 '<|end_of_text|>' -print_info: EOG token = 200008 '<|eot|>' -print_info: max token length = 192 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 58558.57 MiB -load_tensors: CPU model buffer size = 554.94 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.77 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB -llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB -llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 440.63 MiB -llama_context: Vulkan_Host compute buffer size = 26.02 MiB -llama_context: graph nodes = 2420 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eot|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 2510811977 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello ( - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32608.70 tokens per second) -llama_perf_context_print: load time = 16387.21 ms -llama_perf_context_print: prompt eval time = 291.47 ms / 2 tokens ( 145.73 ms per token, 6.86 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 319.42 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 17.154124582s - Run #3 status: 0 - → Avg over 3 runs: 20.045s diff --git a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log b/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index 5a96dc9..0000000 --- a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,184 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
-ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 3: general.version str = 2507 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 6: general.quantized_by str = Unsloth -llama_model_loader: - kv 7: general.size_label str = 235B-A22B -llama_model_loader: - kv 8: general.license str = apache-2.0 -llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235... -llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 11: general.base_model.count u32 = 1 -llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507 -llama_model_loader: - kv 13: general.base_model.0.version str = 2507 -llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235... 
-llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94 -llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096 -llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288 -llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64 -llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000 -llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536 -llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 39: general.quantization_version u32 = 2 -llama_model_loader: - kv 40: general.file_type u32 = 12 -llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im... 
-llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I... -llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745 -llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693 -llama_model_loader: - kv 45: split.no u16 = 0 -llama_model_loader: - kv 46: split.tensors.count i32 = 1131 -llama_model_loader: - kv 47: split.count u16 = 3 -llama_model_loader: - type f32: 471 tensors -llama_model_loader: - type q3_K: 267 tensors -llama_model_loader: - type q4_K: 362 tensors -llama_model_loader: - type q5_K: 20 tensors -llama_model_loader: - type q6_K: 11 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q3_K - Medium -print_info: file size = 96.99 GiB (3.54 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 4096 -print_info: n_layer = 94 -print_info: n_head = 64 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 12288 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 5000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 235B.A22B -print_info: model params = 235.09 B -print_info: general.name = Qwen3-235B-A22B-Instruct-2507 -print_info: n_ff_exp = 
1536 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 94 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 95/95 layers to GPU -load_tensors: CPU model buffer size = 333.84 MiB -load_tensors: ROCm0 model buffer size = 98988.40 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 5000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB -llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 304.75 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 6023 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable)
-main: llama threadpool init, n_threads = 16
-
-system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
-
-sampler seed: 4068503868
-sampler params:
- repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
- dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
- top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
- mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
-
-Hello,
-
-llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second)
-llama_perf_context_print: load time = 34531.90 ms
-llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
-llama_perf_context_print: eval time = 74.04 ms / 1 runs ( 74.04 ms per token, 13.51 tokens per second)
-llama_perf_context_print: total time = 87.46 ms / 2 tokens
-llama_perf_context_print: graphs reused = 0
- Elapsed #3: 38.606270419s
- Run #3 status: 0
- → Avg over 3 runs: 39.062s
diff --git a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta.log b/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta.log
deleted file mode 100644
index a59adde..0000000
--- a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta.log
+++ /dev/null
@@ -1,184 +0,0 @@
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ:
no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 3: general.version str = 2507 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 6: general.quantized_by str = Unsloth -llama_model_loader: - kv 7: general.size_label str = 235B-A22B -llama_model_loader: - kv 8: general.license str = apache-2.0 -llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235... -llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 11: general.base_model.count u32 = 1 -llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507 -llama_model_loader: - kv 13: general.base_model.0.version str = 2507 -llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235... 
-llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94 -llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096 -llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288 -llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64 -llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000 -llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536 -llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 39: general.quantization_version u32 = 2 -llama_model_loader: - kv 40: general.file_type u32 = 12 -llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im... 
-llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I... -llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745 -llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693 -llama_model_loader: - kv 45: split.no u16 = 0 -llama_model_loader: - kv 46: split.tensors.count i32 = 1131 -llama_model_loader: - kv 47: split.count u16 = 3 -llama_model_loader: - type f32: 471 tensors -llama_model_loader: - type q3_K: 267 tensors -llama_model_loader: - type q4_K: 362 tensors -llama_model_loader: - type q5_K: 20 tensors -llama_model_loader: - type q6_K: 11 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q3_K - Medium -print_info: file size = 96.99 GiB (3.54 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 4096 -print_info: n_layer = 94 -print_info: n_head = 64 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 12288 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 5000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 235B.A22B -print_info: model params = 235.09 B -print_info: general.name = Qwen3-235B-A22B-Instruct-2507 -print_info: n_ff_exp = 
1536 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 94 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 95/95 layers to GPU -load_tensors: CPU model buffer size = 333.84 MiB -load_tensors: ROCm0 model buffer size = 98988.40 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 5000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB -llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 304.75 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 6023 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 698255200 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello! 
-
-llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second)
-llama_perf_context_print: load time = 34496.41 ms
-llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
-llama_perf_context_print: eval time = 74.48 ms / 1 runs ( 74.48 ms per token, 13.43 tokens per second)
-llama_perf_context_print: total time = 87.80 ms / 2 tokens
-llama_perf_context_print: graphs reused = 0
- Elapsed #3: 35.247053632s
- Run #3 status: 0
- → Avg over 3 runs: 35.392s
diff --git a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc.log b/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc.log
deleted file mode 100644
index 53a04cc..0000000
--- a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc.log
+++ /dev/null
@@ -1,184 +0,0 @@
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 1 ROCm devices:
- Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
-build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
-main: llama backend init
-main: load the model and apply lora adapter, if any
-llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
-llama_model_loader: additional 2 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 3: general.version str = 2507 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 6: general.quantized_by str = Unsloth -llama_model_loader: - kv 7: general.size_label str = 235B-A22B -llama_model_loader: - kv 8: general.license str = apache-2.0 -llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235... -llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 11: general.base_model.count u32 = 1 -llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507 -llama_model_loader: - kv 13: general.base_model.0.version str = 2507 -llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235... 
-llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94 -llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096 -llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288 -llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64 -llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000 -llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536 -llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 39: general.quantization_version u32 = 2 -llama_model_loader: - kv 40: general.file_type u32 = 12 -llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im... 
-llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I... -llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745 -llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693 -llama_model_loader: - kv 45: split.no u16 = 0 -llama_model_loader: - kv 46: split.tensors.count i32 = 1131 -llama_model_loader: - kv 47: split.count u16 = 3 -llama_model_loader: - type f32: 471 tensors -llama_model_loader: - type q3_K: 267 tensors -llama_model_loader: - type q4_K: 362 tensors -llama_model_loader: - type q5_K: 20 tensors -llama_model_loader: - type q6_K: 11 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q3_K - Medium -print_info: file size = 96.99 GiB (3.54 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 4096 -print_info: n_layer = 94 -print_info: n_head = 64 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 12288 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 5000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 235B.A22B -print_info: model params = 235.09 B -print_info: general.name = Qwen3-235B-A22B-Instruct-2507 -print_info: n_ff_exp = 
1536 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 94 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 95/95 layers to GPU -load_tensors: CPU model buffer size = 333.84 MiB -load_tensors: ROCm0 model buffer size = 98988.40 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 5000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB -llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 304.75 MiB -llama_context: ROCm_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 6023 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable)
-main: llama threadpool init, n_threads = 16
-
-system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
-
-sampler seed: 715670654
-sampler params:
- repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
- dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
- top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
- mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
-sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
-generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
-
-Hello,
-
-llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 34482.76 tokens per second)
-llama_perf_context_print: load time = 31968.90 ms
-llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
-llama_perf_context_print: eval time = 73.79 ms / 1 runs ( 73.79 ms per token, 13.55 tokens per second)
-llama_perf_context_print: total time = 87.27 ms / 2 tokens
-llama_perf_context_print: graphs reused = 0
- Elapsed #3: 32.781452355s
- Run #3 status: 0
- → Avg over 3 runs: 33.458s
diff --git a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk.log b/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk.log
deleted file mode 100644
index 6d7f34b..0000000
--- a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk.log
+++ /dev/null
@@ -1,182 +0,0 @@
-ggml_vulkan: Found 1 Vulkan
devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 3: general.version str = 2507 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 6: general.quantized_by str = Unsloth -llama_model_loader: - kv 7: general.size_label str = 235B-A22B -llama_model_loader: - kv 8: general.license str = apache-2.0 -llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235... 
-llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 11: general.base_model.count u32 = 1 -llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507 -llama_model_loader: - kv 13: general.base_model.0.version str = 2507 -llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235... -llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94 -llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096 -llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288 -llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64 -llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000 -llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536 -llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
-llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 39: general.quantization_version u32 = 2 -llama_model_loader: - kv 40: general.file_type u32 = 12 -llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im... -llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I... -llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745 -llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693 -llama_model_loader: - kv 45: split.no u16 = 0 -llama_model_loader: - kv 46: split.tensors.count i32 = 1131 -llama_model_loader: - kv 47: split.count u16 = 3 -llama_model_loader: - type f32: 471 tensors -llama_model_loader: - type q3_K: 267 tensors -llama_model_loader: - type q4_K: 362 tensors -llama_model_loader: - type q5_K: 20 tensors -llama_model_loader: - type q6_K: 11 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q3_K - Medium -print_info: file size = 96.99 GiB (3.54 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 4096 -print_info: n_layer = 94 -print_info: n_head = 64 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 
-print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 12288 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 5000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 235B.A22B -print_info: model params = 235.09 B -print_info: general.name = Qwen3-235B-A22B-Instruct-2507 -print_info: n_ff_exp = 1536 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 94 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 95/95 layers to GPU -load_tensors: Vulkan0 model buffer size = 98988.40 MiB -load_tensors: CPU model buffer size = 333.84 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 5000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 752.00 MiB -llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 6023 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4076614647 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello, - -llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 28571.43 tokens per second) -llama_perf_context_print: load time = 40072.88 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 67.40 ms / 1 runs ( 67.40 ms per token, 14.84 tokens per second) -llama_perf_context_print: total time = 86.12 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 43.569299668s - Run #3 status: 0 - → Avg over 3 runs: 44.883s diff --git a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv.log b/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv.log deleted file mode 100644 index e2045f0..0000000 --- a/benchmark/loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv.log +++ /dev/null @@ -1,182 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV 
GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 2 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 3: general.version str = 2507 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507 -llama_model_loader: - kv 6: general.quantized_by str = Unsloth -llama_model_loader: - kv 7: general.size_label str = 235B-A22B -llama_model_loader: - kv 8: general.license str = apache-2.0 -llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235... -llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 11: general.base_model.count u32 = 1 -llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507 -llama_model_loader: - kv 13: general.base_model.0.version str = 2507 -llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235... 
-llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94 -llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096 -llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288 -llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64 -llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000 -llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536 -llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 39: general.quantization_version u32 = 2 -llama_model_loader: - kv 40: general.file_type u32 = 12 -llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im... 
-llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I... -llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745 -llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693 -llama_model_loader: - kv 45: split.no u16 = 0 -llama_model_loader: - kv 46: split.tensors.count i32 = 1131 -llama_model_loader: - kv 47: split.count u16 = 3 -llama_model_loader: - type f32: 471 tensors -llama_model_loader: - type q3_K: 267 tensors -llama_model_loader: - type q4_K: 362 tensors -llama_model_loader: - type q5_K: 20 tensors -llama_model_loader: - type q6_K: 11 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q3_K - Medium -print_info: file size = 96.99 GiB (3.54 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 4096 -print_info: n_layer = 94 -print_info: n_head = 64 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 12288 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 5000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 235B.A22B -print_info: model params = 235.09 B -print_info: general.name = Qwen3-235B-A22B-Instruct-2507 -print_info: n_ff_exp = 
1536 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 94 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 95/95 layers to GPU -load_tensors: Vulkan0 model buffer size = 98988.40 MiB -load_tensors: CPU model buffer size = 333.84 MiB -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 5000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 752.00 MiB -llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 16.01 MiB -llama_context: graph nodes = 6023 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1959920459 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello, - -llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 25641.03 tokens per second) -llama_perf_context_print: load time = 40114.24 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 67.08 ms / 1 runs ( 67.08 ms per token, 14.91 tokens per second) -llama_perf_context_print: total time = 86.46 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 40.621909942s - Run #3 status: 0 - → Avg over 3 runs: 40.722s diff --git a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 58caccd..0000000 --- a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,167 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S 
Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B -llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 30B-A3B -llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 -llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 -llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 18: general.file_type u32 = 32 -llama_model_loader: - kv 19: qwen3moe.expert_count 
u32 = 128 -llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 21: general.quantization_version u32 = 2 -llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 31: split.no u16 = 0 -llama_model_loader: - kv 32: split.count u16 = 2 -llama_model_loader: - kv 33: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 40960 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: 
f_attn_scale = 0.0e+00 -print_info: n_ff = 6144 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 40960 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-30B-A3B -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1093628111 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello - - -llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 34482.76 tokens per second) -llama_perf_context_print: load time = 19374.51 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 42.85 ms / 1 runs ( 42.85 ms per token, 23.34 tokens per second) -llama_perf_context_print: total time = 73.04 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 23.364750813s - Run #3 status: 0 - → Avg over 3 runs: 22.166s diff --git a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta.log deleted file mode 100644 index 71d5dad..0000000 --- a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,167 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: 
found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B -llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 30B-A3B -llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 -llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 -llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 18: general.file_type u32 = 32 
-llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 21: general.quantization_version u32 = 2 -llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 31: split.no u16 = 0 -llama_model_loader: - kv 32: split.count u16 = 2 -llama_model_loader: - kv 33: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 40960 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 
-print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 6144 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 40960 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-30B-A3B -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3515911169 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello * - -llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second) -llama_perf_context_print: load time = 12423.68 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 43.15 ms / 1 runs ( 43.15 ms per token, 23.18 tokens per second) -llama_perf_context_print: total time = 62.68 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 13.032265401s - Run #3 status: 0 - → Avg over 3 runs: 15.930s diff --git a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc.log deleted file mode 100644 index 7d9b984..0000000 --- a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,167 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 
ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B -llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 30B-A3B -llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 -llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 -llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 18: general.file_type u32 = 32 
-llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 21: general.quantization_version u32 = 2 -llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 31: split.no u16 = 0 -llama_model_loader: - kv 32: split.count u16 = 2 -llama_model_loader: - kv 33: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 40960 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 
-print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 6144 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 40960 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-30B-A3B -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4057380724 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello this - -llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second) -llama_perf_context_print: load time = 21106.31 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 43.24 ms / 1 runs ( 43.24 ms per token, 23.13 tokens per second) -llama_perf_context_print: total time = 62.41 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 21.852416396s - Run #3 status: 0 - → Avg over 3 runs: 22.669s diff --git a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index 1a2f40e..0000000 --- a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,165 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD 
open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B -llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 30B-A3B -llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 -llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 -llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 18: 
general.file_type u32 = 32 -llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 21: general.quantization_version u32 = 2 -llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
-llama_model_loader: - kv 31: split.no u16 = 0 -llama_model_loader: - kv 32: split.count u16 = 2 -llama_model_loader: - kv 33: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 40960 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 6144 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 40960 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-30B-A3B -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' 
-print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 57666.30 MiB -load_tensors: Vulkan_Host model buffer size = 593.50 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 12.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph 
splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 157667903 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello and - -llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 24390.24 tokens per second) -llama_perf_context_print: load time = 10008.37 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 128.73 ms / 1 runs ( 128.73 ms per token, 7.77 tokens per second) -llama_perf_context_print: total time = 155.88 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 10.759732568s - Run #3 status: 0 - → Avg over 3 runs: 12.935s 
diff --git a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv.log deleted file mode 100644 index 4de7e68..0000000 --- a/benchmark/loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,165 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B -llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B -llama_model_loader: - kv 4: general.quantized_by str = Unsloth -llama_model_loader: - kv 5: general.size_label str = 30B-A3B -llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960 -llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144 -llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 18: general.file_type u32 = 32 -llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 21: general.quantization_version u32 = 2 -llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
-llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... -llama_model_loader: - kv 31: split.no u16 = 0 -llama_model_loader: - kv 32: split.count u16 = 2 -llama_model_loader: - kv 33: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 40960 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 6144 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 40960 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-30B-A3B -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 
-print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 57666.30 MiB -load_tensors: Vulkan_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 12.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1118253234 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello - - -llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 25316.46 tokens per second) -llama_perf_context_print: load time = 12501.96 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 137.49 ms / 1 runs ( 137.49 ms per token, 7.27 tokens per second) -llama_perf_context_print: total time = 164.69 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 13.022605949s - Run #3 status: 0 - → Avg over 3 runs: 14.761s diff --git a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 4375988..0000000 --- a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,176 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 
1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 30B-A3B -llama_model_loader: - kv 7: general.license str = apache-2.0 -llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod... -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod... 
-llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472 -llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000 -llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 26: general.file_type u32 = 32 -llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0 -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... 
-llama_model_loader: - kv 40: split.no u16 = 0 -llama_model_loader: - kv 41: split.count u16 = 2 -llama_model_loader: - kv 42: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 5472 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 10000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-Coder-30B-A3B-Instruct -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 
'<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 10000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: 
graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3288748167 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello: - -llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 38461.54 tokens per second) -llama_perf_context_print: load time = 12175.61 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 42.43 ms / 1 runs ( 42.43 ms per token, 23.57 tokens per second) -llama_perf_context_print: total time = 81.77 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 
16.099845533s - Run #3 status: 0 - → Avg over 3 runs: 17.779s diff --git a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta.log deleted file mode 100644 index c80143b..0000000 --- a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,176 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 30B-A3B -llama_model_loader: - kv 7: general.license str = apache-2.0 -llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod... 
-llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod... -llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472 -llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000 -llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 26: general.file_type u32 = 32 -llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0 -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
-llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... -llama_model_loader: - kv 40: split.no u16 = 0 -llama_model_loader: - kv 41: split.count u16 = 2 -llama_model_loader: - kv 42: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 5472 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 10000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = 
Qwen3-Coder-30B-A3B-Instruct -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 10000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3173540432 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello: - -llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second) -llama_perf_context_print: load time = 11733.11 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 42.68 ms / 1 runs ( 42.68 ms per token, 23.43 tokens per second) -llama_perf_context_print: total time = 82.14 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 12.376138939s - Run #3 status: 0 - → Avg over 3 runs: 14.392s diff --git a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc.log deleted file mode 100644 index 729954c..0000000 --- a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,176 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 30B-A3B -llama_model_loader: - kv 7: general.license str = apache-2.0 -llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod... -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod... 
-llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472 -llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000 -llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 26: general.file_type u32 = 32 -llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0 -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... 
-llama_model_loader: - kv 40: split.no u16 = 0 -llama_model_loader: - kv 41: split.count u16 = 2 -llama_model_loader: - kv 42: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 5472 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 10000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-Coder-30B-A3B-Instruct -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 
'<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 57666.30 MiB -load_tensors: ROCm_Host model buffer size = 593.50 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 10000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 300.75 MiB -llama_context: ROCm_Host compute buffer size = 8.01 MiB -llama_context: graph nodes = 3079 -llama_context: 
graph splits = 1 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1388157865 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello: - -llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 36363.64 tokens per second) -llama_perf_context_print: load time = 11788.33 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 43.56 ms / 1 runs ( 43.56 ms per token, 22.95 tokens per second) -llama_perf_context_print: total time = 82.77 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 
12.528214562s - Run #3 status: 0 - → Avg over 3 runs: 16.161s diff --git a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk.log b/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index a26f8f8..0000000 --- a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,174 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 30B-A3B -llama_model_loader: - kv 7: general.license str = apache-2.0 -llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod... 
-llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod... -llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472 -llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000 -llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 26: general.file_type u32 = 32 -llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0 -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
-llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... -llama_model_loader: - kv 40: split.no u16 = 0 -llama_model_loader: - kv 41: split.count u16 = 2 -llama_model_loader: - kv 42: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 5472 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 10000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = 
Qwen3-Coder-30B-A3B-Instruct -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 '<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 57666.30 MiB -load_tensors: Vulkan_Host model buffer size = 593.50 MiB -................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 10000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 12.01 MiB -llama_context: graph nodes = 3079 -llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 243266880 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello: - -llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 26315.79 tokens per second) -llama_perf_context_print: load time = 9973.02 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 130.78 ms / 1 runs ( 130.78 ms per token, 7.65 tokens per second) -llama_perf_context_print: total time = 185.17 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 10.756452016s - Run #3 status: 0 - → Avg over 3 runs: 12.940s diff --git a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv.log deleted file mode 100644 index ef76488..0000000 --- a/benchmark/loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,174 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | 
uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = qwen3moe -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 3: general.finetune str = Instruct -llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 30B-A3B -llama_model_loader: - kv 7: general.license str = apache-2.0 -llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod... -llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 10: general.base_model.count u32 = 1 -llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct -llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen -llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod... 
-llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"] -llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48 -llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144 -llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048 -llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472 -llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32 -llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4 -llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000 -llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8 -llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128 -llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128 -llama_model_loader: - kv 26: general.file_type u32 = 32 -llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128 -llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768 -llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0 -llama_model_loader: - kv 30: general.quantization_version u32 = 2 -llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2 -llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... -llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645 -llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654 -llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false -llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... 
-llama_model_loader: - kv 40: split.no u16 = 0 -llama_model_loader: - kv 41: split.count u16 = 2 -llama_model_loader: - kv 42: split.tensors.count i32 = 579 -llama_model_loader: - type f32: 241 tensors -llama_model_loader: - type bf16: 338 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 56.89 GiB (16.01 BPW) -load: special tokens cache size = 26 -load: token to piece cache size = 0.9311 MB -print_info: arch = qwen3moe -print_info: vocab_only = 0 -print_info: n_ctx_train = 262144 -print_info: n_embd = 2048 -print_info: n_layer = 48 -print_info: n_head = 32 -print_info: n_head_kv = 4 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 512 -print_info: n_embd_v_gqa = 512 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 5472 -print_info: n_expert = 128 -print_info: n_expert_used = 8 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 10000000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 262144 -print_info: rope_finetuned = unknown -print_info: model type = 30B.A3B -print_info: model params = 30.53 B -print_info: general.name = Qwen3-Coder-30B-A3B-Instruct -print_info: n_ff_exp = 768 -print_info: vocab type = BPE -print_info: n_vocab = 151936 -print_info: n_merges = 151387 -print_info: BOS token = 11 ',' -print_info: EOS token = 151645 '<|im_end|>' -print_info: EOT token = 151645 '<|im_end|>' -print_info: PAD token = 151654 '<|vision_pad|>' -print_info: LF token = 198 'Ċ' -print_info: FIM PRE token = 151659 '<|fim_prefix|>' -print_info: FIM SUF token = 151661 
'<|fim_suffix|>' -print_info: FIM MID token = 151660 '<|fim_middle|>' -print_info: FIM PAD token = 151662 '<|fim_pad|>' -print_info: FIM REP token = 151663 '<|repo_name|>' -print_info: FIM SEP token = 151664 '<|file_sep|>' -print_info: EOG token = 151643 '<|endoftext|>' -print_info: EOG token = 151645 '<|im_end|>' -print_info: EOG token = 151662 '<|fim_pad|>' -print_info: EOG token = 151663 '<|repo_name|>' -print_info: EOG token = 151664 '<|file_sep|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 57666.30 MiB -load_tensors: Vulkan_Host model buffer size = 593.50 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 10000000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.58 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB -llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 304.75 MiB -llama_context: Vulkan_Host compute buffer size = 12.01 MiB -llama_context: graph nodes = 3079 
-llama_context: graph splits = 2 -common_init_from_params: added <|endoftext|> logit bias = -inf -common_init_from_params: added <|im_end|> logit bias = -inf -common_init_from_params: added <|fim_pad|> logit bias = -inf -common_init_from_params: added <|repo_name|> logit bias = -inf -common_init_from_params: added <|file_sep|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 2350977163 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0 - -Hello: - -llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 27027.03 tokens per second) -llama_perf_context_print: load time = 13008.56 ms -llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: eval time = 140.05 ms / 1 runs ( 140.05 ms per token, 7.14 tokens per second) -llama_perf_context_print: total time = 194.09 ms / 2 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 13.570267879s - Run #3 status: 0 - → Avg 
over 3 runs: 14.021s diff --git a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log b/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log deleted file mode 100644 index c653b8c..0000000 --- a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log +++ /dev/null @@ -1,165 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 12B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840 -llama_model_loader: - kv 10: gemma3.block_count u32 = 48 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256 -llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 21: tokenizer.ggml.model str = llama -llama_model_loader: - kv 22: tokenizer.ggml.pre str = default -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... 
-llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... -llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 34: general.quantization_version u32 = 2 -llama_model_loader: - kv 35: general.file_type u32 = 7 -llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt -llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336 -llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 -llama_model_loader: - type f32: 289 tensors -llama_model_loader: - type q8_0: 311 tensors -llama_model_loader: - type bf16: 26 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 13.40 GiB (9.78 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 3840 -print_info: n_layer = 48 -print_info: n_head = 16 -print_info: n_head_kv = 8 -print_info: n_rot = 256 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 256 -print_info: n_embd_head_v = 256 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale 
= 0.0e+00 -print_info: f_attn_scale = 6.2e-02 -print_info: n_ff = 15360 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 12B -print_info: model params = 11.77 B -print_info: general.name = Gemma-3-12B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 13721.20 MiB -load_tensors: ROCm_Host model buffer size = 1920.47 MiB -............................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB -llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB -llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 519.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2025 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3471752321 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello** - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second) -llama_perf_context_print: load time = 2510.88 ms -llama_perf_context_print: prompt eval time = 74.99 ms / 2 tokens ( 37.49 ms per token, 26.67 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 79.74 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 6.594391168s - Run #3 status: 0 - → Avg over 3 runs: 6.686s diff --git a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta.log b/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta.log deleted file mode 100644 index c06575b..0000000 --- a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta.log +++ /dev/null @@ -1,165 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - 
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 12B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840 -llama_model_loader: - kv 10: gemma3.block_count u32 = 48 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256 -llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 20: 
gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 21: tokenizer.ggml.model str = llama -llama_model_loader: - kv 22: tokenizer.ggml.pre str = default -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
-llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 34: general.quantization_version u32 = 2 -llama_model_loader: - kv 35: general.file_type u32 = 7 -llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt -llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336 -llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 -llama_model_loader: - type f32: 289 tensors -llama_model_loader: - type q8_0: 311 tensors -llama_model_loader: - type bf16: 26 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 13.40 GiB (9.78 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 3840 -print_info: n_layer = 48 -print_info: n_head = 16 -print_info: n_head_kv = 8 -print_info: n_rot = 256 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 256 -print_info: n_embd_head_v = 256 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 6.2e-02 -print_info: n_ff = 15360 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 12B -print_info: model params = 11.77 B -print_info: general.name = Gemma-3-12B-It 
-print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 13721.20 MiB -load_tensors: ROCm_Host model buffer size = 1920.47 MiB -............................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB -llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB -llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB 
-llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 519.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2025 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 854716185 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -HelloWhat - -llama_perf_sampler_print: sampling time = 0.14 ms / 3 runs ( 0.05 ms per token, 21428.57 tokens per second) -llama_perf_context_print: load time = 2695.72 ms -llama_perf_context_print: prompt eval time = 75.18 ms / 2 tokens ( 37.59 ms per token, 26.60 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) 
-llama_perf_context_print: total time = 82.57 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 3.208919123s - Run #3 status: 0 - → Avg over 3 runs: 3.434s diff --git a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc.log b/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc.log deleted file mode 100644 index 0c8b97e..0000000 --- a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc.log +++ /dev/null @@ -1,165 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 12B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840 -llama_model_loader: - kv 10: gemma3.block_count u32 = 48 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256 -llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 21: tokenizer.ggml.model str = llama -llama_model_loader: - kv 22: tokenizer.ggml.pre str = default -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... 
-llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... -llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 34: general.quantization_version u32 = 2 -llama_model_loader: - kv 35: general.file_type u32 = 7 -llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt -llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336 -llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 -llama_model_loader: - type f32: 289 tensors -llama_model_loader: - type q8_0: 311 tensors -llama_model_loader: - type bf16: 26 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 13.40 GiB (9.78 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 3840 -print_info: n_layer = 48 -print_info: n_head = 16 -print_info: n_head_kv = 8 -print_info: n_rot = 256 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 256 -print_info: n_embd_head_v = 256 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale 
= 0.0e+00 -print_info: f_attn_scale = 6.2e-02 -print_info: n_ff = 15360 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 12B -print_info: model params = 11.77 B -print_info: general.name = Gemma-3-12B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: ROCm0 model buffer size = 13721.20 MiB -load_tensors: ROCm_Host model buffer size = 1920.47 MiB -............................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB -llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB -llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 519.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2025 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 754281730 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -HelloThe - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32608.70 tokens per second) -llama_perf_context_print: load time = 3090.57 ms -llama_perf_context_print: prompt eval time = 75.62 ms / 2 tokens ( 37.81 ms per token, 26.45 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 81.49 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 3.616272374s - Run #3 status: 0 - → Avg over 3 runs: 3.861s diff --git a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk.log b/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk.log deleted file mode 100644 index 3e8f841..0000000 --- a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk.log +++ /dev/null @@ -1,163 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 
| bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 12B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840 -llama_model_loader: - kv 10: gemma3.block_count u32 = 48 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256 -llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear 
-llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 21: tokenizer.ggml.model str = llama -llama_model_loader: - kv 22: tokenizer.ggml.pre str = default -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
-llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 34: general.quantization_version u32 = 2 -llama_model_loader: - kv 35: general.file_type u32 = 7 -llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt -llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336 -llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 -llama_model_loader: - type f32: 289 tensors -llama_model_loader: - type q8_0: 311 tensors -llama_model_loader: - type bf16: 26 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 13.40 GiB (9.78 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 3840 -print_info: n_layer = 48 -print_info: n_head = 16 -print_info: n_head_kv = 8 -print_info: n_rot = 256 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 256 -print_info: n_embd_head_v = 256 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 6.2e-02 -print_info: n_ff = 15360 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 12B -print_info: model params = 11.77 B -print_info: general.name = Gemma-3-12B-It 
-print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 13721.12 MiB -load_tensors: Vulkan_Host model buffer size = 1920.47 MiB -............................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB -llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 480.00 MiB -llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 
MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 519.62 MiB -llama_context: Vulkan_Host compute buffer size = 18.51 MiB -llama_context: graph nodes = 2025 -llama_context: graph splits = 2 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 356896032 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.12 ms / 3 runs ( 0.04 ms per token, 24390.24 tokens per second) -llama_perf_context_print: load time = 3459.76 ms -llama_perf_context_print: prompt eval time = 90.54 ms / 2 tokens ( 45.27 ms per token, 22.09 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 98.48 ms / 3 tokens 
-llama_perf_context_print: graphs reused = 0 - Elapsed #3: 3.933674345s - Run #3 status: 0 - → Avg over 3 runs: 3.955s diff --git a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv.log b/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv.log deleted file mode 100644 index dcf49f7..0000000 --- a/benchmark/loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv.log +++ /dev/null @@ -1,163 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 12B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840 -llama_model_loader: - kv 10: gemma3.block_count u32 = 48 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256 -llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 21: tokenizer.ggml.model str = llama -llama_model_loader: - kv 22: tokenizer.ggml.pre str = default -llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... 
-llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... -llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 34: general.quantization_version u32 = 2 -llama_model_loader: - kv 35: general.file_type u32 = 7 -llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat -llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt -llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336 -llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663 -llama_model_loader: - type f32: 289 tensors -llama_model_loader: - type q8_0: 311 tensors -llama_model_loader: - type bf16: 26 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q8_0 -print_info: file size = 13.40 GiB (9.78 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 3840 -print_info: n_layer = 48 -print_info: n_head = 16 -print_info: n_head_kv = 8 -print_info: n_rot = 256 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 256 -print_info: n_embd_head_v = 256 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale 
= 0.0e+00 -print_info: f_attn_scale = 6.2e-02 -print_info: n_ff = 15360 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 12B -print_info: model params = 11.77 B -print_info: general.name = Gemma-3-12B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 48 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 49/49 layers to GPU -load_tensors: Vulkan0 model buffer size = 13721.12 MiB -load_tensors: Vulkan_Host model buffer size = 1920.47 MiB -............................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB -llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 480.00 MiB -llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 519.62 MiB -llama_context: Vulkan_Host compute buffer size = 18.51 MiB -llama_context: graph nodes = 2025 -llama_context: graph splits = 2 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3541901199 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -HelloI - -llama_perf_sampler_print: sampling time = 0.12 ms / 3 runs ( 0.04 ms per token, 24590.16 tokens per second) -llama_perf_context_print: load time = 3946.08 ms -llama_perf_context_print: prompt eval time = 78.51 ms / 2 tokens ( 39.26 ms per token, 25.47 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 86.43 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 4.313578800s - Run #3 status: 0 - → Avg over 3 runs: 4.295s diff --git a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log b/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 4834b42..0000000 --- a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,164 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S 
Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 27B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 -llama_model_loader: - kv 10: gemma3.block_count u32 = 62 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128 -llama_model_loader: - kv 16: general.file_type u32 = 32 -llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16 
-llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 22: general.quantization_version u32 = 2 -llama_model_loader: - kv 23: tokenizer.ggml.model str = llama -llama_model_loader: - kv 24: tokenizer.ggml.pre str = default -llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
-llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.count u16 = 2 -llama_model_loader: - kv 38: split.tensors.count i32 = 808 -llama_model_loader: - type f32: 373 tensors -llama_model_loader: - type bf16: 435 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 50.31 GiB (16.00 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 5376 -print_info: n_layer = 62 -print_info: n_head = 32 -print_info: n_head_kv = 16 -print_info: n_rot = 128 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 7.7e-02 -print_info: n_ff = 21504 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 27B -print_info: model params = 27.01 B -print_info: general.name = Gemma-3-27B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 
-load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 62 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 63/63 layers to GPU -load_tensors: ROCm0 model buffer size = 51518.82 MiB -load_tensors: ROCm_Host model buffer size = 2688.66 MiB -............................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB -llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB -llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 522.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2613 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not 
supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 204092650 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello - -llama_perf_sampler_print: sampling time = 0.08 ms / 3 runs ( 0.03 ms per token, 39473.68 tokens per second) -llama_perf_context_print: load time = 7815.59 ms -llama_perf_context_print: prompt eval time = 253.33 ms / 2 tokens ( 126.66 ms per token, 7.89 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 258.00 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 11.830337249s - Run #3 status: 0 - → Avg over 3 runs: 12.495s diff --git a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta.log b/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta.log 
deleted file mode 100644 index 55f27ac..0000000 --- a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta.log +++ /dev/null @@ -1,164 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 27B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 -llama_model_loader: - kv 10: gemma3.block_count u32 = 62 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 -llama_model_loader: - kv 15: 
gemma3.attention.value_length u32 = 128 -llama_model_loader: - kv 16: general.file_type u32 = 32 -llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16 -llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 22: general.quantization_version u32 = 2 -llama_model_loader: - kv 23: tokenizer.ggml.model str = llama -llama_model_loader: - kv 24: tokenizer.ggml.pre str = default -llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
-llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.count u16 = 2 -llama_model_loader: - kv 38: split.tensors.count i32 = 808 -llama_model_loader: - type f32: 373 tensors -llama_model_loader: - type bf16: 435 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 50.31 GiB (16.00 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 5376 -print_info: n_layer = 62 -print_info: n_head = 32 -print_info: n_head_kv = 16 -print_info: n_rot = 128 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 7.7e-02 -print_info: n_ff = 21504 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 27B -print_info: model params = 27.01 B -print_info: general.name = Gemma-3-27B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 
-load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 62 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 63/63 layers to GPU -load_tensors: ROCm0 model buffer size = 51518.82 MiB -load_tensors: ROCm_Host model buffer size = 2688.66 MiB -............................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB -llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB -llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 522.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2613 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not 
supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 88592582 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second) -llama_perf_context_print: load time = 10385.57 ms -llama_perf_context_print: prompt eval time = 253.71 ms / 2 tokens ( 126.85 ms per token, 7.88 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 259.35 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 11.144656718s - Run #3 status: 0 - → Avg over 3 runs: 10.486s diff --git a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc.log b/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc.log deleted 
file mode 100644 index acb8825..0000000 --- a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc.log +++ /dev/null @@ -1,164 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 27B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 -llama_model_loader: - kv 10: gemma3.block_count u32 = 62 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 -llama_model_loader: - kv 15: 
gemma3.attention.value_length u32 = 128 -llama_model_loader: - kv 16: general.file_type u32 = 32 -llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16 -llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 22: general.quantization_version u32 = 2 -llama_model_loader: - kv 23: tokenizer.ggml.model str = llama -llama_model_loader: - kv 24: tokenizer.ggml.pre str = default -llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
-llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.count u16 = 2 -llama_model_loader: - kv 38: split.tensors.count i32 = 808 -llama_model_loader: - type f32: 373 tensors -llama_model_loader: - type bf16: 435 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 50.31 GiB (16.00 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 5376 -print_info: n_layer = 62 -print_info: n_head = 32 -print_info: n_head_kv = 16 -print_info: n_rot = 128 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 7.7e-02 -print_info: n_ff = 21504 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 27B -print_info: model params = 27.01 B -print_info: general.name = Gemma-3-27B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 
-load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 62 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 63/63 layers to GPU -load_tensors: ROCm0 model buffer size = 51518.82 MiB -load_tensors: ROCm_Host model buffer size = 2688.66 MiB -............................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB -llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB -llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 522.62 MiB -llama_context: ROCm_Host compute buffer size = 11.01 MiB -llama_context: graph nodes = 2613 -llama_context: graph splits = 1 -common_init_from_params: KV cache shifting is not 
supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1422263455 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second) -llama_perf_context_print: load time = 9620.16 ms -llama_perf_context_print: prompt eval time = 256.55 ms / 2 tokens ( 128.27 ms per token, 7.80 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 261.63 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 10.587027979s - Run #3 status: 0 - → Avg over 3 runs: 10.417s diff --git a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk.log 
b/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk.log deleted file mode 100644 index d13050e..0000000 --- a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk.log +++ /dev/null @@ -1,113 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 27B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 -llama_model_loader: - kv 10: gemma3.block_count u32 = 62 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128 -llama_model_loader: - kv 16: general.file_type u32 = 32 -llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16 -llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 22: general.quantization_version u32 = 2 -llama_model_loader: - kv 23: tokenizer.ggml.model str = llama -llama_model_loader: - kv 24: tokenizer.ggml.pre str = default -llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... 
-llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... -llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.count u16 = 2 -llama_model_loader: - kv 38: split.tensors.count i32 = 808 -llama_model_loader: - type f32: 373 tensors -llama_model_loader: - type bf16: 435 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 50.31 GiB (16.00 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 5376 -print_info: n_layer = 62 -print_info: n_head = 32 -print_info: n_head_kv = 16 -print_info: n_rot = 128 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 7.7e-02 -print_info: n_ff = 21504 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: 
n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 27B -print_info: model params = 27.01 B -print_info: general.name = Gemma-3-27B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -ggml_vulkan: Device memory allocation of size 2819260416 failed. -ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory -alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2819260416 -llama_model_load: error loading model: unable to allocate Vulkan0 buffer -llama_model_load_from_file_impl: failed to load model -common_init_from_params: failed to load model '/home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf' -main: error: unable to load model - Elapsed #3: .416644024s - Run #3 status: 1 - ✖ run #3 failed - → No successful runs diff --git a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv.log b/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv.log deleted file mode 100644 index 095974b..0000000 --- a/benchmark/loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv.log +++ /dev/null @@ -1,162 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 
8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: additional 1 GGUFs metadata loaded. -llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = gemma3 -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It -llama_model_loader: - kv 3: general.finetune str = it -llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It -llama_model_loader: - kv 5: general.quantized_by str = Unsloth -llama_model_loader: - kv 6: general.size_label str = 27B -llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth -llama_model_loader: - kv 8: gemma3.context_length u32 = 131072 -llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376 -llama_model_loader: - kv 10: gemma3.block_count u32 = 62 -llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504 -llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32 -llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 -llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128 -llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128 -llama_model_loader: - kv 16: general.file_type u32 = 32 -llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000 -llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024 -llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16 -llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear -llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000 -llama_model_loader: - kv 22: general.quantization_version u32 = 2 -llama_model_loader: - kv 23: tokenizer.ggml.model str = llama 
-llama_model_loader: - kv 24: tokenizer.ggml.pre str = default -llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... -llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... -llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... -llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2 -llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106 -llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3 -llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0 -llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false -llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... -llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false -llama_model_loader: - kv 36: split.no u16 = 0 -llama_model_loader: - kv 37: split.count u16 = 2 -llama_model_loader: - kv 38: split.tensors.count i32 = 808 -llama_model_loader: - type f32: 373 tensors -llama_model_loader: - type bf16: 435 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = BF16 -print_info: file size = 50.31 GiB (16.00 BPW) -load: special tokens cache size = 6415 -load: token to piece cache size = 1.9446 MB -print_info: arch = gemma3 -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 5376 -print_info: n_layer = 62 -print_info: n_head = 32 -print_info: n_head_kv = 16 -print_info: n_rot = 128 -print_info: n_swa = 1024 -print_info: is_swa_any = 1 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 2 -print_info: n_embd_k_gqa = 2048 -print_info: n_embd_v_gqa = 2048 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-06 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 
0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 7.7e-02 -print_info: n_ff = 21504 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 2 -print_info: rope scaling = linear -print_info: freq_base_train = 1000000.0 -print_info: freq_scale_train = 0.125 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 27B -print_info: model params = 27.01 B -print_info: general.name = Gemma-3-27B-It -print_info: vocab type = SPM -print_info: n_vocab = 262208 -print_info: n_merges = 0 -print_info: BOS token = 2 '' -print_info: EOS token = 106 '' -print_info: EOT token = 106 '' -print_info: UNK token = 3 '' -print_info: PAD token = 0 '' -print_info: LF token = 248 '<0x0A>' -print_info: EOG token = 106 '' -print_info: max token length = 48 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 62 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 63/63 layers to GPU -load_tensors: Vulkan0 model buffer size = 51518.82 MiB -load_tensors: Vulkan_Host model buffer size = 2688.66 MiB -............................................................................................. 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 1000000.0 -llama_context: freq_scale = 0.125 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 1.00 MiB -llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 320.00 MiB -llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells -llama_kv_cache_unified: Vulkan0 KV buffer size = 624.00 MiB -llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 522.62 MiB -llama_context: Vulkan_Host compute buffer size = 21.51 MiB -llama_context: graph nodes = 2613 -llama_context: graph splits = 2 -common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting -common_init_from_params: added logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 4215263583 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.18 ms / 3 runs ( 0.06 ms per token, 16666.67 tokens per second) -llama_perf_context_print: load time = 14451.51 ms -llama_perf_context_print: prompt eval time = 257.32 ms / 2 tokens ( 128.66 ms per token, 7.77 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 265.56 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 15.024330058s - Run #3 status: 0 - → Avg over 3 runs: 13.579s diff --git a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm6_4_2.log b/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm6_4_2.log deleted file mode 100644 index dc8cd03..0000000 --- a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm6_4_2.log +++ /dev/null @@ -1,159 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, 
Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free -llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12 -llama_model_loader: - kv 3: general.version str = 2024-12 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Llama-3.1 -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.license str = llama3.1 -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B -llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ... 
-llama_model_loader: - kv 14: llama.block_count u32 = 80 -llama_model_loader: - kv 15: llama.context_length u32 = 131072 -llama_model_loader: - kv 16: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 18: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 22: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 23: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 24: general.file_type u32 = 15 -llama_model_loader: - kv 25: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 35: general.quantization_version u32 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q4_K: 441 tensors -llama_model_loader: - type q5_K: 40 tensors -llama_model_loader: - type q6_K: 81 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 39.59 GiB (4.82 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama 3.1 70B Instruct 2024 12 -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' 
-print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: CPU model buffer size = 563.62 MiB -load_tensors: ROCm0 model buffer size = 39979.48 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1295757489 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 61224.49 tokens per second) -llama_perf_context_print: load time = 5592.62 ms -llama_perf_context_print: prompt eval time = 248.28 ms / 2 tokens ( 124.14 ms per token, 8.06 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 263.25 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 9.635053314s - Run #3 status: 0 - → Avg over 3 runs: 9.887s diff --git a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_beta.log b/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_beta.log deleted file mode 100644 index 3dd2b4b..0000000 --- a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_beta.log +++ /dev/null @@ -1,159 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD 
Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12 -llama_model_loader: - kv 3: general.version str = 2024-12 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Llama-3.1 -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.license str = llama3.1 -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B -llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ... 
-llama_model_loader: - kv 14: llama.block_count u32 = 80 -llama_model_loader: - kv 15: llama.context_length u32 = 131072 -llama_model_loader: - kv 16: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 18: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 22: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 23: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 24: general.file_type u32 = 15 -llama_model_loader: - kv 25: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 35: general.quantization_version u32 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q4_K: 441 tensors -llama_model_loader: - type q5_K: 40 tensors -llama_model_loader: - type q6_K: 81 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 39.59 GiB (4.82 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama 3.1 70B Instruct 2024 12 -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' 
-print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: CPU model buffer size = 563.62 MiB -load_tensors: ROCm0 model buffer size = 39979.48 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 3791928713 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello. 
- -llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 57692.31 tokens per second) -llama_perf_context_print: load time = 6133.42 ms -llama_perf_context_print: prompt eval time = 247.67 ms / 2 tokens ( 123.83 ms per token, 8.08 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 268.37 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 6.904239282s - Run #3 status: 0 - → Avg over 3 runs: 9.338s diff --git a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_rc.log b/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_rc.log deleted file mode 100644 index 687d9ff..0000000 --- a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_rc.log +++ /dev/null @@ -1,159 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free -llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12 -llama_model_loader: - kv 3: general.version str = 2024-12 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Llama-3.1 -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.license str = llama3.1 -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B -llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ... 
-llama_model_loader: - kv 14: llama.block_count u32 = 80 -llama_model_loader: - kv 15: llama.context_length u32 = 131072 -llama_model_loader: - kv 16: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 18: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 22: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 23: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 24: general.file_type u32 = 15 -llama_model_loader: - kv 25: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 35: general.quantization_version u32 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q4_K: 441 tensors -llama_model_loader: - type q5_K: 40 tensors -llama_model_loader: - type q6_K: 81 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 39.59 GiB (4.82 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama 3.1 70B Instruct 2024 12 -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' 
-print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: CPU model buffer size = 563.62 MiB -load_tensors: ROCm0 model buffer size = 39979.48 MiB -................................................................................................... -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: ROCm_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: ROCm0 compute buffer size = 266.50 MiB -llama_context: ROCm_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 59935472 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello. 
- -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second) -llama_perf_context_print: load time = 12737.72 ms -llama_perf_context_print: prompt eval time = 291.99 ms / 2 tokens ( 145.99 ms per token, 6.85 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 306.96 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 13.680764475s - Run #3 status: 0 - → Avg over 3 runs: 14.602s diff --git a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk.log b/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk.log deleted file mode 100644 index 267da60..0000000 --- a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk.log +++ /dev/null @@ -1,157 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat -build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free -llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
-llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12 -llama_model_loader: - kv 3: general.version str = 2024-12 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Llama-3.1 -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.license str = llama3.1 -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B -llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ... 
-llama_model_loader: - kv 14: llama.block_count u32 = 80 -llama_model_loader: - kv 15: llama.context_length u32 = 131072 -llama_model_loader: - kv 16: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 18: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 22: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 23: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 24: general.file_type u32 = 15 -llama_model_loader: - kv 25: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 35: general.quantization_version u32 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q4_K: 441 tensors -llama_model_loader: - type q5_K: 40 tensors -llama_model_loader: - type q6_K: 81 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 39.59 GiB (4.82 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama 3.1 70B Instruct 2024 12 -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' 
-print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: Vulkan0 model buffer size = 39979.48 MiB -load_tensors: CPU model buffer size = 563.62 MiB -.................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 266.50 MiB -llama_context: Vulkan_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 1976378490 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello, - -llama_perf_sampler_print: sampling time = 0.08 ms / 3 runs ( 0.03 ms per token, 36585.37 tokens per second) -llama_perf_context_print: load time = 6987.06 ms -llama_perf_context_print: prompt eval time = 210.77 ms / 2 tokens ( 105.39 ms per token, 9.49 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 232.45 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 7.786884955s - Run #3 status: 0 - → Avg over 3 runs: 9.176s diff --git a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_radv.log b/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_radv.log deleted file mode 100644 index 326ab94..0000000 --- a/benchmark/loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_radv.log +++ /dev/null @@ -1,157 +0,0 @@ -ggml_vulkan: Found 1 Vulkan devices: -ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | 
matrix cores: KHR_coopmat -build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free -llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12 -llama_model_loader: - kv 3: general.version str = 2024-12 -llama_model_loader: - kv 4: general.finetune str = Instruct -llama_model_loader: - kv 5: general.basename str = Llama-3.1 -llama_model_loader: - kv 6: general.size_label str = 70B -llama_model_loader: - kv 7: general.license str = llama3.1 -llama_model_loader: - kv 8: general.base_model.count u32 = 1 -llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B -llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama -llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla... -llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ... 
-llama_model_loader: - kv 14: llama.block_count u32 = 80 -llama_model_loader: - kv 15: llama.context_length u32 = 131072 -llama_model_loader: - kv 16: llama.embedding_length u32 = 8192 -llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672 -llama_model_loader: - kv 18: llama.attention.head_count u32 = 64 -llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 22: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 23: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 24: general.file_type u32 = 15 -llama_model_loader: - kv 25: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... -llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009 -llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
-llama_model_loader: - kv 35: general.quantization_version u32 = 2 -llama_model_loader: - type f32: 162 tensors -llama_model_loader: - type q4_K: 441 tensors -llama_model_loader: - type q5_K: 40 tensors -llama_model_loader: - type q6_K: 81 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_K - Medium -print_info: file size = 39.59 GiB (4.82 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 8192 -print_info: n_layer = 80 -print_info: n_head = 64 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 8 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 28672 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = 70B -print_info: model params = 70.55 B -print_info: general.name = Llama 3.1 70B Instruct 2024 12 -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128009 '<|eot_id|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' 
-print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = false) -load_tensors: offloading 80 repeating layers to GPU -load_tensors: offloading output layer to GPU -load_tensors: offloaded 81/81 layers to GPU -load_tensors: Vulkan0 model buffer size = 39979.48 MiB -load_tensors: CPU model buffer size = 563.62 MiB -.................................................................................................. -llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 1 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: Vulkan_Host output buffer size = 0.49 MiB -llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB -llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: Vulkan0 compute buffer size = 266.50 MiB -llama_context: Vulkan_Host compute buffer size = 24.01 MiB -llama_context: graph nodes = 2647 -llama_context: graph splits = 2 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable) -main: llama threadpool init, n_threads = 16 - -system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | - -sampler seed: 2613669910 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1 - -Hello's - -llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 40540.54 tokens per second) -llama_perf_context_print: load time = 8119.06 ms -llama_perf_context_print: prompt eval time = 204.01 ms / 2 tokens ( 102.01 ms per token, 9.80 tokens per second) -llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second) -llama_perf_context_print: total time = 225.18 ms / 3 tokens -llama_perf_context_print: graphs reused = 0 - Elapsed #3: 8.699816033s - Run #3 status: 0 - → Avg over 3 runs: 8.816s diff --git a/benchmark/parse_loadtime_results.py b/benchmark/parse_loadtime_results.py deleted file mode 100755 index 54d5297..0000000 --- a/benchmark/parse_loadtime_results.py +++ /dev/null @@ -1,71 +0,0 @@ -#!/usr/bin/env python3 -""" -Parse the console output of run_loadtime_benchmarks.sh stored in run_loadtime_benchmarks.log, -then produce a Markdown table of average load+inference times per model/env. 
-""" -import re -from collections import defaultdict, OrderedDict -import sys - -LOGFILE = 'run_loadtime_benchmark.log' -# Define expected environments in desired column order -ENV_ORDER = ['vulkan_radv','vulkan_amdvlk','rocm6_4_2','rocm7_beta','rocm7_rc'] - -# Regex patterns -ENTRY_RE = re.compile(r"✔ \[(?P<env>[^]]+)\] (?P<model>[^ ]+) avg=(?P<avg>[0-9.]+)s over (?P<runs>[0-9]+) runs") -FAIL_RE = re.compile(r"✖ \[(?P<env>[^]]+)\] (?P<model>[^ ]+) all runs failed") - -# Data containers -results = defaultdict(lambda: {}) # results[model][env] = float or 'ERR' - -# Read and parse log -with open(LOGFILE) as f: - for line in f: - line = line.strip() - m = ENTRY_RE.match(line) - if m: - env = m.group('env') - model = m.group('model') - avg = float(m.group('avg')) - results[model][env] = avg - continue - m2 = FAIL_RE.match(line) - if m2: - env = m2.group('env') - model = m2.group('model') - results[model][env] = None # indicate failure - -# Compute winner per model: smallest time -md_lines = [] -# Header -header = ['Model'] + [e.replace('_',' ').title() for e in ENV_ORDER] + ['Fastest'] -md_lines.append('| ' + ' | '.join(header) + ' |') -md_lines.append('|' + '|'.join(['---']*len(header)) + '|') - -for model in sorted(results, key=lambda s: s.lower()): - row = [f"**{model}**"] - env_times = results[model] - # find fastest - valid = {e:env_times[e] for e in ENV_ORDER if e in env_times and env_times[e] is not None} - if valid: - best_env = min(valid, key=lambda k: valid[k]) - fastest = f"🏆 **{best_env}**" - else: - fastest = '—' - for env in ENV_ORDER: - if env not in env_times: - cell = '—' - else: - t = env_times[env] - if t is None: - cell = '⚠️ Fail' - else: - cell = f"{t:.2f}s" - row.append(cell) - row.append(fastest) - md_lines.append('| ' + ' | '.join(row) + ' |') - -# Print markdown -table = '\n'.join(md_lines) -print(table) - diff --git a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log b/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log
deleted file mode 100644 index 3e888bb..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -Memory access fault by GPU node-1 (Agent handle: 0x275a2540) on address 0x7f3fb2c08000. Reason: Page not present or supervisor privilege. -✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 694fdbe..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x25d19540) reason :GPU Hang -✖ ! 
[rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index d19e880..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 0 | pp512 | 131.14 ± 0.28 | -| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 0 | tg128 | 20.15 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index e6a5d48..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 1 | 0 | pp512 | 104.12 ± 0.05 | -| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 1 | 0 | tg128 | 20.35 ± 0.00 | - -build: de219279 (6181) diff --git 
a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log b/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log deleted file mode 100644 index 3ee3c3e..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x3e28b540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 2b15919..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -Memory access fault by GPU node-1 (Agent handle: 0x2bdf8540) on address 0x7f5f95e35000. Reason: Page not present or supervisor privilege. -✖ ! 
[rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2.log b/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index 63bd38e..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x3ff2d540) reason :GPU Hang -✖ ! [rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2__fa1.log b/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2__fa1.log deleted file mode 100644 index 18f04dd..0000000 --- a/benchmark/results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x3bb3540) reason :GPU Hang -✖ ! 
[rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log b/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log deleted file mode 100644 index 4fb737a..0000000 --- a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x33b8a540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 8ed4e21..0000000 --- a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x20e35540) reason :GPU Hang -✖ ! 
[rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 8ad5ab6..0000000 --- a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x1b1ea540) reason :GPU Hang -✖ ! [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index 7860063..0000000 --- a/benchmark/results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| llama 70B Q8_0 | 75.65 GiB | 70.55 B | ROCm | 99 | 1 | 0 | pp512 | 16.16 ± 0.02 | -| llama 70B Q8_0 | 75.65 GiB | 70.55 B | ROCm | 99 | 1 | 0 | tg128 | 2.78 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma.log deleted file mode 100644 index 6babfa8..0000000 
--- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x344ea540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 2f3524f..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0xe316540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 1009e19..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x17ade540) reason :GPU Hang -✖ ! 
[rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index c7625db..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0xe91f540) reason :GPU Hang -✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma.log deleted file mode 100644 index a8331aa..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x1019d540) reason :GPU Hang -✖ ! 
[rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index b68e909..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x2ff5c540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index 88fb55c..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x3db80540) reason :GPU Hang -✖ ! 
[rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2__fa1.log deleted file mode 100644 index befc174..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x24a4c540) reason :GPU Hang -✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log deleted file mode 100644 index 90103a9..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -Memory access fault by GPU node-1 (Agent handle: 0x3e5ce540) on address 0x7f64d3b76000. Reason: Page not present or supervisor privilege. -✖ ! 
[rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 2e11ead..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x1239e540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index b311bf5..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x101f4540) reason :GPU Hang -✖ ! 
[rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 failed (exit 134) diff --git a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index 8ac1834..0000000 --- a/benchmark/results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -Memory access fault by GPU node-1 (Agent handle: 0x15f12540) on address 0x7ef17d976000. Reason: Page not present or supervisor privilege. -✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134) diff --git a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log b/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log deleted file mode 100644 index 1adbae9..0000000 --- a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x2f5d1540) reason :GPU Hang -✖ ! 
[rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 9a061f5..0000000 --- a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0xdc93540) reason :GPU Hang -✖ ! [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log b/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index a1be58c..0000000 --- a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0xff7540) reason :GPU Hang -✖ ! 
[rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2__fa1.log b/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2__fa1.log deleted file mode 100644 index 281a126..0000000 --- a/benchmark/results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2__fa1.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x2607e540) reason :GPU Hang -✖ ! [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 failed (exit 134) diff --git a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log b/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log deleted file mode 100644 index 4f8da20..0000000 --- a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 157.75 ± 2.58 | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 24.62 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 598f905..0000000 --- 
a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 161.90 ± 3.05 | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 24.09 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log b/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index ae13cea..0000000 --- a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 157.81 ± 2.51 | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 24.61 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index 2791323..0000000 --- a/benchmark/results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: 
GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 140.24 ± 1.86 | -| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 24.46 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log b/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log deleted file mode 100644 index 14182c9..0000000 --- a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 387.23 ± 0.82 | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 50.64 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 63e83fb..0000000 --- a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S 
Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 411.72 ± 1.04 | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 48.78 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log b/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log deleted file mode 100644 index f33f7c4..0000000 --- a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 387.86 ± 1.41 | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 50.65 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log b/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log deleted file mode 100644 index 928cc4b..0000000 --- a/benchmark/results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| 
------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 301.23 ± 0.49 | -| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 50.07 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log b/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log deleted file mode 100644 index a6af248..0000000 --- a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | pp512 | 222.91 ± 0.21 | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | tg128 | 14.03 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 333ac47..0000000 --- a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 
1 | 0 | pp512 | 229.15 ± 0.24 | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | tg128 | 13.76 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log b/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log deleted file mode 100644 index f26f454..0000000 --- a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | pp512 | 222.59 ± 0.24 | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | tg128 | 14.03 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log b/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log deleted file mode 100644 index df5dd02..0000000 --- a/benchmark/results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | pp512 | 197.89 ± 3.40 | -| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | tg128 | 13.76 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log 
b/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log deleted file mode 100644 index 0219357..0000000 --- a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 0 | pp512 | 87.20 ± 3.70 | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 0 | tg128 | 4.09 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 8dcf6c9..0000000 --- a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 1 | 0 | pp512 | 68.87 ± 14.37 | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 1 | 0 | tg128 | 4.08 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log b/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log deleted file mode 100644 index 627bb9e..0000000 --- 
a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 0 | pp512 | 82.57 ± 10.36 | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 0 | tg128 | 4.09 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log b/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log deleted file mode 100644 index b35b468..0000000 --- a/benchmark/results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 1 | 0 | pp512 | 74.78 ± 10.12 | -| gemma3 27B BF16 | 50.31 GiB | 27.01 B | ROCm | 99 | 1 | 0 | tg128 | 4.09 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log b/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log deleted file mode 100644 index 84fba20..0000000 --- a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 
ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 0 | pp512 | 728.70 ± 1.28 | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 0 | tg128 | 76.63 ± 0.03 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 073d72d..0000000 --- a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 1 | 0 | pp512 | 752.52 ± 0.83 | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 1 | 0 | tg128 | 70.93 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log b/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log deleted file mode 100644 index 4ab8e49..0000000 --- a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: 
| ---: | --------------: | -------------------: | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 0 | pp512 | 729.33 ± 1.93 | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 0 | tg128 | 76.79 ± 0.03 | - -build: de219279 (6181) diff --git a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log b/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log deleted file mode 100644 index bdf1afb..0000000 --- a/benchmark/results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 1 | 0 | pp512 | 645.25 ± 0.89 | -| gemma3 4B Q3_K - Small | 1.80 GiB | 3.88 B | ROCm | 99 | 1 | 0 | tg128 | 70.31 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log b/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log deleted file mode 100644 index d8257a4..0000000 --- a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 0 | pp512 | 355.59 ± 0.86 | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 0 | tg128 | 33.97 ± 0.01 | - -build: de219279 (6181) 
diff --git a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index c765b22..0000000 --- a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 390.43 ± 0.70 | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 33.81 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2.log b/benchmark/results/gpt-oss-120b-F16__rocm6_4_2.log deleted file mode 100644 index 306797a..0000000 --- a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 0 | pp512 | 355.94 ± 1.35 | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 0 | tg128 | 33.97 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2__fa1.log b/benchmark/results/gpt-oss-120b-F16__rocm6_4_2__fa1.log deleted file mode 100644 index a49785b..0000000 --- a/benchmark/results/gpt-oss-120b-F16__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ 
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 322.57 ± 0.31 | -| gpt-oss ?B F16 | 60.87 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 33.30 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log b/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log deleted file mode 100644 index 360ed4e..0000000 --- a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 0 | pp512 | 353.20 ± 0.30 | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 0 | tg128 | 45.42 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 6969d6b..0000000 --- a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: 
Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 387.10 ± 0.42 | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 45.16 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2.log b/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2.log deleted file mode 100644 index 32eae28..0000000 --- a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2.log +++ /dev/null @@ -1,6 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -HW Exception by GPU node-1 (Agent handle: 0x2bea6540) reason :GPU Hang -✖ ! 
[rocm6_4_2] gpt-oss-120b-mxfp4-00001-of-00003 failed (exit 134) diff --git a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log b/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log deleted file mode 100644 index 1d395f6..0000000 --- a/benchmark/results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | pp512 | 319.84 ± 0.73 | -| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 0 | tg128 | 44.43 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log b/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log deleted file mode 100644 index 6698000..0000000 --- a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 0 | pp512 | 324.30 ± 4.23 | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 0 | tg128 | 27.10 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log deleted file 
mode 100644 index 5f3dda7..0000000 --- a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 1 | 0 | pp512 | 342.14 ± 4.83 | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 1 | 0 | tg128 | 27.05 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2.log b/benchmark/results/gpt-oss-20b-F32__rocm6_4_2.log deleted file mode 100644 index 5e1169a..0000000 --- a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 0 | pp512 | 324.36 ± 4.35 | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 0 | tg128 | 27.12 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2__fa1.log b/benchmark/results/gpt-oss-20b-F32__rocm6_4_2__fa1.log deleted file mode 100644 index a1a5a05..0000000 --- a/benchmark/results/gpt-oss-20b-F32__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 
(0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 1 | 0 | pp512 | 304.23 ± 3.73 | -| gpt-oss ?B BF16 | 38.97 GiB | 20.91 B | ROCm | 99 | 1 | 0 | tg128 | 26.85 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log b/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log deleted file mode 100644 index 28cbd07..0000000 --- a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 0 | pp512 | 582.60 ± 4.90 | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 0 | tg128 | 64.91 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log b/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log deleted file mode 100644 index 985eaf7..0000000 --- a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | 
-| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 0 | pp512 | 644.05 ± 3.87 | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 0 | tg128 | 64.63 ± 0.01 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2.log b/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2.log deleted file mode 100644 index c8bf125..0000000 --- a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 0 | pp512 | 581.11 ± 2.96 | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 0 | tg128 | 65.00 ± 0.02 | - -build: de219279 (6181) diff --git a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log b/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log deleted file mode 100644 index 320f480..0000000 --- a/benchmark/results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log +++ /dev/null @@ -1,10 +0,0 @@ -ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no -ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no -ggml_cuda_init: found 1 ROCm devices: - Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 -| model | size | params | backend | ngl | fa | mmap | test | t/s | -| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 0 | pp512 | 522.29 ± 2.36 | -| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 0 | tg128 | 63.63 ± 0.00 | - -build: de219279 (6181) diff --git a/benchmark/run_benchmarks.log 
b/benchmark/run_benchmarks.log deleted file mode 100644 index b9e965c..0000000 --- a/benchmark/run_benchmarks.log +++ /dev/null @@ -1,1392 +0,0 @@ -Found 19 model(s) to bench: - • /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - • /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - • /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - • /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - • /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - • /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - • /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - • /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - • /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - • /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - • /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - • /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - • /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - • /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - • /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - • /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - • /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - • /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - • /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: 
results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_beta] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [rocm7_beta] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2-rocwmma] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_radv] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [vulkan_radv] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ 
[vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - * [vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 : FAILED - -▶ [vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - * [vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gemma-3-27b-it-BF16-00001-of-00002 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] gemma-3-27b-it-BF16-00001-of-00002 __fa1 - → log: results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc.log - → cmd: 
toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [rocm7_rc] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [rocm7_beta] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [rocm7_beta] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [rocm6_4_2-rocwmma] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [vulkan_radv] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [vulkan_radv] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [vulkan_amdvlk] 
gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [vulkan_amdvlk] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [rocm6_4_2] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [rocm6_4_2] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gemma-3-12b-it-UD-Q8_K_XL - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf - - -▶ [rocm7_rc-rocwmma] gemma-3-12b-it-UD-Q8_K_XL __fa1 - → log: results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-12b-it-UD-Q8_K_XL.gguf -fa 1 - - -▶ [rocm7_rc] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [rocm7_rc] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [rocm7_beta] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [rocm7_beta] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [rocm6_4_2-rocwmma] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [vulkan_radv] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [vulkan_radv] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [vulkan_amdvlk] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [vulkan_amdvlk] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [rocm6_4_2] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [rocm6_4_2] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gemma-3-4b-it-Q3_K_S - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf - - -▶ [rocm7_rc-rocwmma] gemma-3-4b-it-Q3_K_S __fa1 - → log: results/gemma-3-4b-it-Q3_K_S__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gemma-3/gemma-3-4b-it-Q3_K_S.gguf -fa 1 - - -▶ [rocm7_rc] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_beta] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm7_beta] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 
__fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - * [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 : FAILED - -▶ [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_radv] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [vulkan_radv] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: 
results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm6_4_2] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_rc-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [rocm7_rc] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_rc__fa1.log - → 
cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_beta] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [rocm7_beta] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - * [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 : FAILED - -▶ [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - * [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 : FAILED - -▶ [vulkan_radv] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [vulkan_radv] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: 
results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [vulkan_amdvlk] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [vulkan_amdvlk] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf - - -▶ [rocm7_rc-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 - → log: results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- 
/usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/GLM-4.5-Air/UD-Q6_K_XL/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [rocm7_rc] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [rocm7_beta] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [rocm7_beta] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - * [rocm6_4_2-rocwmma] gpt-oss-120b-F16 : FAILED - -▶ [rocm6_4_2-rocwmma] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [vulkan_radv] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [vulkan_radv] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench 
-ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [vulkan_amdvlk] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [vulkan_amdvlk] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [rocm6_4_2] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [rocm6_4_2] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gpt-oss-120b-F16 - → log: results/gpt-oss-120b-F16__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf - - -▶ [rocm7_rc-rocwmma] gpt-oss-120b-F16 __fa1 - → log: results/gpt-oss-120b-F16__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-F16.gguf -fa 1 - - -▶ [rocm7_rc] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [rocm7_rc] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_beta] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [rocm7_beta] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [rocm6_4_2-rocwmma] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [vulkan_radv] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [vulkan_radv] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [vulkan_amdvlk] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- 
/usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [vulkan_amdvlk] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - * [rocm6_4_2] gpt-oss-120b-mxfp4-00001-of-00003 : FAILED - -▶ [rocm6_4_2] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gpt-oss-120b-mxfp4-00001-of-00003 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf - - -▶ [rocm7_rc-rocwmma] gpt-oss-120b-mxfp4-00001-of-00003 __fa1 - → log: results/gpt-oss-120b-mxfp4-00001-of-00003__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [rocm7_rc] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- 
/usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [rocm7_beta] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [rocm7_beta] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [rocm6_4_2-rocwmma] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [vulkan_radv] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [vulkan_radv] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [vulkan_amdvlk] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [vulkan_amdvlk] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [rocm6_4_2] 
gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [rocm6_4_2] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gpt-oss-20b-F32 - → log: results/gpt-oss-20b-F32__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf - - -▶ [rocm7_rc-rocwmma] gpt-oss-20b-F32 __fa1 - → log: results/gpt-oss-20b-F32__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-F32.gguf -fa 1 - - -▶ [rocm7_rc] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [rocm7_rc] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [rocm7_beta] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [rocm7_beta] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c 
llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [rocm6_4_2-rocwmma] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [vulkan_radv] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [vulkan_radv] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [vulkan_amdvlk] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [vulkan_amdvlk] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [rocm6_4_2] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [rocm6_4_2] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] gpt-oss-20b-mxfp4 - → log: results/gpt-oss-20b-mxfp4__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf - - -▶ [rocm7_rc-rocwmma] gpt-oss-20b-mxfp4 __fa1 - → log: results/gpt-oss-20b-mxfp4__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 - - -▶ [rocm7_rc] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm7_rc] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_beta] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_beta] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm7_beta] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm6_4_2-rocwmma] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [vulkan_radv] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [vulkan_radv] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - * [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 : FAILED - -▶ [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: 
results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - * [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 : FAILED - -▶ [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_rc-rocwmma] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm7_rc] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_beta] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_beta] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm7_beta] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - * [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [vulkan_radv] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: 
results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [vulkan_radv] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 - → log: results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [rocm7_rc] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [rocm7_beta] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [rocm7_beta] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- 
/usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [rocm6_4_2-rocwmma] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [vulkan_radv] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [vulkan_radv] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [vulkan_amdvlk] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [vulkan_amdvlk] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [rocm6_4_2] llama3.3-70.6B-Q4_K_M - → log: results/llama3.3-70.6B-Q4_K_M__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [rocm6_4_2] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] llama3.3-70.6B-Q4_K_M - → log: 
results/llama3.3-70.6B-Q4_K_M__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf - - -▶ [rocm7_rc-rocwmma] llama3.3-70.6B-Q4_K_M __fa1 - → log: results/llama3.3-70.6B-Q4_K_M__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf -fa 1 - - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 : FAILED - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log - → cmd: 
toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf - - * [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 : FAILED - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c 
llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - * [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 : FAILED - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - * [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 : FAILED - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [rocm7_rc] 
Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - * [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 : FAILED - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - * [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 : FAILED - -▶ [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 : FAILED - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 
__fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - * [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 : FAILED - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf - - -▶ [rocm7_rc-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 - → log: results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - -▶ [rocm7_rc] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta.log - → cmd: 
toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - * [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 : FAILED - -▶ [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - * [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - * [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 : FAILED - -▶ [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - * [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 : FAILED - -▶ [vulkan_radv] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m 
/mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - -▶ [vulkan_radv] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [vulkan_amdvlk] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - -▶ [vulkan_amdvlk] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - -▶ [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] 
Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf - - -▶ [rocm7_rc-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 - → log: results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 - - -▶ [rocm7_rc] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - * [rocm7_rc] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_beta] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [rocm7_beta] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - * [rocm7_beta] 
Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2-rocwmma] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_radv] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [vulkan_radv] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 
-- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] Qwen3-30B-A3B-BF16-00001-of-00002 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] Qwen3-30B-A3B-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm7_rc] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [rocm7_rc] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [rocm7_beta] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [rocm7_beta] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: 
results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [rocm6_4_2-rocwmma] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [rocm6_4_2-rocwmma] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [vulkan_radv] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [vulkan_radv] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__vulkan_radv__fa1.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [vulkan_amdvlk] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [vulkan_amdvlk] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 
-mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [rocm6_4_2] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [rocm6_4_2] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [rocm7_rc-rocwmma] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf - - -▶ [rocm7_rc-rocwmma] Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL __fa1 - → log: results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen-3-30B-A3B/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL.gguf -fa 1 - - -▶ [rocm7_rc] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc__fa1.log - → cmd: toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - -▶ 
[rocm7_beta] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm7_beta] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta__fa1.log - → cmd: toolbox run -c llama-rocm-7beta -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - * [rocm7_beta] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 : FAILED - -▶ [rocm6_4_2-rocwmma] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2-rocwmma.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2-rocwmma] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_radv] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv.log - → cmd: toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [vulkan_radv] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv__fa1.log - → cmd: toolbox run -c 
llama-vulkan-radv -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - -▶ [vulkan_amdvlk] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [vulkan_amdvlk] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk__fa1.log - → cmd: toolbox run -c llama-vulkan-amdvlk -- /usr/sbin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - -▶ [rocm6_4_2] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm6_4_2] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2__fa1.log - → cmd: toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - - * [rocm6_4_2] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 : FAILED - -▶ [rocm7_rc-rocwmma] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc-rocwmma.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf - - -▶ [rocm7_rc-rocwmma] 
Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 __fa1 - → log: results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc-rocwmma__fa1.log - → cmd: toolbox run -c llama-rocm-7rc-rocwmma -- /usr/local/bin/llama-bench -ngl 999 -mmp 0 -m /mnt/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf -fa 1 - diff --git a/benchmark/run_benchmarks.sh b/benchmark/run_benchmarks.sh index 151e1d1..94f4708 100755 --- a/benchmark/run_benchmarks.sh +++ b/benchmark/run_benchmarks.sh @@ -26,8 +26,6 @@ done echo declare -A CMDS=( - [rocm6_4_2]="toolbox run -c llama-rocm-6.4.2 -- /usr/local/bin/llama-bench" - [rocm6_4_2-rocwmma]="toolbox run -c llama-rocm-6.4.2-rocwmma -- /usr/local/bin/llama-bench" [rocm6_4_3]="toolbox run -c llama-rocm-6.4.3 -- /usr/local/bin/llama-bench" [rocm6_4_3-rocwmma]="toolbox run -c llama-rocm-6.4.3-rocwmma -- /usr/local/bin/llama-bench" [rocm7_rc]="toolbox run -c llama-rocm-7rc -- /usr/local/bin/llama-bench" diff --git a/benchmark/run_loadtime_benchmark.log b/benchmark/run_loadtime_benchmark.log deleted file mode 100644 index c4de9de..0000000 --- a/benchmark/run_loadtime_benchmark.log +++ /dev/null @@ -1,277 +0,0 @@ -Found 11 models to test with llama-cli (3 runs each) - -▶ [rocm7_rc] gemma-3-12b-it-UD-Q8_K_XL (runs: 3) - → log : loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] gemma-3-12b-it-UD-Q8_K_XL avg=3.861s over 3 runs - -▶ [rocm7_beta] gemma-3-12b-it-UD-Q8_K_XL (runs: 3) - → log : loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] gemma-3-12b-it-UD-Q8_K_XL avg=3.434s over 3 runs - -▶ [vulkan_radv] gemma-3-12b-it-UD-Q8_K_XL (runs: 3) - → log : loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] gemma-3-12b-it-UD-Q8_K_XL avg=4.295s over 3 runs - -▶ [vulkan_amdvlk] gemma-3-12b-it-UD-Q8_K_XL (runs: 3) 
- → log : loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] gemma-3-12b-it-UD-Q8_K_XL avg=3.955s over 3 runs - -▶ [rocm6_4_2] gemma-3-12b-it-UD-Q8_K_XL (runs: 3) - → log : loadtime_results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] gemma-3-12b-it-UD-Q8_K_XL avg=6.686s over 3 runs - -▶ [rocm7_rc] gemma-3-27b-it-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] gemma-3-27b-it-BF16-00001-of-00002 avg=10.417s over 3 runs - -▶ [rocm7_beta] gemma-3-27b-it-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] gemma-3-27b-it-BF16-00001-of-00002 avg=10.486s over 3 runs - -▶ [vulkan_radv] gemma-3-27b-it-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] gemma-3-27b-it-BF16-00001-of-00002 avg=13.579s over 3 runs - -▶ [vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✖ [vulkan_amdvlk] gemma-3-27b-it-BF16-00001-of-00002 all runs failed - -▶ [rocm6_4_2] gemma-3-27b-it-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] gemma-3-27b-it-BF16-00001-of-00002 avg=12.495s over 3 runs - -▶ [rocm7_rc] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 avg=26.362s over 
3 runs - -▶ [rocm7_beta] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 avg=30.024s over 3 runs - -▶ [vulkan_radv] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 avg=30.591s over 3 runs - -▶ [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✖ [vulkan_amdvlk] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 all runs failed - -▶ [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002 avg=35.301s over 3 runs - -▶ [rocm7_rc] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 avg=32.911s over 3 runs - -▶ [rocm7_beta] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 avg=32.796s over 3 runs - -▶ [vulkan_radv] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] 
Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 avg=30.376s over 3 runs - -▶ [vulkan_amdvlk] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 avg=30.604s over 3 runs - -▶ [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 avg=30.998s over 3 runs - -▶ [rocm7_rc] llama3.3-70.6B-Q4_K_M (runs: 3) - → log : loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] llama3.3-70.6B-Q4_K_M avg=14.602s over 3 runs - -▶ [rocm7_beta] llama3.3-70.6B-Q4_K_M (runs: 3) - → log : loadtime_results/llama3.3-70.6B-Q4_K_M__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] llama3.3-70.6B-Q4_K_M avg=9.338s over 3 runs - -▶ [vulkan_radv] llama3.3-70.6B-Q4_K_M (runs: 3) - → log : loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] llama3.3-70.6B-Q4_K_M avg=8.816s over 3 runs - -▶ [vulkan_amdvlk] llama3.3-70.6B-Q4_K_M (runs: 3) - → log : loadtime_results/llama3.3-70.6B-Q4_K_M__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] llama3.3-70.6B-Q4_K_M avg=9.176s over 3 runs - -▶ [rocm6_4_2] llama3.3-70.6B-Q4_K_M (runs: 3) - → log : loadtime_results/llama3.3-70.6B-Q4_K_M__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] llama3.3-70.6B-Q4_K_M avg=9.887s over 3 runs - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_rc.log - → 
flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 avg=19.365s over 2 runs - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✖ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 all runs failed - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 avg=20.045s over 3 runs - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 avg=16.752s over 3 runs - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 avg=15.776s over 3 runs - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 avg=28.435s over 3 runs - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ 
[rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 avg=28.221s over 3 runs - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 avg=32.810s over 3 runs - -▶ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 avg=35.541s over 3 runs - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 avg=31.792s over 3 runs - -▶ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 avg=35.742s over 3 runs - -▶ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 avg=36.400s over 3 runs - -▶ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 avg=41.626s over 3 runs - -▶ [vulkan_amdvlk] 
Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 avg=47.967s over 3 runs - -▶ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 (runs: 3) - → log : loadtime_results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 avg=40.739s over 3 runs - -▶ [rocm7_rc] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 (runs: 3) - → log : loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 avg=33.458s over 3 runs - -▶ [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 (runs: 3) - → log : loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 avg=35.392s over 3 runs - -▶ [vulkan_radv] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 (runs: 3) - → log : loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 avg=40.722s over 3 runs - -▶ [vulkan_amdvlk] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 (runs: 3) - → log : loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 avg=44.883s over 3 runs - -▶ [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 
(runs: 3) - → log : loadtime_results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 avg=39.062s over 3 runs - -▶ [rocm7_rc] Qwen3-30B-A3B-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Qwen3-30B-A3B-BF16-00001-of-00002 avg=22.669s over 3 runs - -▶ [rocm7_beta] Qwen3-30B-A3B-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Qwen3-30B-A3B-BF16-00001-of-00002 avg=15.930s over 3 runs - -▶ [vulkan_radv] Qwen3-30B-A3B-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Qwen3-30B-A3B-BF16-00001-of-00002 avg=14.761s over 3 runs - -▶ [vulkan_amdvlk] Qwen3-30B-A3B-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Qwen3-30B-A3B-BF16-00001-of-00002 avg=12.935s over 3 runs - -▶ [rocm6_4_2] Qwen3-30B-A3B-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Qwen3-30B-A3B-BF16-00001-of-00002 avg=22.166s over 3 runs - -▶ [rocm7_rc] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_rc.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_rc] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 avg=16.161s over 3 runs - -▶ [rocm7_beta] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 (runs: 3) - → log : 
loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm7_beta.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm7_beta] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 avg=14.392s over 3 runs - -▶ [vulkan_radv] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_radv.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_radv] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 avg=14.021s over 3 runs - -▶ [vulkan_amdvlk] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__vulkan_amdvlk.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [vulkan_amdvlk] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 avg=12.940s over 3 runs - -▶ [rocm6_4_2] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 (runs: 3) - → log : loadtime_results/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002__rocm6_4_2.log - → flags : -ngl 999 -fa --no-mmap -no-cnv -n 1 -✔ [rocm6_4_2] Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002 avg=17.779s over 3 runs - diff --git a/benchmark/run_loadtime_benchmark.sh b/benchmark/run_loadtime_benchmark.sh deleted file mode 100755 index 57612da..0000000 --- a/benchmark/run_loadtime_benchmark.sh +++ /dev/null @@ -1,88 +0,0 @@ -#!/usr/bin/env bash -# run_loadtime_benchmarks.sh -# Benchmark each model with llama-cli: measure load + single-token inference times (including load time) -# Run each model/env combination 3 times and compute average elapsed time -set -uo pipefail - -MODEL_DIR="$(realpath models)" -RESULTDIR="loadtime_results" -mkdir -p "$RESULTDIR" - -# 1) Gather one .gguf per model (single-file or first shard) -mapfile -t MODELS < <( - find "$MODEL_DIR" -type f -name '*.gguf' \ - \( -name '*-00001-of-*.gguf' -o ! 
-name '*-000*-of-*.gguf' \) \ - | sort -) -if (( ${#MODELS[@]} == 0 )); then - echo "❌ No models found in $MODEL_DIR" >&2 - exit 1 -fi - -echo "Found ${#MODELS[@]} models to test with llama-cli (3 runs each)" - -# 2) Define environments and llama-cli prefix -declare -A ENVS=( - [rocm6_4_2]="toolbox run -c llama-rocm-6.4.2 -- llama-cli" - [rocm7_beta]="toolbox run -c llama-rocm-7beta -- llama-cli" - [rocm7_rc]="toolbox run -c llama-rocm-7rc -- llama-cli" - [vulkan_amdvlk]="toolbox run -c llama-vulkan-amdvlk -- llama-cli" - [vulkan_radv]="toolbox run -c llama-vulkan-radv -- llama-cli" -) - -# Prompt and flags -PROMPT="Hello" -BASE_FLAGS=( -ngl 999 -fa --no-mmap -no-cnv -n 1 ) -REPEATS=3 - -# 3) Loop models/envs -for MODEL_PATH in "${MODELS[@]}"; do - MODEL_NAME="$(basename "${MODEL_PATH%.gguf}")" - - for ENV in "${!ENVS[@]}"; do - # Prepare output file - OUTFILE="$RESULTDIR/${MODEL_NAME}__${ENV}.log" - rm -f "$OUTFILE" - - # Build command prefix array - IFS=' ' read -r -a PREFIX_CMD <<< "${ENVS[$ENV]}" - FLAG_ARRAY=( "${BASE_FLAGS[@]}" ) - - echo - echo "▶ [$ENV] $MODEL_NAME (runs: $REPEATS)" - echo " → log : $OUTFILE" - echo " → flags : ${FLAG_ARRAY[*]}" - - sum=0 - success=0 - - for i in $(seq 1 $REPEATS); do - echo " Run #$i..." >>"$OUTFILE" - start=$(date +%s.%N) - # Run llama-cli; suppress its output to log (no tee) - "${PREFIX_CMD[@]}" "${FLAG_ARRAY[@]}" -m "$MODEL_PATH" -p "$PROMPT" >"$OUTFILE" 2>&1 - status=$? 
- end=$(date +%s.%N) - elapsed=$(echo "$end - $start" | bc) - echo " Elapsed #$i: ${elapsed}s" >>"$OUTFILE" - echo " Run #$i status: $status" >>"$OUTFILE" - - if [ $status -eq 0 ]; then - sum=$(echo "$sum + $elapsed" | bc) - ((success++)) - else - echo " ✖ run #$i failed" >>"$OUTFILE" - fi - done - - if [ $success -gt 0 ]; then - avg=$(echo "scale=3; $sum / $success" | bc) - echo " → Avg over $success runs: ${avg}s" >>"$OUTFILE" - echo "✔ [$ENV] $MODEL_NAME avg=${avg}s over $success runs" - else - echo " → No successful runs" >>"$OUTFILE" - echo "✖ [$ENV] $MODEL_NAME all runs failed" - fi - done -done - diff --git a/docs/benchmarks.md b/docs/benchmarks.md index 5d74e16..0a326fc 100644 --- a/docs/benchmarks.md +++ b/docs/benchmarks.md @@ -1,124 +1,158 @@ # AMD Strix Halo — llama.cpp Toolboxes (Benchmarks) -**Interactive results:** [https://kyuz0.github.io/amd-strix-halo-toolboxes/](https://kyuz0.github.io/amd-strix-halo-toolboxes/) +**Interactive results:** https://kyuz0.github.io/amd-strix-halo-toolboxes/ -* Filter by model name, size, and quantization -* Select backends with or without **Flash Attention** -* Compare pp512 and tg128 side-by-side -* Winners are computed using an **error-aware tolerance rule** — if two results overlap within their ± error margins, both are counted as winners. 
+## Table of Contents +- [Benchmark methodology](#benchmark-methodology) +- [Summary of current dataset (Flash Attention ON)](#summary-of-current-dataset-flash-attention-on) + - [Placement counts](#placement-counts) + - [Pairwise head-to-head wins](#pairwise-head-to-head-wins) + - [Average ranks](#average-ranks) +- [Analyses by feature](#analyses-by-feature) + - [Impact of Flash Attention](#impact-of-flash-attention) + - [Impact of ROCWMMA](#impact-of-rocwmma) + - [Impact of hipBLASLt](#impact-of-hipblaslt) + - [Vulkan: AMDVLK vs RADV](#vulkan-amdvlk-vs-radv) +- [Recommendations](#recommendations) +- [Winner calculation](#winner-calculation) --- ## Benchmark methodology -* **pp512** — prompt processing throughput (tokens/sec, prefill) -* **tg128** — token generation throughput (tokens/sec, interactive) -* Each backend tested twice per model: +- **pp512** — prompt processing throughput (tokens/sec, prefill) +- **tg128** — token generation throughput (tokens/sec, interactive) +- Each backend tested twice per model: `-fa 0` and `-fa 1` +- Winners per model/test are **margin-aware**; multiple winners are possible when mean±σ overlap +- Built from the same llama.cpp commit for consistency - * **Flash Attention OFF:** `-fa 0` - * **Flash Attention ON:** `-fa 1` -* Winners are determined per model using pooled ± error from all relevant runs; multiple winners are possible. -* All runs were built from the same `llama.cpp` commit for consistency. 
+**Backends in this dataset:** ROCm 7 RC + ROCWMMA + hipBLASLt, ROCm 7 RC (hipBLASLt), ROCm 7 RC (hipBLASLt OFF), ROCm 7 RC + ROCWMMA (hipBLASLt OFF), ROCm 6.4.3 (hipBLASLt), ROCm 6.4.3 (hipBLASLt OFF), ROCm 6.4.3 + ROCWMMA (hipBLASLt), ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF), Vulkan AMDVLK, Vulkan RADV -**Tested backends:** - -* Vulkan RADV -* Vulkan AMDVLK -* ROCm 6.4.2 -* ROCm 6.4.2 + ROCWMMA -* ROCm 7.x (beta / RC) -* ROCm 7.x + ROCWMMA + hipBLASLt - -**Note on ROCm 7 hipBLASLt:** -All ROCm 7 toolboxes ship with **hipBLASLt enabled by default** (`ROCBLAS_USE_HIPBLASLT=1`) because it improves performance and stability in most cases. -However, the benchmark script also includes runs with **hipBLASLt disabled** (`-hblt0`) so we can measure the impact directly. +**ROCm 7 hipBLASLt policy:** Toolboxes ship with **hipBLASLt enabled** by default (`ROCBLAS_USE_HIPBLASLT=1`). The benchmark script also runs **hipBLASLt OFF** variants (`-hblt0`) to measure its effect. --- -## Running benchmarks - -Place `.gguf` models in `models/` (for sharded models, include only the first shard: `*-00001-of-*.gguf`). - -Run: - -```bash -benchmark/run_benchmarks.sh -``` - -This will: - -* Detect models -* Execute each backend twice (FA off / FA on) -* Save logs in `benchmark/results/` - -Generate `results.json` for analysis: - -```bash -python benchmark/parse_results_to_json.py -``` - -Optional: print summary statistics: - -```bash -python benchmark/summarize_results.py -``` - ---- - -## Summary of current dataset (margin-aware, Flash Attention ON) - -### Prompt Processing (pp512) - -* **ROCm 7 RC + ROCWMMA + hipBLASLt** dominates — **15 wins/ties** out of 22 models. -* **Vulkan AMDVLK** is second most frequent winner (**4 wins/ties**) but can’t load certain architectures due to the ≤ 2 GiB single-buffer limit. -* **Vulkan RADV** rarely wins in PP but is highly stable. - -### Token Generation (tg128) - -* **Vulkan RADV** leads — **13 wins/ties** out of 15 possible. 
-* **Vulkan AMDVLK** is a strong second, usually just behind RADV in TG. -* **ROCm 7 RC + ROCWMMA + hipBLASLt** generally lags in TG but still posts competitive results for some models. - ---- - -### Placement counts (margin-aware, Flash Attention ON) +## Summary of current dataset (Flash Attention ON) +### Placement counts **Prompt Processing (pp512)** - -| Backend | 1st | 2nd | 3rd | -| ------------------------------- | -----: | --: | --: | -| ROCm 7 RC + ROCWMMA + hipBLASLt | **15** | 2 | 1 | -| Vulkan AMDVLK | 4 | 5 | 1 | -| Vulkan RADV | 0 | 2 | 2 | +| Backend | 1st | 2nd | 3rd | +| --- | ---: | ---: | ---: | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 9 | 5 | 0 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 3 | 3 | 8 | +| Vulkan AMDVLK | 3 | 0 | 2 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 8 | 4 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 0 | 0 | 1 | +| Vulkan RADV | 0 | 0 | 1 | **Token Generation (tg128)** +| Backend | 1st | 2nd | 3rd | +| --- | ---: | ---: | ---: | +| Vulkan RADV | 13 | 0 | 0 | +| ROCm 6.4.3 (hipBLASLt) | 3 | 0 | 1 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1 | 4 | 3 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 1 | 2 | 4 | +| ROCm 6.4.3 (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 1 | 1 | 1 | +| ROCm 7 RC (hipBLASLt) | 1 | 0 | 4 | +| Vulkan AMDVLK | 0 | 10 | 0 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 0 | 1 | 2 | -| Backend | 1st | 2nd | 3rd | -| ------------------------------- | -----: | --: | --: | -| Vulkan RADV | **13** | 1 | 1 | -| Vulkan AMDVLK | 1 | 10 | 1 | -| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 1 | 6 | +### Pairwise head-to-head wins +For any model+quant where both backends succeeded, this counts who was faster (ties when equal). 
+| Comparison | Test | A wins | B wins | Ties | Total | +| --- | --- | ---: | ---: | ---: | ---: | +| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | pp512 | 11 | 4 | 0 | 15 | +| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | tg128 | 4 | 10 | 1 | 15 | +| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | pp512 | 14 | 2 | 0 | 16 | +| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | tg128 | 3 | 13 | 0 | 16 | +| Vulkan AMDVLK vs Vulkan RADV | pp512 | 13 | 2 | 0 | 15 | +| Vulkan AMDVLK vs Vulkan RADV | tg128 | 2 | 13 | 0 | 15 | + +### Average ranks +**Prompt Processing (pp512)** +| Backend | Avg Rank (↓ is better) | +| --- | ---: | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1.36 | +| Vulkan AMDVLK | 1.8 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.23 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.36 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 3.0 | +| Vulkan RADV | 3.0 | + +**Token Generation (tg128)** +| Backend | Avg Rank (↓ is better) | +| --- | ---: | +| Vulkan RADV | 1.0 | +| ROCm 6.4.3 (hipBLASLt) | 1.5 | +| Vulkan AMDVLK | 2.0 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.0 | +| ROCm 7 RC (hipBLASLt OFF) | 2.0 | +| ROCm 6.4.3 (hipBLASLt OFF) | 2.0 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 2.25 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 2.43 | +| ROCm 7 RC (hipBLASLt) | 2.6 | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.67 | --- -## Flash Attention +## Analyses by feature -* **ROCm 7 RC + ROCWMMA + hipBLASLt** benefits noticeably from Flash Attention ON in prompt processing, with no stability penalties recorded. -* **Vulkan AMDVLK** and **Vulkan RADV** show mixed changes — some models improve with FA, others slow down slightly. -* FA should be enabled or disabled **per model/backend** based on measured performance. 
+### Impact of Flash Attention +Median % change when **Flash Attention ON vs OFF**, paired by model+quant, per backend: +| Backend | pp512 Δ% (median, min..max, n) | tg128 Δ% (median, min..max, n) | +| --- | --- | --- | +| ROCm 7 RC + ROCWMMA + hipBLASLt | 8.4% (3.6..65.6), n=14 | -1.1% (-8.2..-0.3), n=14 | +| ROCm 7 RC (hipBLASLt) | -20.2% (-27.8..6.5), n=10 | -1.4% (-8.5..3.0), n=10 | +| ROCm 7 RC (hipBLASLt OFF) | -20.4% (-28.2..-16.1), n=9 | -1.9% (-8.6..0.1), n=9 | +| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 5.8% (1.3..24.1), n=16 | -1.1% (-7.4..15.1), n=16 | +| ROCm 6.4.3 (hipBLASLt) | -19.5% (-25.7..-11.9), n=12 | -1.2% (-6.9..0.8), n=12 | +| ROCm 6.4.3 (hipBLASLt OFF) | -10.3% (-22.3..3.6), n=9 | -1.6% (-11.1..0.0), n=9 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 10.9% (3.9..25.7), n=15 | -0.4% (-7.5..3.0), n=15 | +| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 6.4% (1.8..12.3), n=10 | -0.6% (-6.5..2.3), n=10 | +| Vulkan AMDVLK | 1.1% (-45.4..20.2), n=15 | -1.5% (-28.6..0.1), n=15 | +| Vulkan RADV | 3.4% (-2.6..12.5), n=16 | 0.0% (-5.8..2.4), n=16 | + +### Impact of ROCWMMA +| Context | Test | Compared Envs | Pairs | Median Δ% | +| --- | --- | --- | ---: | ---: | +| ROCm 7 RC (hipBLASLt) | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | 16.3% | +| ROCm 7 RC (hipBLASLt) | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | -0.7% | +| ROCm 7 RC (hipBLASLt OFF) | pp512 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | 14.6% | +| ROCm 7 RC (hipBLASLt OFF) | tg128 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.7% | +| ROCm 6.4.3 (hipBLASLt) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | 17.4% | +| ROCm 6.4.3 (hipBLASLt) | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | -0.3% | +| ROCm 6.4.3 (hipBLASLt OFF) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 10.2% | +| ROCm 6.4.3 (hipBLASLt OFF) 
| tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.3% | + +### Impact of hipBLASLt +| Context | Test | Compared Envs | Pairs | Median Δ% | +| --- | --- | --- | ---: | ---: | +| ROCm 7 RC (no ROCWMMA) | pp512 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.2% | +| ROCm 7 RC (no ROCWMMA) | tg128 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.1% | +| ROCm 7 RC + ROCWMMA | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 1.4% | +| ROCm 7 RC + ROCWMMA | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 0.0% | +| ROCm 6.4.3 (no ROCWMMA) | pp512 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 155.5% | +| ROCm 6.4.3 (no ROCWMMA) | tg128 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.0% | +| ROCm 6.4.3 + ROCWMMA | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | 116.9% | +| ROCm 6.4.3 + ROCWMMA | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | -0.0% | + +### Vulkan: AMDVLK vs RADV +Head-to-head wins with selected Flash Attention filter: +| Test | AMDVLK wins | RADV wins | Ties | Total | +| --- | ---: | ---: | ---: | ---: | +| pp512 | 13 | 2 | 0 | 15 | +| tg128 | 2 | 13 | 0 | 15 | --- ## Recommendations - -* **Fastest prompt processing:** ROCm 7 RC + ROCWMMA + hipBLASLt (Flash Attention ON) -* **Fastest token generation:** Vulkan RADV (Flash Attention ON) -* **Balanced performance:** Vulkan AMDVLK (fast PP & decent TG, but ≤ 2 GiB buffer limit) -* **BF16 models:** ROCm 7 RC + ROCWMMA + hipBLASLt (best ROCm PP/TG combo, stable with FA ON) -* **Maximum stability:** Vulkan RADV +- **Fastest prompt processing:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (most 1st-place finishes with selected Flash Attention filter). +- **Fastest token generation:** Vulkan RADV (most 1st-place finishes with selected Flash Attention filter). 
+- **Balanced choice:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (consistently near the top across PP/TG). --- ## Winner calculation - -A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This ensures results within measurement noise are treated as ties, not false losses. +A backend is counted as a winner if its mean throughput is within the best backend’s pooled ± error margin for that model/test type. This treats results within measurement noise as ties instead of false losses. \ No newline at end of file diff --git a/docs/results.json b/docs/results.json index d05c75a..ccecf32 100644 --- a/docs/results.json +++ b/docs/results.json @@ -1,6 +1,6 @@ { "meta": { - "generated_at": "2025-08-17T07:42:51Z", + "generated_at": "2025-08-17T10:57:41Z", "os_kernel": "Fedora 42 \u2014 Linux 6.15.9-201.fc42.x86_64 (Sat Aug 2 11:37:34 UTC 2025)", "llamacpp_builds": [ { @@ -13,8 +13,6 @@ } ], "environments": [ - "rocm6_4_2", - "rocm6_4_2-rocwmma", "rocm6_4_3", "rocm6_4_3-hblt0", "rocm6_4_3-rocwmma", @@ -29,150 +27,6 @@ "notes": "pp512 = prompt processing; tg128 = text generation; t/s = tokens/second" }, "runs": [ - { - "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "runtime", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": 
true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 131.14, - "tps_std": 0.28, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 110.47, - "file_size_gib": 68.01, - "name_params_b": 110.47, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 20.15, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 110.47, - "file_size_gib": 68.01, - "name_params_b": 110.47, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 104.12, - "tps_std": 0.05, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 110.47, - "file_size_gib": 68.01, - "name_params_b": 110.47, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": 
"GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", - "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 20.35, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 110.47, - "file_size_gib": 68.01, - "name_params_b": 110.47, - "quant": "Q4_K_XL", - "log": "results/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002", "model_clean": "GLM-4.5-Air-UD-Q4_K_XL", @@ -1061,94 +915,6 @@ "number": "6182" } }, - { - "model": "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003", - "model_clean": "GLM-4.5-Air-UD-Q6_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q6_K_XL", - "log": "results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003", - "model_clean": "GLM-4.5-Air-UD-Q6_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "runtime", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q6_K_XL", - "log": "results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003", - "model_clean": "GLM-4.5-Air-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - 
"tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q6_K_XL", - "log": "results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2.log", - "build": null - }, - { - "model": "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003", - "model_clean": "GLM-4.5-Air-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "Q6_K_XL", - "log": "results/GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003__rocm6_4_2__fa1.log", - "build": null - }, { "model": "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003", "model_clean": "GLM-4.5-Air-UD-Q6_K_XL", @@ -1981,122 +1747,6 @@ "number": "6182" } }, - { - "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", - "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 70.0, - "quant": "Q8_K_XL", - "log": "results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", - "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 70.0, - "quant": "Q8_K_XL", - "log": 
"results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", - "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 70.0, - "quant": "Q8_K_XL", - "log": "results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2.log", - "build": null - }, - { - "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", - "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 16.16, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 70.55, - "file_size_gib": 75.65, - "name_params_b": 70.55, - "quant": "Q8_K_XL", - "log": "results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", - "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 2.78, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 70.55, - "file_size_gib": 75.65, - "name_params_b": 70.55, - "quant": "Q8_K_XL", - "log": "results/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002", "model_clean": "Llama-3.3-70B-Instruct-UD-Q8_K_XL", @@ -2929,94 +2579,6 @@ 
"number": "6182" } }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q6_K", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q6_K", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q6_K", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q6_K", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q6_K", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q6_K", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q6_K", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - 
"error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q6_K", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002__rocm6_4_2__fa1.log", - "build": null - }, { "model": "Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002", "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q6_K", @@ -3793,94 +3355,6 @@ "number": "6182" } }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q8_0", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q8_0", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q8_0", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q8_0", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q8_0", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - 
"file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q8_0", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q8_0", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q8_0", - "log": "results/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003__rocm6_4_2__fa1.log", - "build": null - }, { "model": "Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003", "model_clean": "Llama-4-Scout-17B-16E-Instruct-Q8_0", @@ -4685,94 +4159,6 @@ "number": "6182" } }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "runtime", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q4_K_XL", - "log": "results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q4_K_XL", - "log": 
"results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q4_K_XL", - "log": "results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2.log", - "build": null - }, - { - "model": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002", - "model_clean": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "runtime", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 17.0, - "quant": "Q4_K_XL", - "log": "results/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002__rocm6_4_2__fa1.log", - "build": null - }, { "model": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002", "model_clean": "Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL", @@ -5689,94 +5075,6 @@ "number": "6182" } }, - { - "model": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003", - "model_clean": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 235.0, - "quant": "Q3_K_XL", - "log": 
"results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma.log", - "build": null - }, - { - "model": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003", - "model_clean": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 235.0, - "quant": "Q3_K_XL", - "log": "results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2-rocwmma__fa1.log", - "build": null - }, - { - "model": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003", - "model_clean": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 235.0, - "quant": "Q3_K_XL", - "log": "results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2.log", - "build": null - }, - { - "model": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003", - "model_clean": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": 235.0, - "quant": "Q3_K_XL", - "log": "results/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003__rocm6_4_2__fa1.log", - "build": null - }, { "model": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003", "model_clean": "Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL", @@ 
-6553,206 +5851,6 @@ "number": "6182" } }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 157.75, - "tps_std": 2.58, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 24.62, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 161.9, - "tps_std": 3.05, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 24.09, - "tps_std": 
0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 157.81, - "tps_std": 2.51, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 24.61, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 140.24, - "tps_std": 1.86, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": 
"Qwen3-30B-A3B-BF16-00001-of-00002", - "model_clean": "Qwen3-30B-A3B-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 24.46, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 56.89, - "name_params_b": 30.53, - "quant": "BF16", - "log": "results/Qwen3-30B-A3B-BF16-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "Qwen3-30B-A3B-BF16-00001-of-00002", "model_clean": "Qwen3-30B-A3B-BF16", @@ -7697,206 +6795,6 @@ "number": "6182" } }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 387.23, - "tps_std": 0.82, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 50.64, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": 
"rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 411.72, - "tps_std": 1.04, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 48.78, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 387.86, - "tps_std": 1.41, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 50.65, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - 
"ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 301.23, - "tps_std": 0.49, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 50.07, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 30.53, - "file_size_gib": 24.53, - "name_params_b": 30.53, - "quant": "Q6_K_XL", - "log": "results/Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", "model_clean": "Qwen3-30B-A3B-Instruct-2507-UD-Q6_K_XL", @@ -8897,206 +7795,6 @@ "number": "6182" } }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 222.91, - "tps_std": 0.21, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - 
"quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 14.03, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 229.15, - "tps_std": 0.24, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 13.76, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - 
"env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 222.59, - "tps_std": 0.24, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 14.03, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 197.89, - "tps_std": 3.4, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-12b-it-UD-Q8_K_XL", - "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 13.76, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 11.77, - "file_size_gib": 13.4, - "name_params_b": 11.77, - "quant": "Q8_K_XL", - "log": "results/gemma-3-12b-it-UD-Q8_K_XL__rocm6_4_2__fa1.log", - "build": { - "hash": 
"de219279", - "number": "6181" - } - }, { "model": "gemma-3-12b-it-UD-Q8_K_XL", "model_clean": "gemma-3-12b-it-UD-Q8_K_XL", @@ -10097,206 +8795,6 @@ "number": "6182" } }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 87.2, - "tps_std": 3.7, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 4.09, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 68.87, - "tps_std": 14.37, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": 
"rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 4.08, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 82.57, - "tps_std": 10.36, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 4.09, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 74.78, - "tps_std": 10.12, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": 
"results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-27b-it-BF16-00001-of-00002", - "model_clean": "gemma-3-27b-it-BF16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 4.09, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 27.01, - "file_size_gib": 50.31, - "name_params_b": 27.01, - "quant": "BF16", - "log": "results/gemma-3-27b-it-BF16-00001-of-00002__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gemma-3-27b-it-BF16-00001-of-00002", "model_clean": "gemma-3-27b-it-BF16", @@ -11241,206 +9739,6 @@ "number": "6182" } }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 728.7, - "tps_std": 1.28, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 76.63, - "tps_std": 0.03, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": 
"rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 752.52, - "tps_std": 0.83, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 70.93, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 729.33, - "tps_std": 1.93, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 76.79, - "tps_std": 0.03, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2.log", - "build": 
{ - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 645.25, - "tps_std": 0.89, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gemma-3-4b-it-Q3_K_S", - "model_clean": "gemma-3-4b-it-Q3_K_S", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 70.31, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 3.88, - "file_size_gib": 1.8, - "name_params_b": 3.88, - "quant": "Q3_K_S", - "log": "results/gemma-3-4b-it-Q3_K_S__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gemma-3-4b-it-Q3_K_S", "model_clean": "gemma-3-4b-it-Q3_K_S", @@ -12441,206 +10739,6 @@ "number": "6182" } }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 355.59, - "tps_std": 0.86, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 33.97, - "tps_std": 0.01, - 
"error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 390.43, - "tps_std": 0.7, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 33.81, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 355.94, - "tps_std": 1.35, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2", - "env_base": 
"rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 33.97, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 322.57, - "tps_std": 0.31, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-F16", - "model_clean": "gpt-oss-120b-F16", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 33.3, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 60.87, - "name_params_b": 116.83, - "quant": "F16", - "log": "results/gpt-oss-120b-F16__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gpt-oss-120b-F16", "model_clean": "gpt-oss-120b-F16", @@ -13641,178 +11739,6 @@ "number": "6182" } }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 353.2, - "tps_std": 0.3, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": 
"results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 45.42, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 387.1, - "tps_std": 0.42, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 45.16, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - 
"env_variant": null, - "fa": false, - "test": null, - "tps_mean": null, - "tps_std": null, - "error": true, - "error_type": "hang", - "backend": null, - "ngl": null, - "mmap": null, - "params_b": null, - "file_size_gib": null, - "name_params_b": null, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2.log", - "build": null - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 319.84, - "tps_std": 0.73, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-120b-mxfp4-00001-of-00003", - "model_clean": "gpt-oss-120b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 44.43, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 116.83, - "file_size_gib": 59.02, - "name_params_b": 116.83, - "quant": "MXFP4", - "log": "results/gpt-oss-120b-mxfp4-00001-of-00003__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gpt-oss-120b-mxfp4-00001-of-00003", "model_clean": "gpt-oss-120b-mxfp4", @@ -14785,206 +12711,6 @@ "number": "6182" } }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 324.3, - "tps_std": 4.23, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - 
"quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 27.1, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 342.14, - "tps_std": 4.83, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 27.05, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 324.36, - "tps_std": 4.35, - "error": false, - "error_type": null, - 
"backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 27.12, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 304.23, - "tps_std": 3.73, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-F32", - "model_clean": "gpt-oss-20b-F32", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 26.85, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 38.97, - "name_params_b": 20.91, - "quant": "F32", - "log": "results/gpt-oss-20b-F32__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gpt-oss-20b-F32", "model_clean": "gpt-oss-20b-F32", @@ -15985,206 +13711,6 @@ "number": "6182" } }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2-rocwmma", - 
"env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "pp512", - "tps_mean": 582.6, - "tps_std": 4.9, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": false, - "test": "tg128", - "tps_mean": 64.91, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "pp512", - "tps_mean": 644.05, - "tps_std": 3.87, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2-rocwmma", - "env_base": "rocm6_4_2", - "env_variant": "rocwmma", - "fa": true, - "test": "tg128", - "tps_mean": 64.63, - "tps_std": 0.01, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2-rocwmma__fa1.log", - "build": { 
- "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "pp512", - "tps_mean": 581.11, - "tps_std": 2.96, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": false, - "test": "tg128", - "tps_mean": 65.0, - "tps_std": 0.02, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "pp512", - "tps_mean": 522.29, - "tps_std": 2.36, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": "MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, - { - "model": "gpt-oss-20b-mxfp4", - "model_clean": "gpt-oss-20b-mxfp4", - "env": "rocm6_4_2", - "env_base": "rocm6_4_2", - "env_variant": null, - "fa": true, - "test": "tg128", - "tps_mean": 63.63, - "tps_std": 0.0, - "error": false, - "error_type": null, - "backend": "ROCm", - "ngl": 99, - "mmap": 0, - "params_b": 20.91, - "file_size_gib": 11.27, - "name_params_b": 20.91, - "quant": 
"MXFP4", - "log": "results/gpt-oss-20b-mxfp4__rocm6_4_2__fa1.log", - "build": { - "hash": "de219279", - "number": "6181" - } - }, { "model": "gpt-oss-20b-mxfp4", "model_clean": "gpt-oss-20b-mxfp4",