Updated benchmakrs, removed old toolboxes and results

This commit is contained in:
Donato Capitella
2025-08-17 12:32:08 +01:00
parent 62e5080102
commit b71a37647f
130 changed files with 733 additions and 14425 deletions
+1 -1
View File
@@ -28,7 +28,7 @@ jobs:
IN='${{ inputs.backends }}' IN='${{ inputs.backends }}'
if [[ "$IN" == "all" || -z "$IN" ]]; then if [[ "$IN" == "all" || -z "$IN" ]]; then
JSON='["rocm-6.4.2","rocm-6.4.2-rocwmma","rocm-6.4.3","rocm-6.4.3-rocwmma","rocm-7rc","rocm-7rc-rocwmma","vulkan-amdvlk","vulkan-radv"]' JSON='["rocm-6.4.3","rocm-6.4.3-rocwmma","rocm-7rc","rocm-7rc-rocwmma","vulkan-amdvlk","vulkan-radv"]'
else else
# Remove spaces and build JSON array from comma list # Remove spaces and build JSON array from comma list
IN_CLEAN=$(echo "$IN" | tr -d '[:space:]') IN_CLEAN=$(echo "$IN" | tr -d '[:space:]')
+32 -35
View File
@@ -47,18 +47,16 @@ You can check the containers on DockerHub: https://hub.docker.com/r/kyuz0/amd-st
| -------------------- | ------------------------ | --------------- | | -------------------- | ------------------------ | --------------- |
| `vulkan-amdvlk` | Vulkan (AMDVLK) | Fastest backend—AMD open-source driver. ≤2 GiB single buffer allocation limit, some large models won't load. | | `vulkan-amdvlk` | Vulkan (AMDVLK) | Fastest backend—AMD open-source driver. ≤2 GiB single buffer allocation limit, some large models won't load. |
| `vulkan-radv` | Vulkan (Mesa RADV) | Most stable and compatible. Recommended for most users and all models. | | `vulkan-radv` | Vulkan (Mesa RADV) | Most stable and compatible. Recommended for most users and all models. |
| `rocm-6.4.2` | ROCm 6.4.2 (HIP) | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. |
| `rocm-6.4.2-rocwmma` | ROCm 6.4.2 (HIP) + ROCWMMA | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. |
| `rocm-6.4.3` | ROCm 6.4.3 (HIP) + hipBLASLt* | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. | | `rocm-6.4.3` | ROCm 6.4.3 (HIP) + hipBLASLt* | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. |
| `rocm-6.4.3-rocwmma` | ROCm 6.4.3 (HIP) + ROCWMMA + hipBLASLt* | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. | | `rocm-6.4.3-rocwmma` | ROCm 6.4.3 (HIP) + ROCWMMA + hipBLASLt* | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. |
| `rocm-7rc` | ROCm 7.0 RC (HIP) + hipBLASLt* | Release candidate for ROCm 7.0. Same behavior as beta. | | `rocm-7rc` | ROCm 7.0 RC (HIP) + hipBLASLt* | Release candidate for ROCm 7.0. |
| `rocm-7rc-rocwmma` | ROCm 7.0 RC (HIP) + ROCWMMA + hipBLASLt* | Release candidate for ROCm 7.0, with hipBLASLt and ROCWMMA for improved flash attention on RDNA3+/CDNA | | `rocm-7rc-rocwmma` | ROCm 7.0 RC (HIP) + ROCWMMA + hipBLASLt* | Release candidate for ROCm 7.0, with hipBLASLt and ROCWMMA for improved flash attention on RDNA3+/CDNA |
\* All these toolboxes now export `ROCBLAS_USE_HIPBLASLT=1` as this currently results in better perfromance and stability in *MOST* cases. \* All these toolboxes now export `ROCBLAS_USE_HIPBLASLT=1` as this currently results in better perfromance and stability in *MOST* cases.
> These containers are **automatically** rebuilt whenever the Llama.cpp master branch is updated, ensuring you get the latest bug fixes and new model support. The easiest way to update to the newest versions is by running the `refresh-toolboxes.sh` [script below](#211-toolbox-refresh-script-automatic-updates). > These containers are **automatically** rebuilt whenever the Llama.cpp master branch is updated, ensuring you get the latest bug fixes and new model support. The easiest way to update to the newest versions is by running the `refresh-toolboxes.sh` [script below](#211-toolbox-refresh-script-automatic-updates).
> *Each container is based on Fedora Rawhide and is built for maximum compatibility and performance on Strix Halo.* > *rocm-6.4.2* and *rocm-7beta* coontainers have been retired in favour of *rocm-6.4.3* and *rocm_7rc*.
--- ---
@@ -80,8 +78,8 @@ To use Llama.cpp with hardware acceleration inside a toolbox container, you must
* **For ROCm:** You must expose both `/dev/dri` and `/dev/kfd`, and add the user to extra groups for compute access. * **For ROCm:** You must expose both `/dev/dri` and `/dev/kfd`, and add the user to extra groups for compute access.
```sh ```sh
toolbox create llama-rocm-6.4.2 \ toolbox create llama-rocm-6.4.3-rocwmma \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \ --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.3-rocwmma \
-- --device /dev/dri --device /dev/kfd \ -- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
``` ```
@@ -114,7 +112,7 @@ This will:
You can also refresh just one or more toolboxes: You can also refresh just one or more toolboxes:
```bash ```bash
./refreshtoolboxes.sh llama-vulkan-amdvlk llama-rocm-6.4.2 ./refreshtoolboxes.sh llama-vulkan-radv llama-rocm-6.4.3-rocwmma
``` ```
### 2.2 Running models inside the toolboxes ### 2.2 Running models inside the toolboxes
@@ -150,39 +148,38 @@ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B
## 3. Performance Benchmarks (Key Results) ## 3. Performance Benchmarks (Key Results)
Benchmarks were run on **AMD Ryzen AI Max “Strix Halo”** across all supported backends, testing both **prompt processing (PP)** and **token generation (TG)** throughput.
Reported values were analysed using error margins (mean ± σ). Backends whose ranges overlapped were treated as statistical ties rather than hard wins.
🌐 Interactive exploration of the latest benchmark runs: [Interactie Benchmark Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) 🌐 Interactive exploration of the latest benchmark runs: [Interactie Benchmark Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
Benchmarks were analysed with **error-aware ties** (mean ± σ). If two backends overlap within margins, they are treated as a tie. All placement counts below use **Flash Attention ON**.
| Workload Focus | 🏆 Recommended Backend/Config | Win + Tie Count¹ | Typical Runner-Up | Stability Notes | **Prompt Processing (pp512)**
| ------------------------------------------------- | ----------------------------------- | ---------------: | ---------------------------------- | ------------------------------------------------------------------------------------- | | Backend | 1st | 2nd | 3rd |
| **Prompt processing** (pp512, Flash Attention ON) | **ROCm 7 RC + ROCWMMA + hipBLASLt** | 15 | Vulkan AMDVLK (4) | 0% errors in tests | | --- | ---: | ---: | ---: |
| **Token generation** (tg128, Flash Attention ON) | **Vulkan RADV** | 13 | Vulkan AMDVLK (1) | 0% errors in tests | | ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 9 | 5 | 0 |
| **Balanced workloads** | **Vulkan AMDVLK** | — | RADV / ROCm 7 RC+ROCWMMA+hipBLASLt | Fast PP & decent TG; \~5.6 % load failure rate due to ≤ 2 GiB single-allocation limit | | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 3 | 3 | 8 |
| **BF16 models** | **ROCm 7 RC + ROCWMMA + hipBLASLt** | — | ROCm 6.4.2 + ROCWMMA | Best PP & TG among ROCm backends; stable with Flash Attention ON | | Vulkan AMDVLK | 3 | 0 | 2 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 8 | 4 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 0 | 0 | 1 |
| Vulkan RADV | 0 | 0 | 1 |
¹ Counts show number of times the backend placed 1st (alone or tied) across tested models/quantisations. **Token Generation (tg128)**
| Backend | 1st | 2nd | 3rd |
| --- | ---: | ---: | ---: |
| Vulkan RADV | 13 | 0 | 0 |
| ROCm 6.4.3 (hipBLASLt) | 3 | 0 | 1 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1 | 4 | 3 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 1 | 2 | 4 |
| ROCm 6.4.3 (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC (hipBLASLt) | 1 | 0 | 4 |
| Vulkan AMDVLK | 0 | 10 | 0 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 0 | 1 | 2 |
### Summary & Recommendations
### Key take-aways - **Fastest prompt processing:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (most 1st-place finishes).
- **Fastest token generation:** Vulkan RADV (most 1st-place finishes).
* **ROCm 7 RC + ROCWMMA + hipBLASLt + Flash Attention ON** - **Balanced choice:** ROCm 6.4.3 + ROCWMMA (hipBLASLt) (consistently near the top across PP/TG).
* Fastest prompt processing in the vast majority of tests (15/22 wins or ties).
* Best ROCm option for BF16 models.
* Zero recorded errors with Flash Attention ON.
* **Vulkan RADV**
* Best token generation throughput (13/15 wins or ties).
* Most stable and broadly compatible backend overall.
* **Vulkan AMDVLK**
* Competitive in both PP and TG; benefits from margin-aware tie handling.
* Limited by ≤ 2 GiB single buffer allocation, which can block some model architectures.
* Other ROCm variants (beta, hblt0, 6.4.2 w/o ROCWMMA)
* Inconsistent performance and/or higher error rates; best suited for experimental use.
📄 Full per-model analysis: [docs/benchmarks.md](docs/benchmarks.md) 📄 Full per-model analysis: [docs/benchmarks.md](docs/benchmarks.md)
+571
View File
@@ -0,0 +1,571 @@
#!/usr/bin/env python3
"""
gen_benchmarks_md.py — Generate Markdown for README + detailed benchmarks from results.json
Defaults:
- Input JSON: ../docs/results.json
- Outputs: ./README_benchmarks_section.md and ./benchmarks_generated.md
"""
from __future__ import annotations
import json
import argparse
import statistics as stats
from pathlib import Path
from collections import defaultdict
from typing import Dict, List, Tuple, Optional
# === ENV LABELS ===
ENV_LABEL: Dict[str, str] = {
# ROCm 7 RC
"rocm7_rc-rocwmma": "ROCm 7 RC + ROCWMMA + hipBLASLt",
"rocm7_rc": "ROCm 7 RC (hipBLASLt)",
"rocm7_rc-hblt0": "ROCm 7 RC (hipBLASLt OFF)",
"rocm7_rc-rocwmma-hblt0": "ROCm 7 RC + ROCWMMA (hipBLASLt OFF)",
# ROCm 6.4.3
"rocm6_4_3": "ROCm 6.4.3 (hipBLASLt)",
"rocm6_4_3-hblt0": "ROCm 6.4.3 (hipBLASLt OFF)",
"rocm6_4_3-rocwmma": "ROCm 6.4.3 + ROCWMMA (hipBLASLt)",
"rocm6_4_3-rocwmma-hblt0": "ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF)",
# Vulkan
"vulkan_amdvlk": "Vulkan AMDVLK",
"vulkan_radv": "Vulkan RADV",
}
TESTS = ["pp512", "tg128"]
def md_row(values: List[str]) -> str:
return "| " + " | ".join(values) + " |"
def load_results(path: Path) -> Dict:
data = json.loads(path.read_text())
assert "runs" in data and isinstance(data["runs"], list), "results.json must have a top-level 'runs' list"
return data
def envs_present(runs: List[Dict], only_env: Optional[List[str]], include_all_envs: bool) -> List[str]:
present = {r.get("env") for r in runs if r.get("env")}
if only_env:
present = present.intersection(set(only_env))
if include_all_envs:
# Include even if not present (might appear 0 rows in tables)
envs = [e for e in ENV_LABEL.keys() if (not only_env or e in only_env)]
else:
envs = [e for e in ENV_LABEL.keys() if e in present and (not only_env or e in only_env)]
return envs
def fa_to_filter(fa: str) -> Optional[bool]:
fa = fa.lower().strip()
if fa == "on":
return True
if fa == "off":
return False
if fa == "any":
return None
raise ValueError("--fa must be on/off/any")
def margin_aware_placements(
runs: List[Dict],
envs: List[str],
test_filter: str,
fa_filter: Optional[bool]
) -> Tuple[Dict[str, Dict[str, int]], int]:
"""
Returns (placements, sample_count)
placements[env] -> {"first": n, "second": n, "third": n}
sample_count = number of model+quant comparisons considered
"""
placements = defaultdict(lambda: {"first": 0, "second": 0, "third": 0})
# group by (model, quant)
grouped = defaultdict(list)
for r in runs:
if r.get("error"):
continue
if r.get("test") != test_filter:
continue
if fa_filter is not None and r.get("fa") != fa_filter:
continue
if r.get("env") not in envs:
continue
key = (r.get("model_clean"), r.get("quant"))
grouped[key].append(r)
samples = 0
for key, entries in grouped.items():
# collate by env
env_groups = defaultdict(list)
for e in entries:
env_groups[e["env"]].append(e)
env_list = [e for e in envs if e in env_groups] # keep requested order
if len(env_list) < 2:
continue
# summarize median mean ± median err per env
summary = {}
for env in env_list:
means = [x["tps_mean"] for x in env_groups[env] if x.get("tps_mean") is not None]
errs = [x.get("tps_err", 0.0) or 0.0 for x in env_groups[env]]
if not means:
continue
m = stats.median(means)
e = stats.median(errs) if errs else 0.0
summary[env] = (m - e, m + e, m)
if len(summary) < 2:
continue
samples += 1
# rank with overlap -> ties share rank
remaining = [env for env, _ in sorted(summary.items(), key=lambda kv: kv[1][2], reverse=True)]
assigned = {}
current_rank = 1
while remaining and current_rank <= 3:
env0 = remaining[0]
low0, high0, _ = summary[env0]
tied = [env0]
for env in remaining[1:]:
low, high, _ = summary[env]
if not (low > high0 or high < low0): # overlap -> tie
tied.append(env)
for env in tied:
assigned[env] = current_rank
remaining = [e for e in remaining if e not in tied]
current_rank += 1
for env, rk in assigned.items():
if rk == 1:
placements[env]["first"] += 1
elif rk == 2:
placements[env]["second"] += 1
elif rk == 3:
placements[env]["third"] += 1
return placements, samples
def pairwise_win_counts(runs: List[Dict], envA: str, envB: str, test: str, fa_filter: Optional[bool]) -> Tuple[int, int, int, int]:
A = {}
B = {}
for r in runs:
if r.get("error") or r.get("test") != test:
continue
if fa_filter is not None and r.get("fa") != fa_filter:
continue
key = (r.get("model_clean"), r.get("quant"))
if r.get("env") == envA:
A[key] = r["tps_mean"]
elif r.get("env") == envB:
B[key] = r["tps_mean"]
winsA = winsB = ties = 0
for k in (set(A) & set(B)):
if A[k] > B[k]:
winsA += 1
elif B[k] > A[k]:
winsB += 1
else:
ties += 1
total = winsA + winsB + ties
return winsA, winsB, ties, total
def average_ranks(place_dict: Dict[str, Dict[str, int]]) -> Dict[str, Optional[float]]:
avg = {}
for env, c in place_dict.items():
total = c.get("first", 0) + c.get("second", 0) + c.get("third", 0)
if total == 0:
avg[env] = None
else:
avg[env] = round((1 * c.get("first", 0) + 2 * c.get("second", 0) + 3 * c.get("third", 0)) / total, 2)
return avg
def flash_attention_effect(runs: List[Dict], envs: List[str]) -> Dict[str, Dict[str, Dict[str, float]]]:
"""
Returns: effects[env][test] = {n_pairs, median_pct, min, max}
Based on paired model+quant runs (ON vs OFF).
"""
model_pairs = defaultdict(lambda: defaultdict(dict)) # (env,test)->(model,quant)->{fa: tps}
for r in runs:
if r.get("error") or r.get("tps_mean") is None:
continue
if r.get("test") not in TESTS:
continue
if r.get("env") not in envs:
continue
model_key = (r.get("model_clean"), r.get("quant"))
model_pairs[(r["env"], r["test"])][model_key][r.get("fa")] = r["tps_mean"]
summary = defaultdict(dict)
for (env, test), d in model_pairs.items():
deltas = []
for mk, vals in d.items():
if True in vals and False in vals and vals[False] > 0:
deltas.append((vals[True] - vals[False]) / vals[False] * 100.0)
if deltas:
summary[env][test] = {
"n_pairs": len(deltas),
"median_pct": round(stats.median(deltas), 1),
"min": round(min(deltas), 1),
"max": round(max(deltas), 1),
}
return summary
def rocwmma_effect(runs: List[Dict], pairs_to_compare: List[Tuple[str, str, str]], tests: List[str]) -> List[Tuple[str, str, str, str, int, float]]:
"""
Compare ROCWMMA ON vs OFF with same hipBLASLt state.
Returns rows of (context_label, test, env_on, env_off, n_pairs, median_delta_pct)
where delta_pct = median(ON/OFF - 1)*100 over common model+quant.
"""
rows = []
for env_on, env_off, label in pairs_to_compare:
for test in tests:
data_on = defaultdict(list)
data_off = defaultdict(list)
for r in runs:
if r.get("error") or r.get("test") != test:
continue
if r.get("env") == env_on:
data_on[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"])
elif r.get("env") == env_off:
data_off[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"])
common = sorted(set(data_on) & set(data_off))
if not common:
continue
ratios = []
for k in common:
aon = stats.median(data_on[k])
aoff = stats.median(data_off[k])
if aoff > 0:
ratios.append(aon / aoff - 1.0)
if ratios:
rows.append((label, test, env_on, env_off, len(ratios), round(100 * stats.median(ratios), 1)))
return rows
def hipblaslt_effect(runs: List[Dict], pairs_to_compare: List[Tuple[str, str, str]], tests: List[str]) -> List[Tuple[str, str, str, str, int, float]]:
"""
Compare hipBLASLt ON vs OFF with same ROCWMMA state.
Returns rows of (context_label, test, env_on, env_off, n_pairs, median_delta_pct)
where delta_pct = median(ON/OFF - 1)*100 over common model+quant.
"""
rows = []
for env_on, env_off, label in pairs_to_compare:
for test in tests:
data_on = defaultdict(list)
data_off = defaultdict(list)
for r in runs:
if r.get("error") or r.get("test") != test:
continue
if r.get("env") == env_on:
data_on[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"])
elif r.get("env") == env_off:
data_off[(r.get("model_clean"), r.get("quant"))].append(r["tps_mean"])
common = sorted(set(data_on) & set(data_off))
if not common:
continue
ratios = []
for k in common:
aon = stats.median(data_on[k])
aoff = stats.median(data_off[k])
if aoff > 0:
ratios.append(aon / aoff - 1.0)
if ratios:
rows.append((label, test, env_on, env_off, len(ratios), round(100 * stats.median(ratios), 1)))
return rows
def amdvlk_vs_radv(runs: List[Dict], fa_filter: Optional[bool]) -> List[Tuple[str, int, int, int, int]]:
rows = []
for test in TESTS:
wa, wr, ties, total = pairwise_win_counts(runs, "vulkan_amdvlk", "vulkan_radv", test, fa_filter)
rows.append((test, wa, wr, ties, total))
return rows
def winners(place_dict: Dict[str, Dict[str, int]], slot="first") -> Tuple[List[str], int]:
max_count = max((c.get(slot, 0) for c in place_dict.values()), default=0)
win_list = [env for env, c in place_dict.items() if c.get(slot, 0) == max_count and max_count > 0]
return win_list, max_count
def human_list(envs: List[str]) -> str:
return ", ".join(ENV_LABEL.get(e, e) for e in envs) if envs else ""
def build_readme_section(
envs: List[str],
pp_place: Dict[str, Dict[str, int]],
tg_place: Dict[str, Dict[str, int]],
fa_filter: Optional[bool]
) -> str:
# Winners
pp_wins, _ = winners(pp_place, "first")
tg_wins, _ = winners(tg_place, "first")
lines: List[str] = []
lines.append("## 3. Performance Benchmarks (Key Results)")
lines.append("")
lines.append("🌐 Interactive exploration of the latest benchmark runs: [Interactie Benchmark Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/)")
lines.append("")
lines.append("Benchmarks were analysed with **error-aware ties** (mean ± σ). If two backends overlap within margins, they are treated as a tie. All placement counts below use **Flash Attention ON**.")
lines.append("")
# Placement tables
def place_table(title: str, place_dict: Dict[str, Dict[str, int]]):
lines.append(f"**{title}**")
lines.append(md_row(["Backend", "1st", "2nd", "3rd"]))
lines.append(md_row(["---", "---:", "---:", "---:"]))
order = sorted(place_dict.items(), key=lambda kv: (-kv[1].get("first", 0), -kv[1].get("second", 0), kv[0]))
for env, c in order:
lines.append(md_row([ENV_LABEL.get(env, env), str(c.get("first", 0)), str(c.get("second", 0)), str(c.get("third", 0))]))
lines.append("")
place_table("Prompt Processing (pp512)", pp_place)
place_table("Token Generation (tg128)", tg_place)
# Data-driven recommendations
def total_score(c: Dict[str, int]) -> int:
# weight 1st more than 2nd
return c.get("first", 0) * 2 + c.get("second", 0)
best_bal_score = -1
balanced: List[str] = []
for env in envs:
score = total_score(pp_place.get(env, {})) + total_score(tg_place.get(env, {}))
if score > best_bal_score:
best_bal_score = score
balanced = [env]
elif score == best_bal_score:
balanced.append(env)
lines.append("### Summary & Recommendations")
lines.append(f"- **Fastest prompt processing:** {human_list(pp_wins)} (most 1st-place finishes).")
lines.append(f"- **Fastest token generation:** {human_list(tg_wins)} (most 1st-place finishes).")
lines.append(f"- **Balanced choice:** {human_list(balanced)} (consistently near the top across PP/TG).")
lines.append("")
lines.append("> **Note (ROCm 7):** Toolboxes enable **hipBLASLt** by default. The benchmark suite also runs **hipBLASLt OFF** variants to show its impact.")
return "\n".join(lines)
def build_benchmarks_doc(
runs: List[Dict],
envs: List[str],
pp_place: Dict[str, Dict[str, int]],
tg_place: Dict[str, Dict[str, int]],
fa_filter: Optional[bool],
) -> str:
lines: List[str] = []
lines.append("# AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)")
lines.append("")
lines.append("**Interactive results:** https://kyuz0.github.io/amd-strix-halo-toolboxes/")
lines.append("")
lines.append("## Table of Contents")
lines.append("- [Benchmark methodology](#benchmark-methodology)")
lines.append("- [Summary of current dataset (Flash Attention ON)](#summary-of-current-dataset-flash-attention-on)")
lines.append(" - [Placement counts](#placement-counts)")
lines.append(" - [Pairwise head-to-head wins](#pairwise-head-to-head-wins)")
lines.append(" - [Average ranks](#average-ranks)")
lines.append("- [Analyses by feature](#analyses-by-feature)")
lines.append(" - [Impact of Flash Attention](#impact-of-flash-attention)")
lines.append(" - [Impact of ROCWMMA](#impact-of-rocwmma)")
lines.append(" - [Impact of hipBLASLt](#impact-of-hipblaslt)")
lines.append(" - [Vulkan: AMDVLK vs RADV](#vulkan-amdvlk-vs-radv)")
lines.append("- [Recommendations](#recommendations)")
lines.append("- [Winner calculation](#winner-calculation)")
lines.append("")
lines.append("---")
lines.append("")
lines.append("## Benchmark methodology")
lines.append("")
lines.append("- **pp512** — prompt processing throughput (tokens/sec, prefill)")
lines.append("- **tg128** — token generation throughput (tokens/sec, interactive)")
lines.append("- Each backend tested twice per model: `-fa 0` and `-fa 1`")
lines.append("- Winners per model/test are **margin-aware**; multiple winners are possible when mean±σ overlap")
lines.append("- Built from the same llama.cpp commit for consistency")
lines.append("")
lines.append("**Backends in this dataset:** " + ", ".join(ENV_LABEL.get(e, e) for e in envs))
lines.append("")
lines.append("**ROCm 7 hipBLASLt policy:** Toolboxes ship with **hipBLASLt enabled** by default (`ROCBLAS_USE_HIPBLASLT=1`). The benchmark script also runs **hipBLASLt OFF** variants (`-hblt0`) to measure its effect.")
lines.append("")
lines.append("---")
lines.append("")
lines.append("## Summary of current dataset (Flash Attention ON)")
lines.append("")
# Placement counts
lines.append("### Placement counts")
def place_block(title: str, place_dict: Dict[str, Dict[str, int]]):
lines.append(f"**{title}**")
lines.append(md_row(["Backend", "1st", "2nd", "3rd"]))
lines.append(md_row(["---", "---:", "---:", "---:"]))
order = sorted(place_dict.items(), key=lambda kv: (-kv[1].get("first", 0), -kv[1].get("second", 0), kv[0]))
for env, c in order:
lines.append(md_row([ENV_LABEL.get(env, env), str(c.get("first", 0)), str(c.get("second", 0)), str(c.get("third", 0))]))
lines.append("")
place_block("Prompt Processing (pp512)", pp_place)
place_block("Token Generation (tg128)", tg_place)
# Pairwise wins
lines.append("### Pairwise head-to-head wins")
lines.append("For any model+quant where both backends succeeded, this counts who was faster (ties when equal).")
lines.append(md_row(["Comparison", "Test", "A wins", "B wins", "Ties", "Total"]))
lines.append(md_row(["---", "---", "---:", "---:", "---:", "---:"]))
pairs = [
("ROCm 7 RC + ROCWMMA + hipBLASLt", "Vulkan AMDVLK", "rocm7_rc-rocwmma", "vulkan_amdvlk"),
("ROCm 7 RC + ROCWMMA + hipBLASLt", "Vulkan RADV", "rocm7_rc-rocwmma", "vulkan_radv"),
("Vulkan AMDVLK", "Vulkan RADV", "vulkan_amdvlk", "vulkan_radv"),
]
for labelA, labelB, envA, envB in pairs:
for test in TESTS:
a, b, t, total = pairwise_win_counts(runs, envA, envB, test, fa_filter)
lines.append(md_row([f"{labelA} vs {labelB}", test, str(a), str(b), str(t), str(total)]))
lines.append("")
# Average ranks
lines.append("### Average ranks")
avg_pp = average_ranks(pp_place)
avg_tg = average_ranks(tg_place)
lines.append("**Prompt Processing (pp512)**")
lines.append(md_row(["Backend", "Avg Rank (↓ is better)"]))
lines.append(md_row(["---", "---:"]))
for env, val in sorted(avg_pp.items(), key=lambda kv: (kv[1] is None, kv[1] or 99)):
lines.append(md_row([ENV_LABEL.get(env, env), str(val) if val is not None else ""]))
lines.append("")
lines.append("**Token Generation (tg128)**")
lines.append(md_row(["Backend", "Avg Rank (↓ is better)"]))
lines.append(md_row(["---", "---:"]))
for env, val in sorted(avg_tg.items(), key=lambda kv: (kv[1] is None, kv[1] or 99)):
lines.append(md_row([ENV_LABEL.get(env, env), str(val) if val is not None else ""]))
lines.append("")
lines.append("---")
lines.append("")
lines.append("## Analyses by feature")
lines.append("")
# Flash Attention effect
lines.append("### Impact of Flash Attention")
fa_eff = flash_attention_effect(runs, envs)
lines.append("Median % change when **Flash Attention ON vs OFF**, paired by model+quant, per backend:")
lines.append(md_row(["Backend", "pp512 Δ% (median, min..max, n)", "tg128 Δ% (median, min..max, n)"]))
lines.append(md_row(["---", "---", "---"]))
def fmt_eff(row: Optional[Dict[str, float]]) -> str:
return f"{row['median_pct']}% ({row['min']}..{row['max']}), n={row['n_pairs']}" if row else ""
for env in envs:
row_pp = fa_eff.get(env, {}).get("pp512")
row_tg = fa_eff.get(env, {}).get("tg128")
lines.append(md_row([ENV_LABEL.get(env, env), fmt_eff(row_pp), fmt_eff(row_tg)]))
lines.append("")
# ROCWMMA effect — check both ROCm 7 and 6.4.3 families if present
lines.append("### Impact of ROCWMMA")
rocwmma_pairs = []
if "rocm7_rc-rocwmma" in envs and "rocm7_rc" in envs:
rocwmma_pairs.append(("rocm7_rc-rocwmma", "rocm7_rc", "ROCm 7 RC (hipBLASLt)"))
if "rocm7_rc-rocwmma-hblt0" in envs and "rocm7_rc-hblt0" in envs:
rocwmma_pairs.append(("rocm7_rc-rocwmma-hblt0", "rocm7_rc-hblt0", "ROCm 7 RC (hipBLASLt OFF)"))
if "rocm6_4_3-rocwmma" in envs and "rocm6_4_3" in envs:
rocwmma_pairs.append(("rocm6_4_3-rocwmma", "rocm6_4_3", "ROCm 6.4.3 (hipBLASLt)"))
if "rocm6_4_3-rocwmma-hblt0" in envs and "rocm6_4_3-hblt0" in envs:
rocwmma_pairs.append(("rocm6_4_3-rocwmma-hblt0", "rocm6_4_3-hblt0", "ROCm 6.4.3 (hipBLASLt OFF)"))
rocwmma_rows = rocwmma_effect(runs, rocwmma_pairs, TESTS)
lines.append(md_row(["Context", "Test", "Compared Envs", "Pairs", "Median Δ%"]))
lines.append(md_row(["---", "---", "---", "---:", "---:"]))
for label, test, env_on, env_off, n, delta in rocwmma_rows:
lines.append(md_row([label, test, f"{ENV_LABEL.get(env_on, env_on)} vs {ENV_LABEL.get(env_off, env_off)}", str(n), f"{delta}%"]))
lines.append("")
# hipBLASLt effect — for both ROCm 7 and 6.4.3 families
lines.append("### Impact of hipBLASLt")
hip_pairs = []
if "rocm7_rc" in envs and "rocm7_rc-hblt0" in envs:
hip_pairs.append(("rocm7_rc", "rocm7_rc-hblt0", "ROCm 7 RC (no ROCWMMA)"))
if "rocm7_rc-rocwmma" in envs and "rocm7_rc-rocwmma-hblt0" in envs:
hip_pairs.append(("rocm7_rc-rocwmma", "rocm7_rc-rocwmma-hblt0", "ROCm 7 RC + ROCWMMA"))
if "rocm6_4_3" in envs and "rocm6_4_3-hblt0" in envs:
hip_pairs.append(("rocm6_4_3", "rocm6_4_3-hblt0", "ROCm 6.4.3 (no ROCWMMA)"))
if "rocm6_4_3-rocwmma" in envs and "rocm6_4_3-rocwmma-hblt0" in envs:
hip_pairs.append(("rocm6_4_3-rocwmma", "rocm6_4_3-rocwmma-hblt0", "ROCm 6.4.3 + ROCWMMA"))
hip_rows = hipblaslt_effect(runs, hip_pairs, TESTS)
lines.append(md_row(["Context", "Test", "Compared Envs", "Pairs", "Median Δ%"]))
lines.append(md_row(["---", "---", "---", "---:", "---:"]))
for label, test, env_on, env_off, n, delta in hip_rows:
lines.append(md_row([label, test, f"{ENV_LABEL.get(env_on, env_on)} vs {ENV_LABEL.get(env_off, env_off)}", str(n), f"{delta}%"]))
lines.append("")
# AMDVLK vs RADV
lines.append("### Vulkan: AMDVLK vs RADV")
lines.append("Head-to-head wins with selected Flash Attention filter:")
lines.append(md_row(["Test", "AMDVLK wins", "RADV wins", "Ties", "Total"]))
lines.append(md_row(["---", "---:", "---:", "---:", "---:"]))
for test, wa, wr, t, total in amdvlk_vs_radv(runs, fa_filter):
lines.append(md_row([test, str(wa), str(wr), str(t), str(total)]))
lines.append("")
lines.append("---")
lines.append("")
lines.append("## Recommendations")
pp_wins, _ = winners(pp_place, "first")
tg_wins, _ = winners(tg_place, "first")
lines.append(f"- **Fastest prompt processing:** {human_list(pp_wins)} (most 1st-place finishes with selected Flash Attention filter).")
lines.append(f"- **Fastest token generation:** {human_list(tg_wins)} (most 1st-place finishes with selected Flash Attention filter).")
# Balanced: highest (2*first + second) across PP+TG
def score(c: Dict[str, int]) -> int:
return c.get("first", 0) * 2 + c.get("second", 0)
best_bal = -1
balanced: List[str] = []
for env in envs:
s = score(pp_place.get(env, {})) + score(tg_place.get(env, {}))
if s > best_bal:
best_bal = s
balanced = [env]
elif s == best_bal:
balanced.append(env)
lines.append(f"- **Balanced choice:** {human_list(balanced)} (consistently near the top across PP/TG).")
lines.append("")
lines.append("---")
lines.append("")
lines.append("## Winner calculation")
lines.append("A backend is counted as a winner if its mean throughput is within the best backends pooled ± error margin for that model/test type. This treats results within measurement noise as ties instead of false losses.")
return "\n".join(lines)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--file", type=Path, default=Path("../docs/results.json"),
help="Path to results.json (default: ../docs/results.json)")
ap.add_argument("--out-readme", type=Path, default=Path("./README_benchmarks_section.md"),
help="Path to write README section Markdown (default: ./README_benchmarks_section.md)")
ap.add_argument("--out-bench", type=Path, default=Path("./benchmarks_generated.md"),
help="Path to write detailed benchmarks Markdown (default: ./benchmarks_generated.md)")
ap.add_argument("--fa", choices=["on", "off", "any"], default="on",
help="Flash Attention filter (default: on)")
ap.add_argument("--include-all-envs", action="store_true",
help="Include envs even if not present in results.json")
ap.add_argument("--only-env", action="append",
help="Restrict analysis to specific env keys (repeatable)")
args = ap.parse_args()
data = load_results(args.file)
runs: List[Dict] = data["runs"]
fa_filter = fa_to_filter(args.fa)
envs = envs_present(runs, args.only_env, args.include_all_envs)
pp_place, _ = margin_aware_placements(runs, envs, "pp512", fa_filter)
tg_place, _ = margin_aware_placements(runs, envs, "tg128", fa_filter)
readme_md = build_readme_section(envs, pp_place, tg_place, fa_filter)
args.out_readme.write_text(readme_md)
bench_md = build_benchmarks_doc(runs, envs, pp_place, tg_place, fa_filter)
args.out_bench.write_text(bench_md)
print(f"Wrote:\n - {args.out_readme}\n - {args.out_bench}")
if __name__ == "__main__":
main()
-175
View File
@@ -1,175 +0,0 @@
#!/usr/bin/env python3
import json, re
from collections import defaultdict
from pathlib import Path
RESULTS_FILE = "../docs/results.json"
# Column order + labels
ENV_ORDER = [
"vulkan_amdvlk",
"vulkan_radv",
"rocm6_4_2",
"rocm6_4_2-rocwmma",
"rocm7_beta",
"rocm7_rc",
]
COL_NAMES = {
"vulkan_amdvlk": "Vulkan (AMDVLK)",
"vulkan_radv": "Vulkan (RADV)",
"rocm6_4_2": "ROCm 6.4.2",
"rocm6_4_2-rocwmma": "ROCm 6.4.2 + ROCWMMA",
"rocm7_beta": "ROCm 7.0 Beta",
"rocm7_rc": "ROCm 7.0 RC",
}
WINNER_NAMES = {
"vulkan_amdvlk": "AMDVLK",
"vulkan_radv": "RADV",
"rocm6_4_2": "ROCm6.4.2",
"rocm6_4_2-rocwmma": "ROCm6.4.2+ROCWMMA",
"rocm7_beta": "ROCm7 Beta",
"rocm7_rc": "ROCm7 RC",
}
ERROR_LABEL = {
"load": "⚠️ Load Error",
"hang": "⚠️ GPU Hang",
"runtime": "⚠️ Runtime Error",
}
DEFAULT_MODELS = [
("Gemma3 12B Q8_0", "gemma-3-12b"),
("Gemma3 27B BF16", "gemma-3-27b"),
("Llama-4-Scout 17B Q8_0", "llama-4-scout-17b-16e-instruct-q8_0"),
("Llama-4-Scout 17B Q4_K XL", "llama-4-scout-17b-16e-instruct-q4_k_xl"),
("Qwen3 30B BF16", "qwen3-30b-a3b-bf16"),
("Qwen3-235B Q3_K XL", "qwen3-235b-a22b"),
("GLM-4.5-Air-Q4_K_XL", "glm-4.5-air-q4_k_xl"),
("GLM-4.5-Air-Q6_K_XL", "glm-4.5-air-q6_k_xl"),
("gpt-oss-120b-mxfp4", "gpt-oss-120b-mxfp4"),
("gpt-oss-20b-mxfp4", "gpt-oss-20b-mxfp4"),
]
SHARD_RE = re.compile(r"-000\d+-of-000\d+", re.IGNORECASE)
def norm_model(s: str) -> str:
s = (s or "").lower().replace("_", "-")
s = SHARD_RE.sub("", s)
s = s.replace("-ud", "")
return s
raw = json.loads(Path(RESULTS_FILE).read_text(encoding="utf-8"))
runs = raw["runs"]
buckets = defaultdict(list)
error_only = defaultdict(list)
all_models = set()
for r in runs:
env = r.get("env")
if env not in ENV_ORDER:
continue
mkey = norm_model(r.get("model_clean") or r.get("model") or "")
all_models.add(mkey)
test = r.get("test")
if test in ("pp512", "tg128"):
buckets[(mkey, env, test)].append(r)
else:
if r.get("error"):
error_only[(mkey, env)].append(r.get("error_type") or "runtime")
def pick_best(rows):
best, best_val, fallback = None, -1, None
for r in rows:
if r.get("error"):
fallback = r
continue
v = r.get("tps_mean")
if isinstance(v, (int, float)) and v > best_val:
best_val, best = v, r
return best or fallback
chosen = defaultdict(lambda: defaultdict(dict))
for (mkey, env, test), rows in buckets.items():
chosen_row = pick_best(rows)
chosen[mkey][env][test] = chosen_row
for (mkey, env), etypes in error_only.items():
if etypes:
if "load" in etypes:
chosen[mkey][env]["error_only"] = "load"
elif "hang" in etypes:
chosen[mkey][env]["error_only"] = "hang"
else:
chosen[mkey][env]["error_only"] = "runtime"
def fa_tag(row):
if not row or row.get("error"):
return ""
fa = row.get("fa")
if fa is None:
return ""
return " (FA on)" if fa else " (FA off)"
def format_cell(entry_dict):
pp = entry_dict.get("pp512")
tg = entry_dict.get("tg128")
for row in (pp, tg):
if row and row.get("error"):
return ERROR_LABEL.get(row.get("error_type") or "runtime", "⚠️ Error")
if not pp and not tg:
et = entry_dict.get("error_only")
if et:
return ERROR_LABEL.get(et, "⚠️ Error")
return ""
def fmt(v):
return f"{int(round(v))}" if isinstance(v, (int, float)) else ""
ppv = pp.get("tps_mean") if pp else None
tgv = tg.get("tps_mean") if tg else None
pp_suffix = fa_tag(pp)
tg_suffix = fa_tag(tg)
if isinstance(tgv, (int, float)):
return f"{fmt(ppv)} pp{pp_suffix} / {tgv:.1f} tg{tg_suffix}"
else:
return f"{fmt(ppv)} pp{pp_suffix} / — tg"
def best_env_for(mkey, test):
best_env, best_val, best_row = None, -1, None
for env in ENV_ORDER:
row = chosen[mkey].get(env, {}).get(test)
if not row or row.get("error"):
continue
v = row.get("tps_mean")
if isinstance(v, (int, float)) and v > best_val:
best_env, best_val, best_row = env, v, row
return best_env, (best_row.get("fa") if best_row else None)
def win_label(env, fa):
if not env:
return ""
base = WINNER_NAMES[env]
if fa is None:
return f"🏆 **{base}**"
return f"🏆 **{base}** ({'FA on' if fa else 'FA off'})"
def find_model_key(fuzzy):
needle = norm_model(fuzzy)
for k in all_models:
if needle in k:
return k
return None
# Header now has Best PP & Best TG right after Model
header = ["Model", "🏆 Best PP", "🏆 Best TG"] + [COL_NAMES[e] for e in ENV_ORDER]
print("| " + " | ".join(header) + " |")
print("|" + "|".join(["---"] * len(header)) + "|")
for disp, fuzzy in DEFAULT_MODELS:
mkey = find_model_key(fuzzy)
if not mkey:
print("| " + " | ".join([f"**{disp}**", "", ""] + [""]*len(ENV_ORDER)) + " |")
continue
bpp_env, bpp_fa = best_env_for(mkey, "pp512")
btg_env, btg_fa = best_env_for(mkey, "tg128")
row = [f"**{disp}**", win_label(bpp_env, bpp_fa), win_label(btg_env, btg_fa)]
for env in ENV_ORDER:
row.append(format_cell(chosen[mkey].get(env, {})))
print("| " + " | ".join(row) + " |")
@@ -1,172 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B
llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 72B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B
llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft...
llama_model_loader: - kv 13: qwen2.block_count u32 = 80
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt
llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 963
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type f16: 107 tensors
llama_model_loader: - type q8_0: 455 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.21 GiB (9.24 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Kimi-Dev-72B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 77715.11 MiB
load_tensors: ROCm_Host model buffer size = 2376.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 313.00 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2887
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1808727616
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello0
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 31746.03 tokens per second)
llama_perf_context_print: load time = 31744.47 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 463.93 ms / 1 runs ( 463.93 ms per token, 2.16 tokens per second)
llama_perf_context_print: total time = 470.35 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 36.639378936s
Run #3 status: 0
→ Avg over 3 runs: 35.301s
@@ -1,172 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B
llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 72B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B
llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft...
llama_model_loader: - kv 13: qwen2.block_count u32 = 80
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt
llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 963
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type f16: 107 tensors
llama_model_loader: - type q8_0: 455 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.21 GiB (9.24 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Kimi-Dev-72B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 77715.11 MiB
load_tensors: ROCm_Host model buffer size = 2376.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 313.00 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2887
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3691857665
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello0
llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 27027.03 tokens per second)
llama_perf_context_print: load time = 30932.72 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 559.63 ms / 1 runs ( 559.63 ms per token, 1.79 tokens per second)
llama_perf_context_print: total time = 566.03 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 32.156014765s
Run #3 status: 0
→ Avg over 3 runs: 30.024s
@@ -1,172 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B
llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 72B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B
llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft...
llama_model_loader: - kv 13: qwen2.block_count u32 = 80
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt
llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 963
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type f16: 107 tensors
llama_model_loader: - type q8_0: 455 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.21 GiB (9.24 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Kimi-Dev-72B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 77715.11 MiB
load_tensors: ROCm_Host model buffer size = 2376.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 313.00 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2887
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3133611532
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello0
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second)
llama_perf_context_print: load time = 25127.98 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 383.37 ms / 1 runs ( 383.37 ms per token, 2.61 tokens per second)
llama_perf_context_print: total time = 389.90 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 26.238043008s
Run #3 status: 0
→ Avg over 3 runs: 26.362s
@@ -1,123 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B
llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 72B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B
llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft...
llama_model_loader: - kv 13: qwen2.block_count u32 = 80
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt
llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 963
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type f16: 107 tensors
llama_model_loader: - type q8_0: 455 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.21 GiB (9.24 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Kimi-Dev-72B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
ggml_vulkan: Device memory allocation of size 2491416576 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2491416576
llama_model_load: error loading model: unable to allocate Vulkan0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf'
main: error: unable to load model
Elapsed #3: .334893088s
Run #3 status: 1
✖ run #3 failed
→ No successful runs
@@ -1,170 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /home/kyuz0/models/kimi-dev-72B-Q8_K_XL/UD-Q8_K_XL/Kimi-Dev-72B-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Kimi-Dev-72B
llama_model_loader: - kv 3: general.basename str = Kimi-Dev-72B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 72B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Kimi Dev 72B
llama_model_loader: - kv 10: general.base_model.0.organization str = Moonshotai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/moonshotai/Kim...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["code", "unsloth", "swebench", "soft...
llama_model_loader: - kv 13: qwen2.block_count u32 = 80
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 8192
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 29568
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 64
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Kimi-Dev-72B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Kimi-Dev-72B.txt
llama_model_loader: - kv 34: quantize.imatrix.entries_count u32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count u32 = 685
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 963
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type f16: 107 tensors
llama_model_loader: - type q8_0: 455 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.21 GiB (9.24 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Kimi-Dev-72B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Vulkan0 model buffer size = 77715.09 MiB
load_tensors: Vulkan_Host model buffer size = 2376.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 313.00 MiB
llama_context: Vulkan_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2887
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4071074447
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello beğen
llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second)
llama_perf_context_print: load time = 29902.30 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 392.32 ms / 1 runs ( 392.32 ms per token, 2.55 tokens per second)
llama_perf_context_print: total time = 399.50 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 30.654893638s
Run #3 status: 0
→ Avg over 3 runs: 30.591s
@@ -1,163 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 80
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 8192
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 12: llama.attention.head_count u32 = 64
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 724
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 455 tensors
llama_model_loader: - type bf16: 107 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 75.65 GiB (9.21 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama-3.3-70B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 75456.53 MiB
load_tensors: ROCm_Host model buffer size = 2004.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 192699360
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 63829.79 tokens per second)
llama_perf_context_print: load time = 24487.91 ms
llama_perf_context_print: prompt eval time = 368.54 ms / 2 tokens ( 184.27 ms per token, 5.43 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 383.50 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 28.922457711s
Run #3 status: 0
→ Avg over 3 runs: 30.998s
@@ -1,163 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 80
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 8192
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 12: llama.attention.head_count u32 = 64
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 724
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 455 tensors
llama_model_loader: - type bf16: 107 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 75.65 GiB (9.21 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama-3.3-70B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 75456.53 MiB
load_tensors: ROCm_Host model buffer size = 2004.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3478849877
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello H
llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 53571.43 tokens per second)
llama_perf_context_print: load time = 32005.62 ms
llama_perf_context_print: prompt eval time = 456.36 ms / 2 tokens ( 228.18 ms per token, 4.38 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 471.29 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 33.222127697s
Run #3 status: 0
→ Avg over 3 runs: 32.796s
@@ -1,163 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 80
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 8192
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 12: llama.attention.head_count u32 = 64
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 724
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 455 tensors
llama_model_loader: - type bf16: 107 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 75.65 GiB (9.21 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama-3.3-70B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: ROCm0 model buffer size = 75456.53 MiB
load_tensors: ROCm_Host model buffer size = 2004.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4130863841
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello:
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 44117.65 tokens per second)
llama_perf_context_print: load time = 32184.35 ms
llama_perf_context_print: prompt eval time = 697.57 ms / 2 tokens ( 348.79 ms per token, 2.87 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 712.61 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 33.659541277s
Run #3 status: 0
→ Avg over 3 runs: 32.911s
@@ -1,161 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 80
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 8192
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 12: llama.attention.head_count u32 = 64
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 724
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 455 tensors
llama_model_loader: - type bf16: 107 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 75.65 GiB (9.21 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama-3.3-70B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Vulkan0 model buffer size = 75456.53 MiB
load_tensors: Vulkan_Host model buffer size = 2004.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 266.50 MiB
llama_context: Vulkan_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 327404797
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 50847.46 tokens per second)
llama_perf_context_print: load time = 26953.87 ms
llama_perf_context_print: prompt eval time = 387.45 ms / 2 tokens ( 193.72 ms per token, 5.16 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 404.05 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 28.173844492s
Run #3 status: 0
→ Avg over 3 runs: 30.604s
@@ -1,161 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-70B-Instruct/UD-Q8_K_XL/Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.3-70B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 80
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 8192
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 12: llama.attention.head_count u32 = 64
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 7
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.3-70B-Instruct-GGUF/imatrix_u...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.3-70B-Ins...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.tensors.count i32 = 724
llama_model_loader: - kv 38: split.count u16 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 455 tensors
llama_model_loader: - type bf16: 107 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 75.65 GiB (9.21 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama-3.3-70B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Vulkan0 model buffer size = 75456.53 MiB
load_tensors: Vulkan_Host model buffer size = 2004.00 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 266.50 MiB
llama_context: Vulkan_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 2154218339
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hellos
llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 51724.14 tokens per second)
llama_perf_context_print: load time = 29443.29 ms
llama_perf_context_print: prompt eval time = 376.13 ms / 2 tokens ( 188.07 ms per token, 5.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 392.17 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 30.227365941s
Run #3 status: 0
→ Avg over 3 runs: 30.376s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 18
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q6_K: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 82.35 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 809.29 MiB
load_tensors: ROCm0 model buffer size = 83513.68 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1642319140
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 42857.14 tokens per second)
llama_perf_context_print: load time = 26639.60 ms
llama_perf_context_print: prompt eval time = 107.52 ms / 2 tokens ( 53.76 ms per token, 18.60 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 127.12 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 30.905590182s
Run #3 status: 0
→ Avg over 3 runs: 31.792s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 18
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q6_K: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 82.35 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 809.29 MiB
load_tensors: ROCm0 model buffer size = 83513.68 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1329865451
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello1
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 44776.12 tokens per second)
llama_perf_context_print: load time = 27337.52 ms
llama_perf_context_print: prompt eval time = 135.84 ms / 2 tokens ( 67.92 ms per token, 14.72 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 155.35 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 28.220065203s
Run #3 status: 0
→ Avg over 3 runs: 28.221s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 18
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q6_K: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 82.35 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 809.29 MiB
load_tensors: ROCm0 model buffer size = 83513.68 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3194189125
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello:
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second)
llama_perf_context_print: load time = 26424.61 ms
llama_perf_context_print: prompt eval time = 106.73 ms / 2 tokens ( 53.37 ms per token, 18.74 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 126.53 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 27.353142250s
Run #3 status: 0
→ Avg over 3 runs: 28.435s
@@ -1,177 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 18
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q6_K: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 82.35 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 83513.68 MiB
load_tensors: CPU model buffer size = 809.29 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4111748233
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello:
llama_perf_sampler_print: sampling time = 0.15 ms / 3 runs ( 0.05 ms per token, 20134.23 tokens per second)
llama_perf_context_print: load time = 31375.27 ms
llama_perf_context_print: prompt eval time = 267.76 ms / 2 tokens ( 133.88 ms per token, 7.47 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 295.92 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 33.122388042s
Run #3 status: 0
→ Avg over 3 runs: 35.541s
@@ -1,177 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q6_K/Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 18
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q6_K: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 82.35 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 83513.68 MiB
load_tensors: CPU model buffer size = 809.29 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.02 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1422642604
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello1
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second)
llama_perf_context_print: load time = 32072.23 ms
llama_perf_context_print: prompt eval time = 296.78 ms / 2 tokens ( 148.39 ms per token, 6.74 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 324.57 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 32.859879045s
Run #3 status: 0
→ Avg over 3 runs: 32.810s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 3
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q8_0: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 106.65 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 108165.12 MiB
load_tensors: ROCm_Host model buffer size = 1048.22 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 434.62 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 2885096603
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello.
llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 46875.00 tokens per second)
llama_perf_context_print: load time = 36882.65 ms
llama_perf_context_print: prompt eval time = 127.76 ms / 2 tokens ( 63.88 ms per token, 15.65 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 158.41 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 41.426125320s
Run #3 status: 0
→ Avg over 3 runs: 40.739s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 3
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q8_0: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 106.65 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 108165.12 MiB
load_tensors: ROCm_Host model buffer size = 1048.22 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 434.62 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1149431120
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello:
llama_perf_sampler_print: sampling time = 0.06 ms / 3 runs ( 0.02 ms per token, 48387.10 tokens per second)
llama_perf_context_print: load time = 35959.68 ms
llama_perf_context_print: prompt eval time = 127.62 ms / 2 tokens ( 63.81 ms per token, 15.67 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 157.80 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 36.919182117s
Run #3 status: 0
→ Avg over 3 runs: 36.400s
@@ -1,179 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 3
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q8_0: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 106.65 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 108165.12 MiB
load_tensors: ROCm_Host model buffer size = 1048.22 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 434.62 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 1
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 406280533
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello The
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 45454.55 tokens per second)
llama_perf_context_print: load time = 34222.03 ms
llama_perf_context_print: prompt eval time = 136.79 ms / 2 tokens ( 68.40 ms per token, 14.62 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 156.58 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 35.217307205s
Run #3 status: 0
→ Avg over 3 runs: 35.742s
@@ -1,177 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 3
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q8_0: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 106.65 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 108165.12 MiB
load_tensors: Vulkan_Host model buffer size = 1048.22 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3690416473
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second)
llama_perf_context_print: load time = 41237.01 ms
llama_perf_context_print: prompt eval time = 233.96 ms / 2 tokens ( 116.98 ms per token, 8.55 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 261.97 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 45.548750208s
Run #3 status: 0
→ Avg over 3 runs: 47.967s
@@ -1,177 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 3
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q8_0: 482 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 106.65 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 108165.12 MiB
load_tensors: Vulkan_Host model buffer size = 1048.22 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.02 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4068031204
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32967.03 tokens per second)
llama_perf_context_print: load time = 41299.30 ms
llama_perf_context_print: prompt eval time = 252.99 ms / 2 tokens ( 126.49 ms per token, 7.91 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 280.67 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 42.081911936s
Run #3 status: 0
→ Avg over 3 runs: 41.626s
@@ -1,181 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q4_K: 421 tensors
llama_model_loader: - type q5_K: 43 tensors
llama_model_loader: - type q6_K: 18 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 57.73 GiB (4.60 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 554.94 MiB
load_tensors: ROCm0 model buffer size = 58558.57 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4182963810
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello The
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second)
llama_perf_context_print: load time = 9663.18 ms
llama_perf_context_print: prompt eval time = 90.98 ms / 2 tokens ( 45.49 ms per token, 21.98 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 110.40 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 13.853856771s
Run #3 status: 0
→ Avg over 3 runs: 15.776s
@@ -1,162 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q4_K: 421 tensors
llama_model_loader: - type q5_K: 43 tensors
llama_model_loader: - type q6_K: 18 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 57.73 GiB (4.60 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 554.94 MiB
load_tensors: ROCm0 model buffer size = 58558.57 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
HW Exception by GPU node-1 (Agent handle: 0x48fa1f0) reason :GPU Hang
Elapsed #3: 22.180402418s
Run #3 status: 134
✖ run #3 failed
→ No successful runs
@@ -1,174 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q4_K: 421 tensors
llama_model_loader: - type q5_K: 43 tensors
llama_model_loader: - type q6_K: 18 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 57.73 GiB (4.60 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 554.94 MiB
load_tensors: ROCm0 model buffer size = 58558.57 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 442.62 MiB
llama_context: ROCm_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 722371466
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello Elapsed #3: 22.602610057s
Run #3 status: 134
✖ run #3 failed
→ Avg over 2 runs: 19.365s
@@ -1,179 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q4_K: 421 tensors
llama_model_loader: - type q5_K: 43 tensors
llama_model_loader: - type q6_K: 18 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 57.73 GiB (4.60 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 58558.57 MiB
load_tensors: CPU model buffer size = 554.94 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.01 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 83044290
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.16 ms / 3 runs ( 0.05 ms per token, 18518.52 tokens per second)
llama_perf_context_print: load time = 13560.35 ms
llama_perf_context_print: prompt eval time = 257.61 ms / 2 tokens ( 128.81 ms per token, 7.76 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 285.54 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 14.548378284s
Run #3 status: 0
→ Avg over 3 runs: 16.752s
@@ -1,179 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 628 tensors from /home/kyuz0/models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 3: general.finetune str = 16E-Instruct
llama_model_loader: - kv 4: general.basename str = Llama-4-Scout-17B-16E-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 17B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = llama4
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 4 Scout 17B 16E Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 15: general.languages arr[str,12] = ["ar", "de", "en", "es", "fr", "hi", ...
llama_model_loader: - kv 16: llama4.block_count u32 = 48
llama_model_loader: - kv 17: llama4.context_length u32 = 10485760
llama_model_loader: - kv 18: llama4.embedding_length u32 = 5120
llama_model_loader: - kv 19: llama4.feed_forward_length u32 = 16384
llama_model_loader: - kv 20: llama4.attention.head_count u32 = 40
llama_model_loader: - kv 21: llama4.attention.head_count_kv u32 = 8
llama_model_loader: - kv 22: llama4.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 23: llama4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: llama4.expert_count u32 = 16
llama_model_loader: - kv 25: llama4.expert_used_count u32 = 1
llama_model_loader: - kv 26: llama4.attention.key_length u32 = 128
llama_model_loader: - kv 27: llama4.attention.value_length u32 = 128
llama_model_loader: - kv 28: llama4.vocab_size u32 = 202048
llama_model_loader: - kv 29: llama4.rope.dimension_count u32 = 128
llama_model_loader: - kv 30: llama4.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 31: llama4.expert_feed_forward_length u32 = 8192
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = llama4
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,202048] = ["À", "Á", "õ", "ö", "÷", "ø", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,202048] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,439802] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 200000
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 200008
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 200018
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - kv 44: quantize.imatrix.file str = Llama-4-Scout-17B-16E-Instruct-GGUF/i...
llama_model_loader: - kv 45: quantize.imatrix.dataset str = unsloth_calibration_Llama-4-Scout-17B...
llama_model_loader: - kv 46: quantize.imatrix.entries_count u32 = 528
llama_model_loader: - kv 47: quantize.imatrix.chunks_count u32 = 729
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.tensors.count i32 = 628
llama_model_loader: - kv 50: split.count u16 = 2
llama_model_loader: - type f32: 146 tensors
llama_model_loader: - type q4_K: 421 tensors
llama_model_loader: - type q5_K: 43 tensors
llama_model_loader: - type q6_K: 18 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 57.73 GiB (4.60 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1135
load: token to piece cache size = 1.3873 MB
print_info: arch = llama4
print_info: vocab_only = 0
print_info: n_ctx_train = 10485760
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 8192
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 16
print_info: n_expert_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 10485760
print_info: rope_finetuned = unknown
print_info: model type = 17Bx16E (Scout)
print_info: model params = 107.77 B
print_info: general.name = Llama-4-Scout-17B-16E-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 202048
print_info: n_merges = 439802
print_info: BOS token = 200000 '<|begin_of_text|>'
print_info: EOS token = 200008 '<|eot|>'
print_info: PAD token = 200018 '<|finetune_right_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 200002 '<|fim_prefix|>'
print_info: FIM SUF token = 200004 '<|fim_suffix|>'
print_info: FIM MID token = 200003 '<|fim_middle|>'
print_info: EOG token = 200001 '<|end_of_text|>'
print_info: EOG token = 200008 '<|eot|>'
print_info: max token length = 192
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 58558.57 MiB
load_tensors: CPU model buffer size = 554.94 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 192.00 MiB
llama_kv_cache_unified: size = 192.00 MiB ( 4096 cells, 12 layers, 1/ 1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 576.00 MiB
llama_kv_cache_unified: size = 576.00 MiB ( 4096 cells, 36 layers, 1/ 1 seqs), K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 440.63 MiB
llama_context: Vulkan_Host compute buffer size = 26.02 MiB
llama_context: graph nodes = 2420
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 2510811977
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello (
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32608.70 tokens per second)
llama_perf_context_print: load time = 16387.21 ms
llama_perf_context_print: prompt eval time = 291.47 ms / 2 tokens ( 145.73 ms per token, 6.86 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 319.42 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 17.154124582s
Run #3 status: 0
→ Avg over 3 runs: 20.045s
@@ -1,184 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 12
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q3_K: 267 tensors
llama_model_loader: - type q4_K: 362 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 96.99 GiB (3.54 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: CPU model buffer size = 333.84 MiB
load_tensors: ROCm0 model buffer size = 98988.40 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB
llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 304.75 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 6023
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4068503868
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello,
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second)
llama_perf_context_print: load time = 34531.90 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 74.04 ms / 1 runs ( 74.04 ms per token, 13.51 tokens per second)
llama_perf_context_print: total time = 87.46 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 38.606270419s
Run #3 status: 0
→ Avg over 3 runs: 39.062s
@@ -1,184 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 12
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q3_K: 267 tensors
llama_model_loader: - type q4_K: 362 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 96.99 GiB (3.54 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: CPU model buffer size = 333.84 MiB
load_tensors: ROCm0 model buffer size = 98988.40 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB
llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 304.75 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 6023
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 698255200
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello!
llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second)
llama_perf_context_print: load time = 34496.41 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 74.48 ms / 1 runs ( 74.48 ms per token, 13.43 tokens per second)
llama_perf_context_print: total time = 87.80 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 35.247053632s
Run #3 status: 0
→ Avg over 3 runs: 35.392s
@@ -1,184 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 12
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q3_K: 267 tensors
llama_model_loader: - type q4_K: 362 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 96.99 GiB (3.54 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: CPU model buffer size = 333.84 MiB
load_tensors: ROCm0 model buffer size = 98988.40 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 752.00 MiB
llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 304.75 MiB
llama_context: ROCm_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 6023
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 715670654
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello,
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 34482.76 tokens per second)
llama_perf_context_print: load time = 31968.90 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 73.79 ms / 1 runs ( 73.79 ms per token, 13.55 tokens per second)
llama_perf_context_print: total time = 87.27 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 32.781452355s
Run #3 status: 0
→ Avg over 3 runs: 33.458s
@@ -1,182 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 12
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q3_K: 267 tensors
llama_model_loader: - type q4_K: 362 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 96.99 GiB (3.54 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: Vulkan0 model buffer size = 98988.40 MiB
load_tensors: CPU model buffer size = 333.84 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 752.00 MiB
llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 6023
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4076614647
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello,
llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 28571.43 tokens per second)
llama_perf_context_print: load time = 40072.88 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 67.40 ms / 1 runs ( 67.40 ms per token, 14.84 tokens per second)
llama_perf_context_print: total time = 86.12 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 43.569299668s
Run #3 status: 0
→ Avg over 3 runs: 44.883s
@@ -1,182 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/kyuz0/models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 12
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 745
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 693
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q3_K: 267 tensors
llama_model_loader: - type q4_K: 362 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 96.99 GiB (3.54 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: Vulkan0 model buffer size = 98988.40 MiB
load_tensors: CPU model buffer size = 333.84 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 752.00 MiB
llama_kv_cache_unified: size = 752.00 MiB ( 4096 cells, 94 layers, 1/ 1 seqs), K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 6023
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1959920459
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello,
llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 25641.03 tokens per second)
llama_perf_context_print: load time = 40114.24 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 67.08 ms / 1 runs ( 67.08 ms per token, 14.91 tokens per second)
llama_perf_context_print: total time = 86.46 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 40.621909942s
Run #3 status: 0
→ Avg over 3 runs: 40.722s
@@ -1,167 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: split.no u16 = 0
llama_model_loader: - kv 32: split.count u16 = 2
llama_model_loader: - kv 33: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1093628111
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello -
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 34482.76 tokens per second)
llama_perf_context_print: load time = 19374.51 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 42.85 ms / 1 runs ( 42.85 ms per token, 23.34 tokens per second)
llama_perf_context_print: total time = 73.04 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 23.364750813s
Run #3 status: 0
→ Avg over 3 runs: 22.166s
@@ -1,167 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: split.no u16 = 0
llama_model_loader: - kv 32: split.count u16 = 2
llama_model_loader: - kv 33: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3515911169
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello *
llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second)
llama_perf_context_print: load time = 12423.68 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 43.15 ms / 1 runs ( 43.15 ms per token, 23.18 tokens per second)
llama_perf_context_print: total time = 62.68 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 13.032265401s
Run #3 status: 0
→ Avg over 3 runs: 15.930s
@@ -1,167 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: split.no u16 = 0
llama_model_loader: - kv 32: split.count u16 = 2
llama_model_loader: - kv 33: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4057380724
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello this
llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 37037.04 tokens per second)
llama_perf_context_print: load time = 21106.31 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 43.24 ms / 1 runs ( 43.24 ms per token, 23.13 tokens per second)
llama_perf_context_print: total time = 62.41 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 21.852416396s
Run #3 status: 0
→ Avg over 3 runs: 22.669s
@@ -1,165 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: split.no u16 = 0
llama_model_loader: - kv 32: split.count u16 = 2
llama_model_loader: - kv 33: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 57666.30 MiB
load_tensors: Vulkan_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 12.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 157667903
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello and
llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 24390.24 tokens per second)
llama_perf_context_print: load time = 10008.37 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 128.73 ms / 1 runs ( 128.73 ms per token, 7.77 tokens per second)
llama_perf_context_print: total time = 155.88 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 10.759732568s
Run #3 status: 0
→ Avg over 3 runs: 12.935s
@@ -1,165 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from /home/kyuz0/models/qwen-3-30B-A3B/BF16/Qwen3-30B-A3B-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: split.no u16 = 0
llama_model_loader: - kv 32: split.count u16 = 2
llama_model_loader: - kv 33: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 57666.30 MiB
load_tensors: Vulkan_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 12.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1118253234
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello -
llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 25316.46 tokens per second)
llama_perf_context_print: load time = 12501.96 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 137.49 ms / 1 runs ( 137.49 ms per token, 7.27 tokens per second)
llama_perf_context_print: total time = 164.69 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 13.022605949s
Run #3 status: 0
→ Avg over 3 runs: 14.761s
@@ -1,176 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: general.file_type u32 = 32
llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 40: split.no u16 = 0
llama_model_loader: - kv 41: split.count u16 = 2
llama_model_loader: - kv 42: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3288748167
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello:
llama_perf_sampler_print: sampling time = 0.05 ms / 2 runs ( 0.03 ms per token, 38461.54 tokens per second)
llama_perf_context_print: load time = 12175.61 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 42.43 ms / 1 runs ( 42.43 ms per token, 23.57 tokens per second)
llama_perf_context_print: total time = 81.77 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 16.099845533s
Run #3 status: 0
→ Avg over 3 runs: 17.779s
@@ -1,176 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: general.file_type u32 = 32
llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 40: split.no u16 = 0
llama_model_loader: - kv 41: split.count u16 = 2
llama_model_loader: - kv 42: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3173540432
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello:
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 35087.72 tokens per second)
llama_perf_context_print: load time = 11733.11 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 42.68 ms / 1 runs ( 42.68 ms per token, 23.43 tokens per second)
llama_perf_context_print: total time = 82.14 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 12.376138939s
Run #3 status: 0
→ Avg over 3 runs: 14.392s
@@ -1,176 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: general.file_type u32 = 32
llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 40: split.no u16 = 0
llama_model_loader: - kv 41: split.count u16 = 2
llama_model_loader: - kv 42: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 57666.30 MiB
load_tensors: ROCm_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 300.75 MiB
llama_context: ROCm_Host compute buffer size = 8.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1388157865
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello:
llama_perf_sampler_print: sampling time = 0.06 ms / 2 runs ( 0.03 ms per token, 36363.64 tokens per second)
llama_perf_context_print: load time = 11788.33 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 43.56 ms / 1 runs ( 43.56 ms per token, 22.95 tokens per second)
llama_perf_context_print: total time = 82.77 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 12.528214562s
Run #3 status: 0
→ Avg over 3 runs: 16.161s
@@ -1,174 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: general.file_type u32 = 32
llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 40: split.no u16 = 0
llama_model_loader: - kv 41: split.count u16 = 2
llama_model_loader: - kv 42: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 57666.30 MiB
load_tensors: Vulkan_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 12.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 243266880
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello:
llama_perf_sampler_print: sampling time = 0.08 ms / 2 runs ( 0.04 ms per token, 26315.79 tokens per second)
llama_perf_context_print: load time = 9973.02 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 130.78 ms / 1 runs ( 130.78 ms per token, 7.65 tokens per second)
llama_perf_context_print: total time = 185.17 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 10.756452016s
Run #3 status: 0
→ Avg over 3 runs: 12.940s
@@ -1,174 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from /home/kyuz0/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 30B-A3B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 5472
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: general.file_type u32 = 32
llama_model_loader: - kv 27: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 29: qwen3moe.expert_shared_feed_forward_length u32 = 0
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot...
llama_model_loader: - kv 40: split.no u16 = 0
llama_model_loader: - kv 41: split.count u16 = 2
llama_model_loader: - kv 42: split.tensors.count i32 = 579
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type bf16: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 56.89 GiB (16.01 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5472
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 57666.30 MiB
load_tensors: Vulkan_Host model buffer size = 593.50 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 384.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 4096 cells, 48 layers, 1/ 1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 304.75 MiB
llama_context: Vulkan_Host compute buffer size = 12.01 MiB
llama_context: graph nodes = 3079
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 2350977163
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 0
Hello:
llama_perf_sampler_print: sampling time = 0.07 ms / 2 runs ( 0.04 ms per token, 27027.03 tokens per second)
llama_perf_context_print: load time = 13008.56 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 140.05 ms / 1 runs ( 140.05 ms per token, 7.14 tokens per second)
llama_perf_context_print: total time = 194.09 ms / 2 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 13.570267879s
Run #3 status: 0
→ Avg over 3 runs: 14.021s
@@ -1,165 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 7
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 311 tensors
llama_model_loader: - type bf16: 26 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 13.40 GiB (9.78 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 13721.20 MiB
load_tensors: ROCm_Host model buffer size = 1920.47 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB
llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 519.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2025
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3471752321
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello**
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second)
llama_perf_context_print: load time = 2510.88 ms
llama_perf_context_print: prompt eval time = 74.99 ms / 2 tokens ( 37.49 ms per token, 26.67 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 79.74 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 6.594391168s
Run #3 status: 0
→ Avg over 3 runs: 6.686s
@@ -1,165 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 7
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 311 tensors
llama_model_loader: - type bf16: 26 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 13.40 GiB (9.78 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 13721.20 MiB
load_tensors: ROCm_Host model buffer size = 1920.47 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB
llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 519.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2025
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 854716185
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
HelloWhat
llama_perf_sampler_print: sampling time = 0.14 ms / 3 runs ( 0.05 ms per token, 21428.57 tokens per second)
llama_perf_context_print: load time = 2695.72 ms
llama_perf_context_print: prompt eval time = 75.18 ms / 2 tokens ( 37.59 ms per token, 26.60 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 82.57 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 3.208919123s
Run #3 status: 0
→ Avg over 3 runs: 3.434s
@@ -1,165 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 7
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 311 tensors
llama_model_loader: - type bf16: 26 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 13.40 GiB (9.78 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: ROCm0 model buffer size = 13721.20 MiB
load_tensors: ROCm_Host model buffer size = 1920.47 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 480.00 MiB
llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 519.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2025
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 754281730
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
HelloThe
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 32608.70 tokens per second)
llama_perf_context_print: load time = 3090.57 ms
llama_perf_context_print: prompt eval time = 75.62 ms / 2 tokens ( 37.81 ms per token, 26.45 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 81.49 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 3.616272374s
Run #3 status: 0
→ Avg over 3 runs: 3.861s
@@ -1,163 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 7
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 311 tensors
llama_model_loader: - type bf16: 26 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 13.40 GiB (9.78 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 13721.12 MiB
load_tensors: Vulkan_Host model buffer size = 1920.47 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 480.00 MiB
llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 519.62 MiB
llama_context: Vulkan_Host compute buffer size = 18.51 MiB
llama_context: graph nodes = 2025
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 356896032
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.12 ms / 3 runs ( 0.04 ms per token, 24390.24 tokens per second)
llama_perf_context_print: load time = 3459.76 ms
llama_perf_context_print: prompt eval time = 90.54 ms / 2 tokens ( 45.27 ms per token, 22.09 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 98.48 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 3.933674345s
Run #3 status: 0
→ Avg over 3 runs: 3.955s
@@ -1,163 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from /home/kyuz0/models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 7
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 311 tensors
llama_model_loader: - type bf16: 26 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 13.40 GiB (9.78 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 13721.12 MiB
load_tensors: Vulkan_Host model buffer size = 1920.47 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: size = 256.00 MiB ( 4096 cells, 8 layers, 1/ 1 seqs), K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 480.00 MiB
llama_kv_cache_unified: size = 480.00 MiB ( 1536 cells, 40 layers, 1/ 1 seqs), K (f16): 240.00 MiB, V (f16): 240.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 519.62 MiB
llama_context: Vulkan_Host compute buffer size = 18.51 MiB
llama_context: graph nodes = 2025
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3541901199
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
HelloI
llama_perf_sampler_print: sampling time = 0.12 ms / 3 runs ( 0.04 ms per token, 24590.16 tokens per second)
llama_perf_context_print: load time = 3946.08 ms
llama_perf_context_print: prompt eval time = 78.51 ms / 2 tokens ( 39.26 ms per token, 25.47 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 86.43 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 4.313578800s
Run #3 status: 0
→ Avg over 3 runs: 4.295s
@@ -1,164 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 32
llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.count u16 = 2
llama_model_loader: - kv 38: split.tensors.count i32 = 808
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type bf16: 435 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 50.31 GiB (16.00 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Gemma-3-27B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: ROCm0 model buffer size = 51518.82 MiB
load_tensors: ROCm_Host model buffer size = 2688.66 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB
llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB
llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 522.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2613
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 204092650
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello
llama_perf_sampler_print: sampling time = 0.08 ms / 3 runs ( 0.03 ms per token, 39473.68 tokens per second)
llama_perf_context_print: load time = 7815.59 ms
llama_perf_context_print: prompt eval time = 253.33 ms / 2 tokens ( 126.66 ms per token, 7.89 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 258.00 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 11.830337249s
Run #3 status: 0
→ Avg over 3 runs: 12.495s
@@ -1,164 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 32
llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.count u16 = 2
llama_model_loader: - kv 38: split.tensors.count i32 = 808
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type bf16: 435 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 50.31 GiB (16.00 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Gemma-3-27B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: ROCm0 model buffer size = 51518.82 MiB
load_tensors: ROCm_Host model buffer size = 2688.66 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB
llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB
llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 522.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2613
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 88592582
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second)
llama_perf_context_print: load time = 10385.57 ms
llama_perf_context_print: prompt eval time = 253.71 ms / 2 tokens ( 126.85 ms per token, 7.88 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 259.35 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 11.144656718s
Run #3 status: 0
→ Avg over 3 runs: 10.486s
@@ -1,164 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 32
llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.count u16 = 2
llama_model_loader: - kv 38: split.tensors.count i32 = 808
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type bf16: 435 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 50.31 GiB (16.00 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Gemma-3-27B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: ROCm0 model buffer size = 51518.82 MiB
load_tensors: ROCm_Host model buffer size = 2688.66 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 320.00 MiB
llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: ROCm0 KV buffer size = 624.00 MiB
llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 522.62 MiB
llama_context: ROCm_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 2613
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1422263455
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.09 ms / 3 runs ( 0.03 ms per token, 35294.12 tokens per second)
llama_perf_context_print: load time = 9620.16 ms
llama_perf_context_print: prompt eval time = 256.55 ms / 2 tokens ( 128.27 ms per token, 7.80 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 261.63 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 10.587027979s
Run #3 status: 0
→ Avg over 3 runs: 10.417s
@@ -1,113 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 32
llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.count u16 = 2
llama_model_loader: - kv 38: split.tensors.count i32 = 808
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type bf16: 435 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 50.31 GiB (16.00 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Gemma-3-27B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
ggml_vulkan: Device memory allocation of size 2819260416 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2819260416
llama_model_load: error loading model: unable to allocate Vulkan0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf'
main: error: unable to load model
Elapsed #3: .416644024s
Run #3 status: 1
✖ run #3 failed
→ No successful runs
@@ -1,162 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 808 tensors from /home/kyuz0/models/gemma-3-27b-it-BF16/gemma-3-27b-it-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-27B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-27B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 27B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 32
llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 20: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 21: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 35: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 36: split.no u16 = 0
llama_model_loader: - kv 37: split.count u16 = 2
llama_model_loader: - kv 38: split.tensors.count i32 = 808
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type bf16: 435 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 50.31 GiB (16.00 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Gemma-3-27B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: Vulkan0 model buffer size = 51518.82 MiB
load_tensors: Vulkan_Host model buffer size = 2688.66 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 320.00 MiB
llama_kv_cache_unified: size = 320.00 MiB ( 4096 cells, 10 layers, 1/ 1 seqs), K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 624.00 MiB
llama_kv_cache_unified: size = 624.00 MiB ( 1536 cells, 52 layers, 1/ 1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 522.62 MiB
llama_context: Vulkan_Host compute buffer size = 21.51 MiB
llama_context: graph nodes = 2613
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 4215263583
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.18 ms / 3 runs ( 0.06 ms per token, 16666.67 tokens per second)
llama_perf_context_print: load time = 14451.51 ms
llama_perf_context_print: prompt eval time = 257.32 ms / 2 tokens ( 128.66 ms per token, 7.77 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 265.56 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 15.024330058s
Run #3 status: 0
→ Avg over 3 runs: 13.579s
@@ -1,159 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (Radeon 8060S Graphics) - 124522 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv 3: general.version str = 2024-12
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv 14: llama.block_count u32 = 80
llama_model_loader: - kv 15: llama.context_length u32 = 131072
llama_model_loader: - kv 16: llama.embedding_length u32 = 8192
llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 18: llama.attention.head_count u32 = 64
llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: llama.attention.key_length u32 = 128
llama_model_loader: - kv 23: llama.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 15
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama 3.1 70B Instruct 2024 12
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: CPU model buffer size = 563.62 MiB
load_tensors: ROCm0 model buffer size = 39979.48 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1295757489
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 61224.49 tokens per second)
llama_perf_context_print: load time = 5592.62 ms
llama_perf_context_print: prompt eval time = 248.28 ms / 2 tokens ( 124.14 ms per token, 8.06 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 263.25 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 9.635053314s
Run #3 status: 0
→ Avg over 3 runs: 9.887s
@@ -1,159 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv 3: general.version str = 2024-12
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv 14: llama.block_count u32 = 80
llama_model_loader: - kv 15: llama.context_length u32 = 131072
llama_model_loader: - kv 16: llama.embedding_length u32 = 8192
llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 18: llama.attention.head_count u32 = 64
llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: llama.attention.key_length u32 = 128
llama_model_loader: - kv 23: llama.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 15
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama 3.1 70B Instruct 2024 12
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: CPU model buffer size = 563.62 MiB
load_tensors: ROCm0 model buffer size = 39979.48 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3791928713
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello.
llama_perf_sampler_print: sampling time = 0.05 ms / 3 runs ( 0.02 ms per token, 57692.31 tokens per second)
llama_perf_context_print: load time = 6133.42 ms
llama_perf_context_print: prompt eval time = 247.67 ms / 2 tokens ( 123.83 ms per token, 8.08 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 268.37 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 6.904239282s
Run #3 status: 0
→ Avg over 3 runs: 9.338s
@@ -1,159 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6066 (4cb208c9) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 124523 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv 3: general.version str = 2024-12
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv 14: llama.block_count u32 = 80
llama_model_loader: - kv 15: llama.context_length u32 = 131072
llama_model_loader: - kv 16: llama.embedding_length u32 = 8192
llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 18: llama.attention.head_count u32 = 64
llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: llama.attention.key_length u32 = 128
llama_model_loader: - kv 23: llama.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 15
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama 3.1 70B Instruct 2024 12
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: CPU model buffer size = 563.62 MiB
load_tensors: ROCm0 model buffer size = 39979.48 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: ROCm0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: ROCm0 compute buffer size = 266.50 MiB
llama_context: ROCm_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 59935472
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello.
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 46153.85 tokens per second)
llama_perf_context_print: load time = 12737.72 ms
llama_perf_context_print: prompt eval time = 291.99 ms / 2 tokens ( 145.99 ms per token, 6.85 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 306.96 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 13.680764475s
Run #3 status: 0
→ Avg over 3 runs: 14.602s
@@ -1,157 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 6060 (9c35706b) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics) - 85720 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv 3: general.version str = 2024-12
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv 14: llama.block_count u32 = 80
llama_model_loader: - kv 15: llama.context_length u32 = 131072
llama_model_loader: - kv 16: llama.embedding_length u32 = 8192
llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 18: llama.attention.head_count u32 = 64
llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: llama.attention.key_length u32 = 128
llama_model_loader: - kv 23: llama.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 15
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama 3.1 70B Instruct 2024 12
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Vulkan0 model buffer size = 39979.48 MiB
load_tensors: CPU model buffer size = 563.62 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 266.50 MiB
llama_context: Vulkan_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1976378490
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello,
llama_perf_sampler_print: sampling time = 0.08 ms / 3 runs ( 0.03 ms per token, 36585.37 tokens per second)
llama_perf_context_print: load time = 6987.06 ms
llama_perf_context_print: prompt eval time = 210.77 ms / 2 tokens ( 105.39 ms per token, 9.49 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 232.45 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 7.786884955s
Run #3 status: 0
→ Avg over 3 runs: 9.176s
@@ -1,157 +0,0 @@
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
build: 6040 (66625a59) with cc (GCC) 15.1.1 20250719 (Red Hat 15.1.1-5) for x86_64-redhat-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 87722 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /home/kyuz0/models/llama-3.3-Q4_K_M/llama3.3-70.6B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct 2024 12
llama_model_loader: - kv 3: general.version str = 2024-12
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,5] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 13: general.languages arr[str,7] = ["fr", "it", "pt", "hi", "es", "th", ...
llama_model_loader: - kv 14: llama.block_count u32 = 80
llama_model_loader: - kv 15: llama.context_length u32 = 131072
llama_model_loader: - kv 16: llama.embedding_length u32 = 8192
llama_model_loader: - kv 17: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 18: llama.attention.head_count u32 = 64
llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: llama.attention.key_length u32 = 128
llama_model_loader: - kv 23: llama.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 15
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = Llama 3.1 70B Instruct 2024 12
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Vulkan0 model buffer size = 39979.48 MiB
load_tensors: CPU model buffer size = 563.62 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1/ 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: Vulkan0 compute buffer size = 266.50 MiB
llama_context: Vulkan_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2647
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 2613669910
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 1, n_keep = 1
Hello's
llama_perf_sampler_print: sampling time = 0.07 ms / 3 runs ( 0.02 ms per token, 40540.54 tokens per second)
llama_perf_context_print: load time = 8119.06 ms
llama_perf_context_print: prompt eval time = 204.01 ms / 2 tokens ( 102.01 ms per token, 9.80 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 225.18 ms / 3 tokens
llama_perf_context_print: graphs reused = 0
Elapsed #3: 8.699816033s
Run #3 status: 0
→ Avg over 3 runs: 8.816s
-71
View File
@@ -1,71 +0,0 @@
#!/usr/bin/env python3
"""
Parse the console output of run_loadtime_benchmarks.sh stored in run_loadtime_benchmarks.log,
then produce a Markdown table of average load+inference times per model/env.
"""
import re
from collections import defaultdict, OrderedDict
import sys
LOGFILE = 'run_loadtime_benchmark.log'
# Define expected environments in desired column order
ENV_ORDER = ['vulkan_radv','vulkan_amdvlk','rocm6_4_2','rocm7_beta','rocm7_rc']
# Regex patterns
ENTRY_RE = re.compile(r"\[(?P<env>[^]]+)\] (?P<model>[^ ]+) avg=(?P<avg>[0-9.]+)s over (?P<n>[0-9]+) runs")
FAIL_RE = re.compile(r"\[(?P<env>[^]]+)\] (?P<model>[^ ]+) all runs failed")
# Data containers
results = defaultdict(lambda: {}) # results[model][env] = float or 'ERR'
# Read and parse log
with open(LOGFILE) as f:
for line in f:
line = line.strip()
m = ENTRY_RE.match(line)
if m:
env = m.group('env')
model = m.group('model')
avg = float(m.group('avg'))
results[model][env] = avg
continue
m2 = FAIL_RE.match(line)
if m2:
env = m2.group('env')
model = m2.group('model')
results[model][env] = None # indicate failure
# Compute winner per model: smallest time
md_lines = []
# Header
header = ['Model'] + [e.replace('_',' ').title() for e in ENV_ORDER] + ['Fastest']
md_lines.append('| ' + ' | '.join(header) + ' |')
md_lines.append('|' + '|'.join(['---']*len(header)) + '|')
for model in sorted(results, key=lambda s: s.lower()):
row = [f"**{model}**"]
env_times = results[model]
# find fastest
valid = {e:env_times[e] for e in ENV_ORDER if e in env_times and env_times[e] is not None}
if valid:
best_env = min(valid, key=lambda k: valid[k])
fastest = f"🏆 **{best_env}**"
else:
fastest = ''
for env in ENV_ORDER:
if env not in env_times:
cell = ''
else:
t = env_times[env]
if t is None:
cell = '⚠️ Fail'
else:
cell = f"{t:.2f}s"
row.append(cell)
row.append(fastest)
md_lines.append('| ' + ' | '.join(row) + ' |')
# Print markdown
table = '\n'.join(md_lines)
print(table)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Memory access fault by GPU node-1 (Agent handle: 0x275a2540) on address 0x7f3fb2c08000. Reason: Page not present or supervisor privilege.
✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x25d19540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 0 | pp512 | 131.14 ± 0.28 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 0 | tg128 | 20.15 ± 0.01 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 1 | 0 | pp512 | 104.12 ± 0.05 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | ROCm | 99 | 1 | 0 | tg128 | 20.35 ± 0.00 |
build: de219279 (6181)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x3e28b540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Memory access fault by GPU node-1 (Agent handle: 0x2bdf8540) on address 0x7f5f95e35000. Reason: Page not present or supervisor privilege.
✖ ! [rocm6_4_2-rocwmma] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x3ff2d540) reason :GPU Hang
✖ ! [rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x3bb3540) reason :GPU Hang
✖ ! [rocm6_4_2] GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x33b8a540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x20e35540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x1b1ea540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-3.3-70B-Instruct-UD-Q8_K_XL-00001-of-00002 failed (exit 134)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| llama 70B Q8_0 | 75.65 GiB | 70.55 B | ROCm | 99 | 1 | 0 | pp512 | 16.16 ± 0.02 |
| llama 70B Q8_0 | 75.65 GiB | 70.55 B | ROCm | 99 | 1 | 0 | tg128 | 2.78 ± 0.00 |
build: de219279 (6181)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x344ea540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0xe316540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x17ade540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0xe91f540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q6_K-00001-of-00002 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x1019d540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x2ff5c540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x3db80540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x24a4c540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Memory access fault by GPU node-1 (Agent handle: 0x3e5ce540) on address 0x7f64d3b76000. Reason: Page not present or supervisor privilege.
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x1239e540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x101f4540) reason :GPU Hang
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Memory access fault by GPU node-1 (Agent handle: 0x15f12540) on address 0x7ef17d976000. Reason: Page not present or supervisor privilege.
✖ ! [rocm6_4_2] Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x2f5d1540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0xdc93540) reason :GPU Hang
✖ ! [rocm6_4_2-rocwmma] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0xff7540) reason :GPU Hang
✖ ! [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 failed (exit 134)
@@ -1,6 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
HW Exception by GPU node-1 (Agent handle: 0x2607e540) reason :GPU Hang
✖ ! [rocm6_4_2] Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003 __fa1 failed (exit 134)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 157.75 ± 2.58 |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 24.62 ± 0.00 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 161.90 ± 3.05 |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 24.09 ± 0.02 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 157.81 ± 2.51 |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 24.61 ± 0.01 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 140.24 ± 1.86 |
| qwen3moe 30B.A3B BF16 | 56.89 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 24.46 ± 0.02 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 387.23 ± 0.82 |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 50.64 ± 0.01 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 411.72 ± 1.04 |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 48.78 ± 0.00 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | pp512 | 387.86 ± 1.41 |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 0 | tg128 | 50.65 ± 0.01 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | pp512 | 301.23 ± 0.49 |
| qwen3moe 30B.A3B Q6_K | 24.53 GiB | 30.53 B | ROCm | 99 | 1 | 0 | tg128 | 50.07 ± 0.02 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | pp512 | 222.91 ± 0.21 |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | tg128 | 14.03 ± 0.00 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | pp512 | 229.15 ± 0.24 |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | tg128 | 13.76 ± 0.00 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | pp512 | 222.59 ± 0.24 |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 0 | tg128 | 14.03 ± 0.00 |
build: de219279 (6181)
@@ -1,10 +0,0 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | pp512 | 197.89 ± 3.40 |
| gemma3 12B Q8_0 | 13.40 GiB | 11.77 B | ROCm | 99 | 1 | 0 | tg128 | 13.76 ± 0.00 |
build: de219279 (6181)

Some files were not shown because too many files have changed in this diff Show More