Donato Capitella 995ad2cd38 Updated benchmarks
2025-08-09 11:50:27 +01:00

AMD Strix Halo Llama.cpp Toolboxes

This project provides pre-built containers (“toolboxes”) for running LLMs on AMD Ryzen AI Max “Strix Halo” integrated GPUs. Toolbx is the standard developer container system in Fedora (and now works on Ubuntu, openSUSE, Arch, etc).

Watch the YouTube Video


Why Toolbx?

  • Reproducible: never pollute your host system
  • Seamless: shares your home and GPU devices, works like a native shell
  • Flexible: easy to switch between Vulkan (open/closed drivers) and ROCm

Table of Contents

  1. Llama.cpp Compiled for Every Backend
    1.1 Supported Container Images
  2. Quickest Usage Example
    2.1 Creating the toolboxes with GPU access
    2.2 Running models inside the toolboxes
    2.3 Downloading GGUF Models from HuggingFace
  3. Performance Benchmarks (Key Results)
  4. Memory Planning & VRAM Estimator
  5. Building Containers Locally
  6. Host Configuration
    6.1 Test Configuration
    6.2 Kernel Parameters (tested on Fedora 42)
    6.3 Ubuntu 24.04
  7. More Documentation
  8. References

1. Llama.cpp Compiled for Every Backend

This project uses Llama.cpp, a high-performance inference engine for running local LLMs (large language models) on CPUs and GPUs. Llama.cpp is open source, extremely fast, and is the only engine supporting all key backends for AMD Strix Halo: Vulkan (RADV, AMDVLK) and ROCm/HIP.

  • Vulkan is a cross-platform, low-level graphics and compute API. Llama.cpp can use Vulkan for GPU inference with either the open Mesa RADV driver or AMD's "official" open AMDVLK driver. This is currently the most stable and best-supported option on Strix Halo.
  • ROCm is AMD's open-source answer to CUDA: a GPU compute stack for machine learning and HPC. With ROCm, Llama.cpp runs on AMD GPUs much as it runs with CUDA on NVIDIA. It is less stable and mature than Vulkan, but it has been improving recently.

1.1 Supported Container Images

| Container Tag | Backend/Stack | Purpose / Notes |
|---|---|---|
| vulkan-amdvlk | Vulkan (AMDVLK) | Fastest backend (AMD open-source driver). ≤2 GiB single buffer allocation limit, so some large models won't load. |
| vulkan-radv | Vulkan (Mesa RADV) | Most stable and compatible. Recommended for most users and all models. |
| rocm-6.4.2 | ROCm 6.4.2 (HIP) | Latest stable ROCm. Great for BF16 models. Occasional crashes possible. |
| rocm-6.4.2-rocwmma | ROCm 6.4.2 (HIP) + ROCWMMA | ROCm with ROCWMMA enabled for improved flash attention on RDNA3+/CDNA. |
| rocm-7beta | ROCm 7.0 Beta (HIP) | Latest ROCm beta. No real gain for Llama.cpp. Same model limits as 6.4.2. |
| rocm-7rc | ROCm 7.0 RC (HIP) | Release candidate for ROCm 7.0. Same behavior as beta. |

You can also check the containers on DockerHub: https://hub.docker.com/r/kyuz0/amd-strix-halo-toolboxes/tags.

These containers are automatically rebuilt whenever the Llama.cpp master branch is updated, ensuring you get the latest bug fixes and new model support. The easiest way to update to the newest versions is by running the refresh-toolboxes.sh script below.

Each container is based on Fedora Rawhide and is built for maximum compatibility and performance on Strix Halo.


2. Quickest Usage Example

2.1 Creating the toolboxes with GPU access

To use Llama.cpp with hardware acceleration inside a toolbox container, you must expose the GPU devices from your host. The exact flags and devices depend on the backend:

  • For Vulkan (RADV/AMDVLK): Only /dev/dri is required. Add the user to the video group for access to GPU devices.

    toolbox create llama-vulkan-radv \
      --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
      -- --device /dev/dri --group-add video --security-opt seccomp=unconfined
    
  • For ROCm: You must expose both /dev/dri and /dev/kfd, and add the user to extra groups for compute access.

    toolbox create llama-rocm-6.4.2 \
      --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \
      -- --device /dev/dri --device /dev/kfd \
      --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
    

Swap in the image/tag for the backend you want to use.

Note:

  • --device /dev/dri provides graphics/video device nodes.
  • --device /dev/kfd is required for ROCm compute.
  • Extra groups (video, render, sudo) may be required for full access to GPU nodes and compute features, especially with ROCm.
  • Use --security-opt seccomp=unconfined to avoid seccomp sandbox issues (needed for some GPU syscalls).
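
Before creating a toolbox, it can help to confirm that the device nodes named above actually exist on the host. The following is a minimal sketch; check_gpu_nodes is a hypothetical helper introduced here for illustration, not something shipped with this project.

```shell
# check_gpu_nodes PATH...: print "missing: PATH" for every device node that
# does not exist, and return non-zero if at least one is missing.
# (Hypothetical helper, not part of this project.)
check_gpu_nodes() {
  local rc=0 node
  for node in "$@"; do
    if [ ! -e "$node" ]; then
      echo "missing: $node"
      rc=1
    fi
  done
  return "$rc"
}

# Vulkan toolboxes need only /dev/dri; ROCm toolboxes also need /dev/kfd:
#   check_gpu_nodes /dev/dri
#   check_gpu_nodes /dev/dri /dev/kfd
# To see which groups the current user already belongs to:
#   id -nG
```

If a node is reported missing, the amdgpu kernel driver is probably not loaded on the host, and passing --device for that node will not help.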

2.1.1 Toolbox Refresh Script (Automatic Updates)

To pull the latest container images and recreate toolboxes cleanly, use the provided script:

📦 refresh-toolboxes.sh

./refresh-toolboxes.sh all

This will:

  1. Delete existing toolboxes (if any)
  2. Pull the latest images from DockerHub
  3. Recreate each toolbox with correct GPU access flags

You can also refresh just one or more toolboxes:

./refresh-toolboxes.sh llama-vulkan-amdvlk llama-rocm-6.4.2
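
For readers who prefer to run the three refresh steps by hand, the sketch below prints the equivalent commands for one toolbox without executing them. It is not the contents of refresh-toolboxes.sh; refresh_plan and IMAGE_REPO are names made up for this example, and using podman for the pull step is an assumption.

```shell
IMAGE_REPO="docker.io/kyuz0/amd-strix-halo-toolboxes"

# refresh_plan NAME TAG [CREATE_FLAGS...]: print (do not run) the commands
# that delete, re-pull and recreate a single toolbox.
# (Hypothetical helper; the real script may work differently.)
refresh_plan() {
  local name=$1 tag=$2
  shift 2
  echo "toolbox rm --force $name"
  echo "podman pull $IMAGE_REPO:$tag"
  echo "toolbox create $name --image $IMAGE_REPO:$tag -- $*"
}

# Example: the Vulkan RADV toolbox with the flags from section 2.1.
refresh_plan llama-vulkan-radv vulkan-radv \
  --device /dev/dri --group-add video --security-opt seccomp=unconfined
```

Printing the plan first makes it easy to review the commands before pasting them into a shell.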

2.2 Running models inside the toolboxes

Before running any commands, you must first enter your toolbox container shell using:

toolbox enter llama-vulkan-radv

This will drop you into a shell inside the toolbox, using your regular user account. The container shares your host home directory—so anything in your home is directly accessible (take care: your files are exposed and writable inside the toolbox!).

Once inside, the following commands show how to run local LLMs:

  • llama-cli --list-devices: lists the GPU devices available to Llama.cpp.
  • llama-cli --no-mmap --ngl 999 -fa -m <model>: runs inference on the specified model, with all layers on the GPU and flash attention enabled (replace <model> with your model path).

2.3 Downloading GGUF Models from HuggingFace

Most Llama.cpp-compatible models are on HuggingFace. Filter for GGUF format, and try to pick Unsloth quantizations—they work great and are actively updated: https://huggingface.co/unsloth.

Download using the Hugging Face CLI. For example, to get the first shard of Qwen3 Coder 30B BF16 (https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF):

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  --local-dir models/qwen3-coder-30B-A3B/

HF_HUB_ENABLE_HF_TRANSFER=1 enables hf_transfer, a Rust-based package that speeds up downloads (install it from PyPI).
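
Setting the variable without the hf_transfer package installed is expected to make the download fail, so a small guard can be useful. The sketch below is an illustration only: hf_transfer_flag is a hypothetical helper, and the commented commands assume pip and python3 are available inside the toolbox.

```shell
# hf_transfer_flag: print 1 if the hf_transfer Python module can be imported,
# otherwise print 0. (Hypothetical helper, not part of huggingface_hub.)
hf_transfer_flag() {
  if python3 -c 'import hf_transfer' >/dev/null 2>&1; then
    echo 1
  else
    echo 0
  fi
}

# Install the CLI and the accelerator once:
#   pip install -U "huggingface_hub[cli]" hf_transfer
#
# Then fetch every BF16 shard, not just the first one:
#   HF_HUB_ENABLE_HF_TRANSFER=$(hf_transfer_flag) huggingface-cli download \
#     unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF --include "BF16/*" \
#     --local-dir models/qwen3-coder-30B-A3B/
```

Llama.cpp needs all shards of a split GGUF in the same directory, which is why the example uses an --include pattern instead of naming a single file.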

3. Performance Benchmarks (Key Results)

🔍 Key Findings from Benchmarks

Representative LLMs were tested on AMD Ryzen AI Max “Strix Halo” across all supported backends, using identical model builds in Llama.cpp.

PP = prompt processing (tokens/sec prefill), TG = token generation (tokens/sec interactive).

| Model | Vulkan (AMDVLK) | Vulkan (RADV) | ROCm 6.4.2 | ROCm 6.4.2 + ROCWMMA | ROCm 7.0 Beta | ROCm 7.0 RC | 🏆 Best PP | 🏆 Best TG |
|---|---|---|---|---|---|---|---|---|
| Gemma3 12B Q8_0 | 677 pp / 14.0 tg | 503 pp / 13.8 tg | 223 pp / 13.8 tg | 230 pp / 13.9 tg | 223 pp / 13.9 tg | 222 pp / 13.9 tg | 🏆 AMDVLK | 🏆 AMDVLK |
| Gemma3 27B BF16 | ⚠️ Load Error | 139 pp / 4.0 tg | 84 pp / 4.0 tg | 95 pp / 4.0 tg | 92 pp / 4.0 tg | 83 pp / 4.0 tg | 🏆 RADV | 🏆 ROCm6.4.2+ROCWMMA |
| Llama-4-Scout 17B Q8_0 | 260 pp / 12.2 tg | 172 pp / 12.3 tg | 135 pp / 11.6 tg | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 AMDVLK | 🏆 RADV |
| Llama-4-Scout 17B Q4_K XL | 221 pp / 20.0 tg | 155 pp / 20.0 tg | 138 pp / 17.4 tg | ⚠️ GPU Hang | 139 pp / 17.6 tg | 124 pp / 17.6 tg | 🏆 AMDVLK | 🏆 AMDVLK |
| Qwen3 30B BF16 | 108 pp / 8.0 tg | 87 pp / 7.4 tg | 158 pp / 24.3 tg | 162 pp / 24.5 tg | 153 pp / 24.5 tg | 152 pp / 24.6 tg | 🏆 ROCm6.4.2+ROCWMMA | 🏆 ROCm7 RC |
| Qwen3-235B Q3_K XL | 116 pp / 16.0 tg | 67 pp / 16.8 tg | 74 pp / 13.7 tg | ⚠️ GPU Hang | ⚠️ GPU Hang | ⚠️ Runtime Error | 🏆 AMDVLK | 🏆 RADV |
| GLM-4.5-Air-Q4_K_XL | 202 pp / 22.8 tg | 133 pp / 23.3 tg | 130 pp / 19.4 tg | ⚠️ GPU Hang | ⚠️ GPU Hang | 130 pp / 20.1 tg | 🏆 AMDVLK | 🏆 RADV |
| GLM-4.5-Air-Q6_K_XL | 225 pp / 16.5 tg | 132 pp / 17.0 tg | 125 pp / 15.3 tg | 114 pp / 15.5 tg | 121 pp / 15.5 tg | 124 pp / 15.5 tg | 🏆 AMDVLK | 🏆 RADV |
| gpt-oss-120b-mxfp4 | 546 pp / 48.1 tg | 255 pp / 49.0 tg | 353 pp / 44.1 tg | 408 pp / 45.0 tg | 355 pp / 45.0 tg | 353 pp / 45.1 tg | 🏆 AMDVLK | 🏆 RADV |
| gpt-oss-20b-mxfp4 | 1473 pp / 68.8 tg | 728 pp / 69.9 tg | 583 pp / 64.5 tg | 649 pp / 64.5 tg | 584 pp / 64.4 tg | 582 pp / 64.5 tg | 🏆 AMDVLK | 🏆 RADV |

Observations:

  • AMDVLK (Vulkan) delivers the highest prompt processing speeds for most models, but is limited by ≤2 GiB single-buffer allocation and may fail to load some models.
  • RADV (Vulkan) is the most stable and compatible backend; typically slower than AMDVLK in PP but often competitive in TG.
  • ROCm 6.4.2 + ROCWMMA excels in BF16 workloads and can outperform Vulkan in certain cases, though ROCm stability issues remain.
  • ROCm 7.0 Beta/RC show similar performance to 6.4.2 without consistent gains.

📄 Full per-model analysis: docs/benchmarks.md
🌐 Interactive exploration: Live Benchmark Viewer

4. Memory Planning & VRAM Estimator

Running large language models locally requires estimating total VRAM required—not just for the model weights, but also for the "context" (number of active tokens) and extra overhead.

Use gguf-vram-estimator.py to check exactly how much memory you need for a given .gguf model and target context length. Example output:

$ gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf --contexts 4096 32768 1048576

--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
Max Context: 10,485,760 tokens
Model Size: 57.74 GiB
Incl. Overhead: 2.00 GiB

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
         4,096 |       1.88 GiB  |      61.62 GiB
        32,768 |      15.06 GiB  |      74.80 GiB
     1,048,576 |      49.12 GiB  |     108.87 GiB

With Q4_K quantization, Llama-4-Scout 17B can hold a 1M-token context and still fit within a 128 GB system, but processing that much context is extremely slow. At the prompt-processing rates seen in the benchmarks (roughly 200 tokens/sec), a 1M-token prompt may take hours.
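
As a rough worked example of that estimate, the sketch below divides the prompt length by an assumed constant prefill rate. The helper name prefill_seconds is hypothetical, and holding throughput at 200 tokens/sec across the whole context is an assumption; prompt processing usually slows down as the context grows, so the real figure is likely to be higher.

```shell
# prefill_seconds TOKENS TOKENS_PER_SEC: whole seconds needed to process a
# prompt at a constant rate, rounded up. (Hypothetical helper.)
prefill_seconds() {
  echo $(( ($1 + $2 - 1) / $2 ))   # integer ceiling division
}

# 1,048,576 tokens at an assumed 200 tokens/sec:
prefill_seconds 1048576 200   # prints 5243, i.e. roughly 1.5 hours
```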

Contrast: Qwen3-235B Q3_K (quantized, 97GiB model):

$ gguf-vram-estimator.py models/qwen3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf --contexts 65536 131072 262144

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
        65,536 |     11.75 GiB |     110.75 GiB
       131,072 |     23.50 GiB |     122.50 GiB
       262,144 |     47.00 GiB |     146.00 GiB

For Qwen3-235B, 128GB RAM allows you to run with context up to ~130k tokens.

  • The estimator lets you plan ahead and avoid out-of-memory errors when loading or using models.
  • For more examples and a breakdown of VRAM components, see docs/vram-estimator.md.
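
The Qwen3-235B table above can also be turned around to ask how much context fits in a given memory budget. The sketch below is not how gguf-vram-estimator.py works internally; it assumes context memory grows linearly with context length, which matches the Qwen3-235B rows (11.75 GiB for 65,536 tokens is 188 KiB per token) but not the Llama-4-Scout rows. max_context_tokens is a hypothetical helper.

```shell
# max_context_tokens BUDGET_GIB FIXED_GIB KIB_PER_TOKEN
#   BUDGET_GIB    memory you are willing to give the model, in GiB
#   FIXED_GIB     model size plus overhead, in GiB (97 + 2 = 99 for Qwen3-235B Q3_K)
#   KIB_PER_TOKEN context memory per token, in KiB (188, derived from the table)
# Prints the largest context, in tokens, that fits in the budget.
# (Hypothetical helper; assumes linear context memory growth.)
max_context_tokens() {
  local budget_kib=$(( $1 * 1024 * 1024 ))
  local fixed_kib=$(( $2 * 1024 * 1024 ))
  if [ "$budget_kib" -le "$fixed_kib" ]; then
    echo 0
    return
  fi
  echo $(( (budget_kib - fixed_kib) / $3 ))
}

# Leaving 8 GiB of a 128 GiB machine for the host:
max_context_tokens 120 99 188   # prints 117128
```

The ~130k figure quoted above corresponds to the 131,072-token row at 122.50 GiB, which leaves only a few GiB for the rest of the system.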

5. Building Containers Locally

Pre-built toolbox container images are published on Docker Hub for immediate use. If you wish to build the containers yourself (for example, to customize packages or rebuild with a different llama.cpp version), see:

Full instructions: docs/building.md.


6. Host Configuration

This should work on any Strix Halo. For a complete list of available hardware, see: Strix Halo Hardware Database

6.1 Test Configuration

| Test Machine | HP Z2 Mini G1a |
| CPU | Ryzen AI MAX+ 395 "Strix Halo" |
| System Memory | 128 GB RAM |
| GPU Memory | 512 MB allocated in BIOS |
| Host OS | Fedora 42, kernel 6.15.6-200.fc42.x86_64 |

6.2 Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory and optimal performance:

amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432

| Parameter | Purpose |
|---|---|
| amd_iommu=off | Disables IOMMU for lower latency |
| amdgpu.gttsize=131072 | Enables unified GPU/system memory (up to 128 GiB); 131072 MiB ÷ 1024 = 128 GiB |
| ttm.pages_limit=33554432 | Allows large pinned memory allocations; 33554432 × 4 KiB = 134217728 KiB ÷ 1024² = 128 GiB |
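
On a machine with a different amount of memory, the same arithmetic gives the matching values. This is a minimal sketch under the assumptions stated in the table (gttsize in MiB, 4 KiB pages); gtt_params is a hypothetical helper, and whether values below 128 GiB behave well on a given system has not been tested here.

```shell
# gtt_params GIB: print amdgpu.gttsize (in MiB) and ttm.pages_limit (in 4 KiB
# pages) for GIB gibibytes of unified memory. (Hypothetical helper.)
gtt_params() {
  local gib=$1
  local mib=$(( gib * 1024 ))               # GiB -> MiB
  local pages=$(( gib * 1024 * 1024 / 4 ))  # GiB -> KiB -> 4 KiB pages
  echo "amdgpu.gttsize=$mib ttm.pages_limit=$pages"
}

gtt_params 128   # prints amdgpu.gttsize=131072 ttm.pages_limit=33554432
```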

Source: https://www.reddit.com/r/LocalLLaMA/comments/1m9wcdc/comment/n5gf53d/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button.

Apply the changes:

# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
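
After the reboot, it is worth confirming that the parameters reached the running kernel. The sketch below compares a command line string against the expected parameters; missing_params is a hypothetical helper, and reading /proc/cmdline is the only host-specific step.

```shell
# missing_params CMDLINE PARAM...: print every PARAM that does not appear as a
# whole word in CMDLINE. Prints nothing when all are present.
# (Hypothetical helper, not part of this project.)
missing_params() {
  local cmdline=" $1 " p
  shift
  for p in "$@"; do
    case "$cmdline" in
      *" $p "*) ;;          # found as a whole word
      *) echo "$p" ;;       # not found
    esac
  done
}

# On the rebooted host:
#   missing_params "$(cat /proc/cmdline)" \
#     amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```

Empty output means all three parameters are active; any parameter that is printed did not make it into the boot configuration.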

6.3 Ubuntu 24.04

Follow this guide by TechnigmaAI for a working configuration on Ubuntu 24.04:

https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-(Ubuntu-24.04)

7. More Documentation

8. References
