T

dougs f28dee87ef Update README with correction to model download and inference instructions (#54 )

Updated instructions for downloading model files to include both parts of the example GGUF

2026-02-05 11:21:47 +00:00

.github/workflows

chore: deprecate and remove ROCm 7.1.1 toolbox and all associated references.

2026-02-04 17:56:41 +00:00

benchmark

chore: deprecate and remove ROCm 7.1.1 toolbox and all associated references.

2026-02-04 17:56:41 +00:00

docs

updated gpt-oss benchmakrs to test rocm7 performance patch

2026-02-04 17:46:43 +00:00

scripts

chore: deprecate and remove ROCm 7.1.1 toolbox and all associated references.

2026-02-04 17:56:41 +00:00

toolboxes

chore: deprecate and remove ROCm 7.1.1 toolbox and all associated references.

2026-02-04 17:56:41 +00:00

.gitignore

updated benchmarks

2025-11-17 23:02:56 +00:00

README.md

Update README with correction to model download and inference instructions (#54 )

2026-02-05 11:21:47 +00:00

refresh-toolboxes.sh

chore: deprecate and remove ROCm 7.1.1 toolbox and all associated references.

2026-02-04 17:56:41 +00:00

README.md

AMD Strix Halo Llama.cpp Toolboxes

This project provides pre-built containers (“toolboxes”) for running LLMs on AMD Ryzen AI Max “Strix Halo” integrated GPUs. Toolbx is the standard developer container system in Fedora (and now works on Ubuntu, openSUSE, Arch, etc).

📺 Video Demo

Stable Configuration
ROCm 7 Performance Regression Workaround
Supported Toolboxes
Quick Start
Host Configuration
Performance Benchmarks
Memory Planning & VRAM Estimator
Building Locally
Distributed Inference
More Documentation
References

✅ Stable Configuration

OS: Fedora 42/43
Linux Kernel: 6.18.6-200
Linux Firmware: 20260110

This is currently the most stable setup. Kernels older than 6.18.4 have a bug that causes stability issues on gfx1151 and should be avoided. Also, do NOT use linux-firmware-20251125. It breaks ROCm support on Strix Halo (instability/crashes).

⚠️ Important: See Host Configuration for critical kernel parameters.

✅ ROCm 7 Performance Regression Workaround Applied — 2026-02-04

The performance regression previously observed in ROCm 7+ builds (compared to ROCm 6.4.4) has been resolved in the toolboxes via a workaround.

The issue was caused by a compiler regression (llvm/llvm-project#147700) affecting loop unrolling thresholds. We have applied the workaround (-mllvm --amdgpu-unroll-threshold-local=600) in the latest toolbox builds, restoring full performance.

This workaround will be removed once the upstream fix lands. For details, see the issue: kyuz0/amd-strix-halo-toolboxes#45

📦 Supported Toolboxes

You can check the containers on DockerHub: kyuz0/amd-strix-halo-toolboxes.

Container Tag	Backend/Stack	Purpose / Notes
`vulkan-amdvlk`	Vulkan (AMDVLK)	Fastest backend—AMD open-source driver. ≤2 GiB single buffer allocation limit, some large models won't load.
`vulkan-radv`	Vulkan (Mesa RADV)	Most stable and compatible. Recommended for most users and all models.
`rocm-6.4.4`	ROCm 6.4.4 (Fedora 43)	Latest stable 6.x build. Uses Fedora 43 packages with backported patch for kernel 6.18.4+ support.
`rocm-7.2`	ROCm 7.2	Latest stable 7.x build. Includes patch for kernel 6.18.4+ support.
`rocm7-nightlies`	ROCm 7 Nightly	Tracks nightly builds. Includes patch for kernel 6.18.4+ support.

These containers are automatically rebuilt whenever the Llama.cpp master branch is updated. Legacy images (rocm-6.4.2, rocm-6.4.3, rocm-7.1.1) are excluded from this list.

🚀 Quick Start

1. Create & Enter Toolbox

Option A: Vulkan (RADV/AMDVLK) - best for compatibility

toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

toolbox enter llama-vulkan-radv

Option B: ROCm (Recommended for Performance)

toolbox create llama-rocm-7.2 \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2 \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined

toolbox enter llama-rocm-7.2

(Ubuntu users: use Distrobox as toolbox may break GPU access).

2. Check GPU Access

Inside the toolbox:

llama-cli --list-devices

3. Download Model

Example: Qwen3 Coder 30B (BF16)

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  --local-dir models/qwen3-coder-30B-A3B/

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00002-of-00002.gguf \
  --local-dir models/qwen3-coder-30B-A3B/

4. Run Inference

⚠️ IMPORTANT: Always use flash attention (-fa 1) and no-mmap (--no-mmap) on Strix Halo to avoid crashes/slowdowns.

Server Mode (API):

llama-server -m models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  -c 8192 -ngl 999 -fa 1 --no-mmap

CLI Mode:

llama-cli --no-mmap -ngl 999 -fa 1 \
  -m models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  -p "Write a Strix Halo toolkit haiku."

5. Keep Updated

Refresh your authenticated toolboxes to the latest nightly/stable builds:

./refresh-toolboxes.sh all

⚙️ Host Configuration

This should work on any Strix Halo. For a complete list of available hardware, see: Strix Halo Hardware Database

Test Configuration

Component	Specification
Test Machine	Framework Desktop
CPU	Ryzen AI MAX+ 395 "Strix Halo"
System Memory	128 GB RAM
GPU Memory	512 MB allocated in BIOS
Host OS	Fedora 43, Linux 6.18.5-200.fc43.x86_64

Kernel Parameters (tested on Fedora 42)

Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):

iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856

Parameter	Purpose
`iommu=pt`	Sets IOMMU to "Pass-Through" mode. This helps performance, reducing overhead for the iGPU unified memory access.
`amdgpu.gttsize=126976`	Caps GPU unified memory to 124 GiB; 126976 MiB ÷ 1024 = 124 GiB
`ttm.pages_limit=32505856`	Caps pinned memory to 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB

Apply with:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

Ubuntu 24.04

See TechnigmaAI's Guide.

📊 Performance Benchmarks

🌐 Interactive Viewer: https://kyuz0.github.io/amd-strix-halo-toolboxes/

See docs/benchmarks.md for full logs.

💾 Memory Planning & VRAM Estimator

Strix Halo uses unified memory. To estimate VRAM requirements for models (including context overhead), use the included tool:

gguf-vram-estimator.py models/my-model.gguf --contexts 32768

See docs/vram-estimator.md for details.

🛠️ Building Locally

You can build the containers yourself to customize packages or llama.cpp versions. Instructions: docs/building.md.

🌩️ Distributed Inference

Run models across a cluster of Strix Halo machines using run_distributed_llama.py.

Setup SSH keys between nodes.
Run python3 run_distributed_llama.py on the main node.
Follow the TUI to launch the cluster.

README.md

AMD Strix Halo Llama.cpp Toolboxes

📺 Video Demo

Table of Contents

✅ Stable Configuration

✅ ROCm 7 Performance Regression Workaround Applied — 2026-02-04

📦 Supported Toolboxes

🚀 Quick Start

1. Create & Enter Toolbox

2. Check GPU Access

3. Download Model

4. Run Inference

5. Keep Updated

⚙️ Host Configuration

Test Configuration

Kernel Parameters (tested on Fedora 42)

Ubuntu 24.04

📊 Performance Benchmarks

💾 Memory Planning & VRAM Estimator

🛠️ Building Locally

🌩️ Distributed Inference

📚 More Documentation

🔗 References

README.md Unescape Escape

AMD Strix Halo Llama.cpp Toolboxes

📺 Video Demo

Table of Contents

✅ Stable Configuration

✅ ROCm 7 Performance Regression Workaround Applied — 2026-02-04

📦 Supported Toolboxes

🚀 Quick Start

1. Create & Enter Toolbox

2. Check GPU Access

3. Download Model

4. Run Inference

5. Keep Updated

⚙️ Host Configuration

Test Configuration

Kernel Parameters (tested on Fedora 42)

Ubuntu 24.04

📊 Performance Benchmarks

💾 Memory Planning & VRAM Estimator

🛠️ Building Locally

🌩️ Distributed Inference

📚 More Documentation

🔗 References

README.md