# amd-strix-halo-toolboxes

This repository provides Fedora Rawhide-based containers for AMD Ryzen AI MAX+ 395 (**Strix Halo**) systems, which pair an integrated GPU (gfx1151) with unified memory. The containers come pre-built with `llama.cpp` and all necessary GPU compute libraries.

## TL;DR - Performance Summary

After extensive testing, **Vulkan is currently the most stable option and the fastest for most models** on Strix Halo GPUs. ROCm 6.4.2 is faster only for BF16 models under 64GB:

| Backend | Status | Notes |
|---------|---------|-------|
| **Vulkan** | ✅ **Recommended** | Most stable; best overall performance on quantized models; only usable option above 64GB |
| **ROCm 6.4.2** | ⚠️ Limited | Works, and is faster on BF16 models, but becomes extremely slow for memory allocations above 64GB |
| **ROCm 7.0 beta** | ❌ Unstable | Frequent crashes under heavy load (`llama-bench`); basic usage is possible |

## Available Containers

| Container | Backend | Status | Use Case |
|-----------|---------|---------|----------|
| `vulkan` | Vulkan compute | Stable | **Primary recommendation** |
| `rocm-6.4.2` | ROCm 6.4.2 (HIP) | Stable for <64GB models | Smaller models only |
| `rocm-7beta` | ROCm 7.0 beta (HIP) | Beta/Unstable | Testing only |

All containers include up-to-date libraries from Fedora Rawhide, except the ROCm 7.0 beta container, which uses [official AMD RPMs](https://repo.radeon.com/rocm/el9/7.0_beta/main).

## Prerequisites

- [Podman](https://podman.io/) (or Docker with alias)
- [Toolbox](https://containertoolbx.org/)
- Linux kernel with AMD GPU (`amdgpu`) drivers
- AMD Strix Halo GPU with proper host configuration (see below)

## Quick Start

### 1. Pull Pre-built Images

```bash
# Recommended: Vulkan (most stable)
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan

# Optional: ROCm variants for testing
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7beta
```

### 2. Create Toolboxes

**For Vulkan (Recommended):**
```bash
toolbox create llama-vulkan \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan \
  -- \
    --device /dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined
```

**For ROCm 6.4.2:**
```bash
toolbox create llama-rocm-6.4.2 \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \
  -- \
    --device /dev/kfd \
    --device /dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined
```

**For ROCm 7.0 beta:**
```bash
toolbox create llama-rocm-7beta \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7beta \
  -- \
    --device /dev/kfd \
    --device /dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined
```

> **Note:** The `--` separator passes the remaining flags to Podman/Docker for GPU access.

### 3. Enter and Test

**Test Vulkan container:**
```bash
toolbox enter llama-vulkan
vulkaninfo | head -n 10
llama-cli --list-devices
```

**Test ROCm containers:**
```bash
toolbox enter llama-rocm-6.4.2
llama-cli --list-devices
rocm-smi
```
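
If `llama-cli --list-devices` reports no GPU, one thing worth checking first is whether the device nodes passed at `toolbox create` time are visible inside the container. The helper below is a minimal sketch; the function name `missing_gpu_nodes` is made up for this example and is not shipped in the images. It prints every path from its argument list that does not exist.

```bash
# Print every path from the argument list that does not exist.
# Hypothetical helper, not part of the container images.
missing_gpu_nodes() {
  local node
  for node in "$@"; do
    [ -e "$node" ] || printf '%s\n' "$node"
  done
}

# Inside a toolbox: Vulkan needs /dev/dri; the ROCm containers also need /dev/kfd.
# missing_gpu_nodes /dev/dri /dev/kfd
```

Empty output means every listed node is present. Any printed path points to a `--device` flag that was likely missed when the toolbox was created.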

## Performance Benchmarks

All benchmarks were run on an HP Z2 Mini G1a with 128GB of RAM, using `llama-bench` with all layers offloaded to the GPU.

### Prompt Processing (pp512) - tokens/second

| Model | Size | Params | Vulkan | ROCm 6.4.2 | ROCm 7 Beta | Winner |
|-------|------|---------|---------|-------------|-------------|---------|
| **Gemma3 12B Q8_0** | 13.40 GiB | 11.77B | 509.45 ± 1.01 | 224.43 ± 0.26 | 219.55 ± 0.41 | 🏆 **Vulkan** (+127%) |
| **Qwen3 MoE 30B.A3B BF16** | 56.89 GiB | 30.53B | 74.62 ± 0.63 | 157.87 ± 2.71 | 155.37 ± 2.64 | 🏆 **ROCm 6.4.2** (+112%) |
| **Llama4 17Bx16E (Scout) Q4_K** | 57.73 GiB | 107.77B | 136.47 ± 1.52 | 132.61 ± 0.65 | ❌ GPU Hang | 🏆 **Vulkan** (+3%) |
| **Llama3.3 70B Q8_0** | 75.65 GiB | 70.55B | 76.51 ± 0.47 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Llama4 17Bx16E (Scout) Q6_K** | 82.35 GiB | 107.77B | 139.05 ± 0.79 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Qwen3 MoE 235B.A22B Q3_K** | 96.99 GiB | 235.09B | 59.12 ± 0.39 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Llama4 17Bx16E (Scout) Q8_0** | 106.65 GiB | 107.77B | 148.17 ± 2.99 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |

### Text Generation (tg128) - tokens/second

| Model | Size | Params | Vulkan | ROCm 6.4.2 | ROCm 7 Beta | Winner |
|-------|------|---------|---------|-------------|-------------|---------|
| **Gemma3 12B Q8_0** | 13.40 GiB | 11.77B | 13.67 ± 0.01 | 13.80 ± 0.00 | 13.43 ± 0.00 | 🏆 **ROCm 6.4.2** (+1%) |
| **Qwen3 MoE 30B.A3B BF16** | 56.89 GiB | 30.53B | 7.36 ± 0.00 | 23.67 ± 0.02 | 22.21 ± 0.00 | 🏆 **ROCm 6.4.2** (+222%) |
| **Llama4 17Bx16E (Scout) Q4_K** | 57.73 GiB | 107.77B | 20.05 ± 0.00 | 17.61 ± 0.00 | ❌ GPU Hang | 🏆 **Vulkan** (+14%) |
| **Llama3.3 70B Q8_0** | 75.65 GiB | 70.55B | 2.72 ± 0.00 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Llama4 17Bx16E (Scout) Q6_K** | 82.35 GiB | 107.77B | 15.22 ± 0.01 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Qwen3 MoE 235B.A22B Q3_K** | 96.99 GiB | 235.09B | 15.97 ± 0.02 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
| **Llama4 17Bx16E (Scout) Q8_0** | 106.65 GiB | 107.77B | 12.22 ± 0.01 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |

### Performance Summary

**🏆 Vulkan Advantages:**
- Consistently stable across all model sizes
- Significantly better prompt processing on smaller quantized models (127% faster on Gemma3 12B)
- Only option that can handle >64GB models efficiently
- Moderate advantage on larger quantized models (3-14% faster on Llama4 Scout Q4_K)

**🏆 ROCm 6.4.2 Advantages:**
- **Dramatically superior performance on BF16 models** (112% faster prompt processing, 222% faster text generation on Qwen3 MoE 30B)
- Optimized native floating-point operations through HIP compute
- Better suited for models using native precision formats

**📊 Performance by Model Type:**
- **BF16/Native Precision Models**: ROCm 6.4.2 is the clear winner with 2-3x better performance
- **Small Quantized Models**: Vulkan has significant advantages for prompt processing
- **Large Quantized Models**: Vulkan holds a moderate lead under 64GB (3% on prompt processing, 14% on text generation for Llama4 Scout Q4_K)
- **Large Models (>64GB)**: Vulkan is the only viable option due to ROCm's memory allocation issues

**❌ ROCm 6.4.2 Limitations:**
- Extremely slow memory loading for models >64GB (unusable)
- Performance advantage limited to BF16/native precision models

**❌ ROCm 7.0 Beta Issues:**
- GPU hangs/crashes on larger models (Llama4 17B causes "GPU Hang" and core dump)
- Same slow-loading issue as ROCm 6.4.2 for models >64GB
- Performance similar to ROCm 6.4.2 when it works, but reliability is poor
- Uses [official AMD RPMs](https://repo.radeon.com/rocm/el9/7.0_beta/main) (beta quality)

**💡 Recommendation Strategy:**
- Use **ROCm 6.4.2** for BF16/native precision models under 64GB
- Use **Vulkan** for quantized models (especially smaller ones) and all models over 64GB
- For large quantized models under 64GB, both backends work; Vulkan was moderately faster (3-14%) in these tests
- Avoid ROCm 7.0 beta for production workloads
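
The strategy above can be sketched as a small shell function. The name `suggest_toolbox` is hypothetical, the 64 threshold is compared against the model size in GiB as listed in the benchmark tables, and treating `F16`/`F32` like `BF16` is an assumption (only BF16 was benchmarked here).

```bash
# Suggest a toolbox name for a model, following the recommendation list above.
# $1: model size in GiB (may be fractional)
# $2: weight format, e.g. BF16 or Q8_0
suggest_toolbox() {
  local size_gib="$1" format="$2"
  # awk handles the fractional comparison; exit status 0 means "under 64".
  if awk -v s="$size_gib" 'BEGIN { exit !(s < 64) }'; then
    case "$format" in
      # F16/F32 are assumed to behave like BF16; only BF16 was measured.
      BF16|F16|F32) echo "llama-rocm-6.4.2" ;;
      *)            echo "llama-vulkan" ;;
    esac
  else
    # Above 64, ROCm loading was too slow to be usable in these tests.
    echo "llama-vulkan"
  fi
}

suggest_toolbox 56.89 BF16   # prints llama-rocm-6.4.2
suggest_toolbox 75.65 Q8_0   # prints llama-vulkan
```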


## VRAM Planning with `gguf-vram-estimator.py`

### Why Model File Size Isn't the Whole Story

A model's VRAM footprint has three main components:

1.  **Model Weights:** The static size of the model on disk.
2.  **Context Memory (KV Cache):** A dynamic buffer that grows with the number of tokens in your context (linearly for standard full-attention models). For every token processed, a Key/Value state is stored in VRAM. This is often the largest variable consumer of memory.
3.  **Overhead:** A semi-fixed amount of memory for compute buffers, drivers, and other scratchpads. This can be 1-3 GiB or more.

The total VRAM required is `Model Size + Context Memory + Overhead`. To run a model, you must have enough memory for all three.
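
As a rough illustration of the formula, the sketch below uses the standard KV cache size calculation for a model with full attention on every layer. This is not necessarily how `gguf-vram-estimator.py` computes its numbers, and the function names are hypothetical. The Qwen3-235B-A22B shape used here (94 layers, 4 KV heads, head dimension 128, 16-bit cache) is an assumption taken from the model's published configuration, not from this README.

```bash
# KV cache size in GiB for a full-attention transformer:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
kv_cache_gib() {
  awk -v layers="$1" -v kv_heads="$2" -v head_dim="$3" -v bytes="$4" -v tokens="$5" \
    'BEGIN { printf "%.2f\n", 2 * layers * kv_heads * head_dim * bytes * tokens / (1024 * 1024 * 1024) }'
}

# Total = model size + context memory + overhead (all in GiB).
total_vram_gib() {
  awk -v model="$1" -v context="$2" -v overhead="$3" \
    'BEGIN { printf "%.2f\n", model + context + overhead }'
}

# Assumed Qwen3-235B-A22B shape, 65,536-token context.
kv_cache_gib 94 4 128 2 65536      # prints 11.75
total_vram_gib 97.00 11.75 2.00    # prints 110.75
```

Both values match the 65,536-token row of the Qwen3-235B estimate shown in Scenario 3. Models that use sliding-window or chunked attention on some layers will need less than this formula predicts.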

### The `gguf-vram-estimator.py` Utility

To help plan your workload, this repository includes the `gguf-vram-estimator.py` utility. It inspects a GGUF file and calculates the total VRAM needed for different context lengths, allowing you to make informed decisions about which model quantization and context size will fit in your system's memory.

#### How to Use

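The script is included in the container. Run it by pointing it at a GGUF model file. For a model split across several files, pass the first one (for example, the file ending in `-00001-of-00003.gguf`):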

```bash
# Syntax
gguf-vram-estimator.py <path-to-gguf-file> [options]
```
**Key Options:**
- `--contexts`: A space-separated list of context sizes to calculate (e.g., `--contexts 4096 16384`).
- `--overhead`: A float value to set the estimated overhead in GiB (default: `2.0`).

### Practical Examples: Planning for a 128GB Strix Halo System

The key to using a unified memory system is balancing model quality (quantization) against context length.

#### Scenario 1: High Quality, Short Context (Coding & Chat)

You need the highest precision for tasks that don't require massive context windows.

**Goal:** Run the highest quality `Llama-4-Scout` model (`Q8_0`) with a standard 8k-16k context.

```bash
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf
```
```
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
Max Context: 10,485,760 tokens
Model Size: 106.67 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
          4,096 |      768.00 MiB |      109.42 GiB
          8,192 |        1.50 GiB |      110.17 GiB
         16,384 |        1.88 GiB |      110.54 GiB
```
**Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds another **~1.9 GiB**, and with the 2 GiB overhead the total comes to **~111 GiB**. This fits within a 128GB system with roughly 17 GiB to spare.

#### Scenario 2: Massive Context, Lower Precision (RAG & Document Analysis)

You need to process a huge amount of text and are willing to trade some precision for a massive context window.

**Goal:** Run `Llama-4-Scout` with a roughly 1M-token (1,048,576) context. This requires a much smaller model quantization (`Q4_K_XL`).

```bash
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
```
```
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
Max Context: 10,485,760 tokens
Model Size: 57.74 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
        524,288 |       25.12 GiB |       84.87 GiB
      1,048,576 |       49.12 GiB |      108.87 GiB
```
**Analysis:** To enable this, we use a `Q4_K_XL` model that is only **57.7 GiB**. The 1M token context adds a massive **49.1 GiB** of memory. Together with the 2 GiB overhead, the total is **~109 GiB**, which fits on a 128GB system with roughly 19 GiB to spare.

#### Scenario 3: Fitting a Very Large Model

**Goal:** Determine the maximum viable context for the huge `Qwen3-235B` model.

```bash
gguf-vram-estimator.py models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf
```
```
--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
Max Context: 262,144 tokens
Model Size: 97.00 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
         65,536 |       11.75 GiB |      110.75 GiB
        131,072 |       23.50 GiB |      122.50 GiB
        262,144 |       47.00 GiB |      146.00 GiB
```
**Analysis:** The base model takes **97 GiB**, leaving roughly **29 GiB** for context after the 2 GiB overhead. The **131,072-token** context fits at 122.5 GiB total, which is the largest size in the table that stays within the system's 128GB capacity. Attempting the full 262k context would require `146 GiB` and fail.
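
For a rough upper bound between the listed context sizes, the table can be extrapolated linearly. The sketch below assumes context memory scales linearly with tokens (which the Qwen3 rows above suggest, since each doubling of context doubles the memory) and treats the 128GB system as 128 GiB of usable memory, which is optimistic because the OS also needs some. The function name is hypothetical.

```bash
# Largest context (in tokens) that fits a memory budget, by linear extrapolation
# from one reference row of the estimator output.
# $1: total memory (GiB)   $2: model size (GiB)   $3: overhead (GiB)
# $4: context memory of the reference row (GiB)   $5: tokens of the reference row
max_context_tokens() {
  awk -v total="$1" -v model="$2" -v overhead="$3" -v ref_gib="$4" -v ref_tokens="$5" \
    'BEGIN {
       budget = total - model - overhead
       if (budget <= 0) { print 0; exit }
       printf "%d\n", budget / (ref_gib / ref_tokens)
     }'
}

# Qwen3-235B Q3_K_XL, using the 65,536-token row as reference.
max_context_tokens 128 97.00 2.00 11.75 65536   # prints 161748
```

The result of about 161k tokens sits between the 131,072 and 262,144 rows, consistent with the table. In practice it is safer to stay well below this ceiling.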

> **Key Takeaway:** This tool is essential for balancing **Model Quality (Quantization)** vs. **Context Length** to fit your specific task within your system's VRAM limits.


## Building Containers Locally (Optional)

If you prefer to build the containers yourself:

```bash
# Build all variants
podman build -t localhost/llama-vulkan -f Dockerfile.vulkan .
podman build -t localhost/llama-rocm-6.4.2 -f Dockerfile.rocm-6.4.2 .
podman build -t localhost/llama-rocm-7beta -f Dockerfile.rocm-7beta .
```

### Create Toolboxes from Local Images

```bash
# Using locally built images
toolbox create llama-vulkan-local \
  --image localhost/llama-vulkan \
  -- \
    --device /dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined

toolbox create llama-rocm-local \
  --image localhost/llama-rocm-6.4.2 \
  -- \
    --device /dev/kfd \
    --device /dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined
```

## Host Configuration

These toolboxes should work on any Strix Halo device. For a complete list of available hardware, see: [Strix Halo Hardware Database](https://strixhalo-homelab.d7.wtf/Hardware)

### My Test Configuration
| Component | Specification |
|-----------|---------------|
| **Test Machine** | HP Z2 Mini G1a |
| **CPU** | Ryzen AI MAX+ 395 "Strix Halo" |
| **System Memory** | 128 GB RAM |
| **GPU Memory** | 512 MB allocated in BIOS |
| **Host OS** | Fedora 42, kernel 6.15.6-200.fc42.x86_64 |

### Kernel Parameters

Add these boot parameters to enable unified memory and optimal performance:

```
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```

| Parameter | Purpose |
|-----------|---------|
| `amd_iommu=off` | Disables IOMMU for lower latency |
| `amdgpu.gttsize=131072` | Sets the GTT size to 131072 MiB (128 GiB), enabling unified GPU/system memory |
| `ttm.pages_limit=33554432` | Allows large pinned memory allocations (33554432 pages × 4 KiB = 128 GiB) |

**Apply the changes:**
```bash
# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
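
After rebooting, it can be useful to confirm that the parameters actually reached the kernel. The following is a minimal sketch with a hypothetical helper, `missing_kernel_params`, which checks a command-line string for each parameter name and prints the ones that are absent. It checks names only, not values.

```bash
# Print each parameter name that does not appear as "name=..." in a kernel command line.
# $1: the command line string; remaining arguments: parameter names to look for.
missing_kernel_params() {
  local cmdline="$1" key
  shift
  for key in "$@"; do
    case " $cmdline " in
      *" $key="*) ;;                      # present, nothing to report
      *) printf '%s\n' "$key" ;;          # absent
    esac
  done
}

# On the host, after reboot:
# missing_kernel_params "$(cat /proc/cmdline)" amd_iommu amdgpu.gttsize ttm.pages_limit
```

Empty output means all three parameters are present on the running kernel's command line.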