Benchmark and container updates
This commit is contained in:
@@ -1,307 +1,192 @@
|
||||
# amd-strix-halo-toolboxes
|
||||
# AMD Strix Halo Llama.cpp Toolboxes
|
||||
|
||||
Fedora Rawhide-based containers for AMD Ryzen AI MAX+ 395 **Strix Halo** chips with integrated GPU (gfx1151) and unified memory. Pre-built with `llama.cpp` and GPU compute libraries.
|
||||
This project provides pre-built containers (“toolboxes”) for running LLMs on **AMD Ryzen AI Max “Strix Halo”** integrated GPUs. Toolbx is the standard developer container system in Fedora (and now works on Ubuntu, openSUSE, Arch, etc).
|
||||
|
||||
## Why Toolbx?
|
||||
|
||||
* Reproducible: never pollute your host system
|
||||
* Seamless: shares your home and GPU devices, works like a native shell
|
||||
* Flexible: easy to switch between Vulkan (open/closed drivers) and ROCm
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [1. Performance Summary](#1-performance-summary)
|
||||
- [2. Available Containers](#2-available-containers)
|
||||
- [3. Quick Start](#3-quick-start)
|
||||
- [3.1 Prerequisites](#31-prerequisites)
|
||||
- [3.2 Pull Pre-built Images](#32-pull-pre-built-images)
|
||||
- [3.3 Create Toolboxes](#33-create-toolboxes)
|
||||
- [3.4 Enter and Test](#34-enter-and-test)
|
||||
- [4. Performance Benchmarks](#4-performance-benchmarks)
|
||||
- [4.1 Prompt Processing Results](#41-prompt-processing-pp512---tokenssecond)
|
||||
- [4.2 Text Generation Results](#42-text-generation-tg128---tokenssecond)
|
||||
- [4.3 Performance Analysis](#43-performance-analysis)
|
||||
- [5. Memory Planning](#5-memory-planning)
|
||||
- [5.1 VRAM Estimation Tool](#51-the-gguf-vram-estimatorpy-utility)
|
||||
- [5.2 Usage Examples](#52-practical-examples-planning-for-a-128gb-strix-halo-system)
|
||||
- [6. Building Locally](#6-building-containers-locally-optional)
|
||||
- [7. Host Configuration](#7-host-configuration)
|
||||
- [7.1 Test Configuration](#71-test-configuration)
|
||||
- [7.2 Kernel Parameters (tested on Fedora 42)](#72-kernel-parameters-tested-on-fedora-42)
|
||||
- [7.3 Ubuntu 24.04](#73-ubuntu-2404)
|
||||
1. [Llama.cpp Compiled for Every Backend](#1-llamacpp-compiled-for-every-backend)
|
||||
1.1 [Supported Container Images](#11-supported-container-images)
|
||||
2. [Quickest Usage Example](#2-quickest-usage-example)
|
||||
2.1 [Creating the toolboxes with GPU access](#21-creating-the-toolboxes-with-gpu-access)
|
||||
2.2 [Running models inside the toolboxes](#22-running-models-inside-the-toolboxes)
|
||||
3. [Performance Benchmarks (Key Results)](#3-performance-benchmarks-key-results)
|
||||
4. [Memory Planning & VRAM Estimator](#4-memory-planning--vram-estimator)
|
||||
5. [Building Containers Locally](#5-building-containers-locally)
|
||||
6. [Host Configuration](#6-host-configuration)
|
||||
6.1 [Test Configuration](#61-test-configuration)
|
||||
6.2 [Kernel Parameters (tested on Fedora 42)](#62-kernel-parameters-tested-on-fedora-42)
|
||||
6.3 [Ubuntu 24.04](#63-ubuntu-2404)
|
||||
7. [More Documentation](#7-more-documentation)
|
||||
8. [References](#8-references)
|
||||
|
||||
## 1. Llama.cpp Compiled for Every Backend
|
||||
|
||||
## 1. Performance Summary
|
||||
This project uses [Llama.cpp](https://github.com/ggerganov/llama.cpp), a high-performance inference engine for running local LLMs (large language models) on CPUs and GPUs. Llama.cpp is open source, extremely fast, and is the only engine supporting all key backends for AMD Strix Halo: Vulkan (RADV, AMDVLK) and ROCm/HIP
|
||||
|
||||
**Vulkan is currently the most stable and performant option** for Strix Halo GPUs:
|
||||
* **Vulkan** is a cross-platform, low-level graphics and compute API. Llama.cpp can use Vulkan for GPU inference with either the open Mesa RADV driver or AMD's "official" open AMDVLK driver. This is the most stable and supported option for AMD CPUs at the moment.
|
||||
* **ROCm** is AMD's open-source answer to CUDA: a GPU compute stack for machine learning and HPC. With ROCm, you can run Llama.cpp on AMD GPUs in a way similar to how CUDA works on NVIDIA - this is not the most stable/mature, but recently it's been getting better.
|
||||
|
||||
| Backend | Status | Notes |
|
||||
|---------|---------|-------|
|
||||
| **Vulkan** | ✅ **Recommended** | Most stable, best performance across all model sizes |
|
||||
| **ROCm 6.4.2** | ⚠️ Limited | Works ok, but extremely slow past 64GB memory allocations [GitHub Issue #15018](https://github.com/ggml-org/llama.cpp/issues/15018) |
|
||||
| **ROCm 7.0 beta** | ❌ Unstable | Frequent crashes under heavy load (llama-bench), basic usage possible |
|
||||
### 1.1 Supported Container Images
|
||||
|
||||
## 2. Available Containers
|
||||
| Container Tag | Backend/Stack | Purpose / Notes |
|
||||
| --------------- | ------------------- | ------------------------------------------------------------------------------------- |
|
||||
| `vulkan-amdvlk` | Vulkan (AMDVLK) | Fastest backend—use if model loads. AMD Open Source driver. May fail on >40 GiB/BF16. |
|
||||
| `vulkan-radv` | Vulkan (Mesa RADV) | Most stable/compatible. Recommended for most users and all models. |
|
||||
| `rocm-6.4.2` | ROCm 6.4.2 (HIP) | Stable for smaller (<40 GiB) and BF16 models. Crashes on larger quantized models. |
|
||||
| `rocm-7beta` | ROCm 7.0 Beta (HIP) | Latest ROCm beta. No real improvement for Llama.cpp. Similar model limits to 6.4.2. |
|
||||
| `rocm-7rc` | ROCm 7.0 RC (HIP) | Release candidate ROCm 7.0. Same behavior as above. |
|
||||
|
||||
| Container | Backend | Status | Use Case |
|
||||
|-----------|---------|---------|----------|
|
||||
| `vulkan` | Vulkan compute | Stable | **Primary recommendation** |
|
||||
| `rocm-6.4.2` | ROCm 6.4.2 (HIP) | Stable for <64GB models | Smaller models only |
|
||||
| `rocm-7beta` | ROCm 7.0 beta (HIP) | Beta/Unstable | Testing only |
|
||||
> *Each container is based on Fedora Rawhide and is built for maximum compatibility and performance on Strix Halo.*
|
||||
|
||||
All containers include up-to-date libraries from Fedora Rawhide, except ROCm 7.0 beta which uses [official AMD RPMs](https://repo.radeon.com/rocm/el9/7.0_beta/main).
|
||||
---
|
||||
|
||||
## 3. Quick Start
|
||||
## 2. Quickest Usage Example
|
||||
|
||||
### 3.1 Prerequisites
|
||||
### 2.1 Creating the toolboxes with GPU access
|
||||
|
||||
- [Podman](https://podman.io/) (or Docker with alias)
|
||||
- [Toolbox](https://containertoolbx.org/)
|
||||
- Linux kernel with AMD GPU (`amdgpu`) drivers
|
||||
- AMD Strix Halo GPU with proper host configuration (see [7. Host Configuration](#7-host-configuration))
|
||||
To use Llama.cpp with hardware acceleration inside a toolbox container, you must expose the GPU devices from your host. The exact flags and devices depend on the backend:
|
||||
|
||||
### 3.2 Pull Pre-built Images
|
||||
* **For Vulkan (RADV/AMDVLK):** Only `/dev/dri` is required.
|
||||
*Add the user to the video group for access to GPU devices.*
|
||||
|
||||
```bash
|
||||
# Recommended: Vulkan (most stable)
|
||||
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan
|
||||
```sh
|
||||
toolbox create llama-vulkan-radv \
|
||||
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
|
||||
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
|
||||
```
|
||||
|
||||
# Optional: ROCm variants for testing
|
||||
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2
|
||||
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7beta
|
||||
* **For ROCm:** You must expose both `/dev/dri` and `/dev/kfd` (and sometimes `/dev/hsa` for older hardware), and add the user to extra groups for compute access.
|
||||
|
||||
```sh
|
||||
toolbox create llama-rocm-6.4.2 \
|
||||
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \
|
||||
-- --device /dev/dri --device /dev/kfd \
|
||||
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
|
||||
```
|
||||
|
||||
*Swap in the image/tag for the backend you want to use.*
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> * `--device /dev/dri` provides graphics/video device nodes.
|
||||
> * `--device /dev/kfd` is required for ROCm compute.
|
||||
> * Extra groups (`video`, `render`, `sudo`) may be required for full access to GPU nodes and compute features, especially with ROCm.
|
||||
> * Use `--security-opt seccomp=unconfined` to avoid seccomp sandbox issues (needed for some GPU syscalls).
|
||||
|
||||
### 2.2 Running models inside the toolboxes 
|
||||
|
||||
Before running any commands, you must first enter your toolbox container shell using:
|
||||
|
||||
```sh
|
||||
toolbox enter llama-vulkan-radv
|
||||
```
|
||||
|
||||
### 3.3 Create Toolboxes
|
||||
*This will drop you into a shell inside the toolbox, using your regular user account. The container shares your host home directory—so anything in your home is directly accessible (take care: your files are exposed and writable inside the toolbox!).*
|
||||
|
||||
Once inside, the following commands show how to run local LLMs:
|
||||
|
||||
* `llama-cli --list-devices`
|
||||
*Lists available GPU devices for Llama.cpp.*
|
||||
* `llama-cli --no-mmap --ngl 999 -fa -m <model>`
|
||||
*Runs inference on the specified model, with all layers on GPU and flash attention enabled (replace \*\* with your model path).*
|
||||
|
||||
## 3. Performance Benchmarks (Key Results)
|
||||
|
||||
Below are some results from real runs on Strix Halo hardware of `llama-bench`. For full tables and model-by-model breakdowns (including both prompt processing and token generation speeds), see docs/benchmarks.md.
|
||||
|
||||
| Model | Vulkan (AMDVLK) | Vulkan (RADV) | ROCm 6.4.2 | ROCm 7.0 Beta | ROCm 7.0 RC | 🏆 Best PP | 🏆 Best TG |
|
||||
| ------------------------------ | ---------------- | ---------------- | ---------------- | ---------------- | ---------------- | ------------- | -------------- |
|
||||
| **Gemma3 12B Q8\_0** | 686 pp / 13.9 tg | 509 pp / 13.7 tg | 223 pp / 13.8 tg | 223 pp / 13.8 tg | 223 pp / 13.8 tg | **AMDVLK** | **AMDVLK** |
|
||||
| **Gemma3 27B BF16** | ❌ Crash | 135 pp / 4.0 tg | 89 pp / 4.0 tg | 82 pp / 4.0 tg | 83 pp / 4.0 tg | **RADV** | **ROCm6.4.2** |
|
||||
| **Llama-4-Scout 17B Q8\_0** | 241 pp / 12.3 tg | 146 pp / 12.3 tg | ❌ Crash | ❌ Crash | ❌ Crash | **AMDVLK** | **AMDVLK** |
|
||||
| **Llama-4-Scout 17B Q4\_K XL** | 209 pp / 20.1 tg | 133 pp / 20.0 tg | 133 pp / 17.3 tg | 134 pp / 17.3 tg | ❌ Crash | **AMDVLK** | **AMDVLK** |
|
||||
| **Qwen3 MoE 30B BF16** | 90 pp / 8.0 tg | 71 pp / 7.3 tg | 158 pp / 22.9 tg | 151 pp / 23.8 tg | 155 pp / 23.1 tg | **ROCm6.4.2** | **ROCm7 Beta** |
|
||||
| **Qwen3-235B Q3\_K XL** | 99 pp / 15.7 tg | 58 pp / 16.3 tg | 69 pp / 13.5 tg | ❌ Crash | 75 pp / 13.6 tg | **AMDVLK** | **RADV** |
|
||||
|
||||
|
||||
* **pp = tokens/sec, prompt processing (pre-fill, max speed)**
|
||||
* **tg = tokens/sec, generation (interactive, single token at a time)**
|
||||
* 🏆 denotes the winner
|
||||
|
||||
**Takeaways:**
|
||||
|
||||
* **Vulkan AMDVLK** is the fastest—when it works. May crash on large or BF16 models.
|
||||
* **Vulkan RADV** is the most stable and compatible (recommended for most usage).
|
||||
* **ROCm** is only superior on BF16 models, otherwise less stable and may crash or hang.
|
||||
|
||||
## 4. Memory Planning & VRAM Estimator
|
||||
|
||||
Running large language models locally requires estimating **total VRAM required**—not just for the model weights, but also for the "context" (number of active tokens) and extra overhead.
|
||||
|
||||
Use `gguf-vram-estimator.py` to check exactly how much memory you need for a given `.gguf` model and target context length. Example output:
|
||||
|
||||
**For Vulkan (Recommended):**
|
||||
```bash
|
||||
toolbox create llama-vulkan \
|
||||
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan \
|
||||
-- \
|
||||
--device /dev/dri \
|
||||
--group-add video \
|
||||
--security-opt seccomp=unconfined
|
||||
```
|
||||
$ gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf --contexts 4096 32768 1048576
|
||||
|
||||
**For ROCm 6.4.2:**
|
||||
```bash
|
||||
toolbox create llama-rocm-6.4.2 \
|
||||
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.2 \
|
||||
-- \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
--group-add video \
|
||||
--security-opt seccomp=unconfined
|
||||
```
|
||||
|
||||
**For ROCm 7.0 beta:**
|
||||
```bash
|
||||
toolbox create llama-rocm-7beta \
|
||||
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7beta \
|
||||
-- \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
--group-add video \
|
||||
--security-opt seccomp=unconfined
|
||||
```
|
||||
|
||||
> **Note:** The `--` separator passes the remaining flags to Podman/Docker for GPU access.
|
||||
|
||||
### 3.4 Enter and Test
|
||||
|
||||
**Test Vulkan container:**
|
||||
```bash
|
||||
toolbox enter llama-vulkan
|
||||
vulkaninfo | head -n 10
|
||||
llama-cli --list-devices
|
||||
```
|
||||
|
||||
**Test ROCm containers:**
|
||||
```bash
|
||||
toolbox enter llama-rocm-6.4.2
|
||||
llama-cli --list-devices
|
||||
rocm-smi
|
||||
```
|
||||
|
||||
## 4. Performance Benchmarks
|
||||
|
||||
All benchmarks performed on HP Z2 Mini G1a with 128GB RAM, using `llama-bench` with all layers offloaded to GPU.
|
||||
|
||||
### 4.1 Prompt Processing (pp512) - tokens/second
|
||||
|
||||
| Model | Size | Params | Vulkan | ROCm 6.4.2 | ROCm 7 Beta | Winner |
|
||||
|-------|------|---------|---------|-------------|-------------|---------|
|
||||
| **Gemma3 12B Q8_0** | 13.40 GiB | 11.77B | 509.45 ± 1.01 | 224.43 ± 0.26 | 219.55 ± 0.41 | 🏆 **Vulkan** (+132%) |
|
||||
| **Qwen3 MoE 30B.A3B BF16** | 56.89 GiB | 30.53B | 74.62 ± 0.63 | 157.87 ± 2.71 | 155.37 ± 2.64 | 🏆 **ROCm 6.4.2** (+112%) |
|
||||
| **Llama4 17Bx16E (Scout) Q4_K** | 57.73 GiB | 107.77B | 136.47 ± 1.52 | 132.61 ± 0.65 | ❌ GPU Hang | 🏆 **Vulkan** (+3%) |
|
||||
| **Llama3.3 70B Q8_0** | 75.65 GiB | 70.55B | 76.51 ± 0.47 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Llama4 17Bx16E (Scout) Q6_K** | 82.35 GiB | 107.77B | 139.05 ± 0.79 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Qwen3 MoE 235B.A22B Q3_K** | 96.99 GiB | 235.09B | 59.12 ± 0.39 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Llama4 17Bx16E (Scout) Q8_0** | 106.65 GiB | 107.77B | 148.17 ± 2.99 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
|
||||
### 4.2 Text Generation (tg128) - tokens/second
|
||||
|
||||
| Model | Size | Params | Vulkan | ROCm 6.4.2 | ROCm 7 Beta | Winner |
|
||||
|-------|------|---------|---------|-------------|-------------|---------|
|
||||
| **Gemma3 12B Q8_0** | 13.40 GiB | 11.77B | 13.67 ± 0.01 | 13.80 ± 0.00 | 13.43 ± 0.00 | 🏆 **ROCm 6.4.2** (+1%) |
|
||||
| **Qwen3 MoE 30B.A3B BF16** | 56.89 GiB | 30.53B | 7.36 ± 0.00 | 23.67 ± 0.02 | 22.21 ± 0.00 | 🏆 **ROCm 6.4.2** (+222%) |
|
||||
| **Llama4 17Bx16E (Scout) Q4_K** | 57.73 GiB | 107.77B | 20.05 ± 0.00 | 17.61 ± 0.00 | ❌ GPU Hang | 🏆 **Vulkan** (+14%) |
|
||||
| **Llama3.3 70B Q8_0** | 75.65 GiB | 70.55B | 2.72 ± 0.00 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Llama4 17Bx16E (Scout) Q6_K** | 82.35 GiB | 107.77B | 15.22 ± 0.01 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Qwen3 MoE 235B.A22B Q3_K** | 96.99 GiB | 235.09B | 15.97 ± 0.02 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
| **Llama4 17Bx16E (Scout) Q8_0** | 106.65 GiB | 107.77B | 12.22 ± 0.01 | ⚠️ Too slow | ⚠️ Too slow | 🏆 **Vulkan only** |
|
||||
|
||||
### 4.3 Performance Analysis
|
||||
|
||||
**🏆 Vulkan Advantages:**
|
||||
- Consistently stable across all model sizes
|
||||
- Significantly better prompt processing on smaller quantized models (127% faster on Gemma3 12B)
|
||||
- Only option that can handle >64GB models efficiently
|
||||
- Moderate advantage on larger quantized models (3-14% better on Llama4 17B)
|
||||
|
||||
**🏆 ROCm 6.4.2 Advantages:**
|
||||
- **Dramatically superior performance on BF16 models** (112% faster prompt processing, 222% faster text generation on Qwen3 MoE 30B)
|
||||
- Optimized native floating-point operations through HIP compute
|
||||
- Better suited for models using native precision formats
|
||||
|
||||
**📊 Performance by Model Type:**
|
||||
- **BF16/Native Precision Models**: ROCm 6.4.2 is the clear winner with 2-3x better performance
|
||||
- **Small Quantized Models**: Vulkan has significant advantages for prompt processing
|
||||
- **Large Quantized Models**: Performance is similar between backends (differences within noise)
|
||||
- **Large Models (>64GB)**: Vulkan is the only viable option due to ROCm's memory allocation issues
|
||||
|
||||
**❌ ROCm 6.4.2 Limitations:**
|
||||
- Extremely slow memory loading for models >64GB (unusable)
|
||||
- Performance advantage limited to BF16/native precision models
|
||||
|
||||
**❌ ROCm 7.0 Beta Issues:**
|
||||
- GPU hangs/crashes on larger models (Llama4 17B causes "GPU Hang" and core dump)
|
||||
- Similar slow loading issues as ROCm 6.4.2 for models >64GB
|
||||
- Performance similar to ROCm 6.4.2 when it works, but reliability is poor
|
||||
- Uses [official AMD RPMs](https://repo.radeon.com/rocm/el9/7.0_beta/main) (beta quality)
|
||||
|
||||
**💡 Recommendation Strategy:**
|
||||
- Use **ROCm 6.4.2** for BF16/native precision models under 64GB
|
||||
- Use **Vulkan** for quantized models (especially smaller ones) and all models over 64GB
|
||||
- For large quantized models under 64GB, either backend performs similarly
|
||||
- Avoid ROCm 7.0 beta for production workloads
|
||||
|
||||
## 5. Memory Planning
|
||||
|
||||
VRAM usage has three components: **Model Weights + Context Memory (KV Cache) + Overhead**. The `gguf-vram-estimator.py` tool helps you choose the right model quantization and context size to fit within 128GB.
|
||||
|
||||
### 5.1 The `gguf-vram-estimator.py` Utility
|
||||
|
||||
Calculate total VRAM requirements for different context lengths:
|
||||
|
||||
```bash
|
||||
# Basic usage
|
||||
gguf-vram-estimator.py <path-to-gguf-file> [options]
|
||||
```
|
||||
|
||||
**Key Options:**
|
||||
- `--contexts`: Space-separated list of context sizes (e.g., `--contexts 4096 16384`)
|
||||
- `--overhead`: Estimated overhead in GiB (default: `2.0`)
|
||||
|
||||
### 5.2 Practical Examples: Planning for a 128GB Strix Halo System
|
||||
|
||||
#### Scenario 1: High Quality, Short Context (Coding & Chat)
|
||||
|
||||
```bash
|
||||
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf
|
||||
```
|
||||
```
|
||||
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
|
||||
Max Context: 10,485,760 tokens
|
||||
Model Size: 106.67 GiB (from file size)
|
||||
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
|
||||
Model Size: 57.74 GiB
|
||||
Incl. Overhead: 2.00 GiB
|
||||
|
||||
--- Memory Footprint Estimation ---
|
||||
Context Size | Context Memory | Est. Total VRAM
|
||||
---------------------------------------------------
|
||||
4,096 | 768.00 MiB | 109.42 GiB
|
||||
8,192 | 1.50 GiB | 110.17 GiB
|
||||
16,384 | 1.88 GiB | 110.54 GiB
|
||||
4,096 | 1.88 GiB | 61.62 GiB
|
||||
32,768 | 15.06 GiB | 74.80 GiB
|
||||
1,048,576 | 49.12 GiB | 108.87 GiB
|
||||
```
|
||||
**Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds another **~1.9 GiB**, for a total of **~111 GiB**. This fits comfortably within a 128GB system.
|
||||
|
||||
#### Scenario 2: Large Context, Lower Precision (Long Document/Data/Code Analysis, Back-and-Forth Feedback)
|
||||
With Q4\_K quantization, **Llama-4-Scout 17B** can reach a 1M token context and still fit within a 128GB system, but... **it will be extremely slow to process such a long context**: see benchmarks (e.g. \~200 tokens/sec for prompt processing). Processing a 1M token context may take hours.
|
||||
|
||||
Contrast: Qwen3-235B Q3\_K (quantized, 97GiB model):
|
||||
|
||||
```bash
|
||||
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
|
||||
```
|
||||
```
|
||||
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
|
||||
Max Context: 10,485,760 tokens
|
||||
Model Size: 57.74 GiB (from file size)
|
||||
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
|
||||
$ gguf-vram-estimator.py models/qwen3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf --contexts 65536 131072 262144
|
||||
|
||||
--- Memory Footprint Estimation ---
|
||||
Context Size | Context Memory | Est. Total VRAM
|
||||
---------------------------------------------------
|
||||
524,288 | 25.12 GiB | 84.87 GiB
|
||||
1,048,576 | 49.12 GiB | 108.87 GiB
|
||||
```
|
||||
**Analysis:** To enable this, we use the `Q4_K_XL` quantization of Llama-4-Scout that is only **57.7 GiB**. The 1M token context adds a massive **49.1 GiB** of memory. The total, **~109 GiB**, is a tight but achievable fit on a 128GB system.
|
||||
|
||||
#### Scenario 3: Fitting a Very Large Model
|
||||
|
||||
```bash
|
||||
gguf-vram-estimator.py models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf
|
||||
```
|
||||
```
|
||||
--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
|
||||
Max Context: 262,144 tokens
|
||||
Model Size: 97.00 GiB (from file size)
|
||||
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
|
||||
|
||||
--- Memory Footprint Estimation ---
|
||||
Context Size | Context Memory | Est. Total VRAM
|
||||
---------------------------------------------------
|
||||
65,536 | 11.75 GiB | 110.75 GiB
|
||||
131,072 | 23.50 GiB | 122.50 GiB
|
||||
262,144 | 47.00 GiB | 146.00 GiB
|
||||
```
|
||||
**Analysis:** The base model takes **97 GiB**. You have approximately **30 GiB** of headroom. This allows for a very large context of **~131k tokens** before exceeding the system's 128GB capacity. Attempting the full 262k context would require `146 GiB` and fail.
|
||||
|
||||
## 6. Building Containers Locally (Optional)
|
||||
|
||||
```bash
|
||||
# Build all variants
|
||||
podman build -t localhost/llama-vulkan -f Dockerfile.vulkan .
|
||||
podman build -t localhost/llama-rocm-6.4.2 -f Dockerfile.rocm-6.4.2 .
|
||||
podman build -t localhost/llama-rocm-7beta -f Dockerfile.rocm-7beta .
|
||||
65,536 | 11.75 GiB | 110.75 GiB
|
||||
131,072 | 23.50 GiB | 122.50 GiB
|
||||
262,144 | 47.00 GiB | 146.00 GiB
|
||||
```
|
||||
|
||||
### Create Toolboxes from Local Images
|
||||
For Qwen3-235B, **128GB RAM allows you to run with context up to \~130k tokens.**
|
||||
|
||||
```bash
|
||||
# Using locally built images
|
||||
toolbox create llama-vulkan-local \
|
||||
--image localhost/llama-vulkan \
|
||||
-- \
|
||||
--device /dev/dri \
|
||||
--group-add video \
|
||||
--security-opt seccomp=unconfined
|
||||
* The estimator lets you plan ahead and avoid out-of-memory errors when loading or using models.
|
||||
* For more examples and a breakdown of VRAM components, see docs/vram-estimator.md.
|
||||
|
||||
toolbox create llama-rocm-local \
|
||||
--image localhost/llama-rocm-6.4.2 \
|
||||
-- \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
--group-add video \
|
||||
--security-opt seccomp=unconfined
|
||||
```
|
||||
---
|
||||
|
||||
## 7. Host Configuration
|
||||
## 5. Building Containers Locally
|
||||
|
||||
Pre-built toolbox container images are published on Docker Hub for immediate use. If you wish to build the containers yourself (for example, to customize packages or rebuild with a different llama.cpp version), see:
|
||||
|
||||
Full instructions: docs/building.md
|
||||
|
||||
---
|
||||
|
||||
## 6. Host Configuration
|
||||
|
||||
This should work on any Strix Halo. For a complete list of available hardware, see: [Strix Halo Hardware Database](https://strixhalo-homelab.d7.wtf/Hardware)
|
||||
|
||||
### 7.1 Test Configuration
|
||||
### 6.1 Test Configuration
|
||||
|
||||
| Component | Specification |
|
||||
|-----------|---------------|
|
||||
| **Test Machine** | HP Z2 Mini G1a |
|
||||
| **CPU** | Ryzen AI MAX+ 395 "Strix Halo" |
|
||||
| **System Memory** | 128 GB RAM |
|
||||
| **GPU Memory** | 512 MB allocated in BIOS |
|
||||
| **Host OS** | Fedora 42, kernel 6.15.6-200.fc42.x86_86_64 |
|
||||
| | |
|
||||
| ----------------- | --------------------------------------------- |
|
||||
| **Test Machine** | HP Z2 Mini G1a |
|
||||
| **CPU** | Ryzen AI MAX+ 395 "Strix Halo" |
|
||||
| **System Memory** | 128 GB RAM |
|
||||
| **GPU Memory** | 512 MB allocated in BIOS |
|
||||
| **Host OS** | Fedora 42, kernel 6.15.6-200.fc42.x86\_86\_64 |
|
||||
|
||||
### 7.2 Kernel Parameters (tested on Fedora 42)
|
||||
### 6.2 Kernel Parameters (tested on Fedora 42)
|
||||
|
||||
Add these boot parameters to enable unified memory and optimal performance:
|
||||
|
||||
@@ -309,23 +194,35 @@ Add these boot parameters to enable unified memory and optimal performance:
|
||||
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=335544321
|
||||
```
|
||||
|
||||
| Parameter | Purpose |
|
||||
|-----------|---------|
|
||||
| `amd_iommu=off` | Disables IOMMU for lower latency |
|
||||
| `amdgpu.gttsize=131072` | Enables unified GPU/system memory (up to 128 GB) |
|
||||
| `ttm.pages_limit=335544321` | Allows large pinned memory allocations |
|
||||
| Parameter | Purpose |
|
||||
| --------------------------- | ------------------------------------------------ |
|
||||
| `amd_iommu=off` | Disables IOMMU for lower latency |
|
||||
| `amdgpu.gttsize=131072` | Enables unified GPU/system memory (up to 128 GB) |
|
||||
| `ttm.pages_limit=335544321` | Allows large pinned memory allocations |
|
||||
|
||||
**Apply the changes:**
|
||||
```bash
|
||||
|
||||
```
|
||||
# Edit /etc/default/grub to add parameters to GRUB_CMDLINE_LINUX
|
||||
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
|
||||
sudo reboot
|
||||
```
|
||||
|
||||
### 7.3 Ubuntu 24.04
|
||||
### 6.3 Ubuntu 24.04
|
||||
|
||||
Follow this guide by TechnigmaAI for a working configuration on Ubuntu 22.04:
|
||||
Follow this guide by TechnigmaAI for a working configuration on Ubuntu 24.04:
|
||||
|
||||
https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-(Ubuntu-24.04)
|
||||
[https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-(Ubuntu-24.04)](https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-%28Ubuntu-24.04%29)
|
||||
|
||||
## 7. More Documentation
|
||||
|
||||
* docs/benchmarks.md: Full benchmark logs, model list, parsed results
|
||||
* docs/vram-estimator.md: Memory planning, practical example runs
|
||||
* docs/building.md: Local build, toolbox customization, advanced use
|
||||
|
||||
## 8. References
|
||||
|
||||
* The main reference for AMD Ryzen AI MAX home labs, by deseven (there's also a Discord server): [https://strixhalo-homelab.d7.wtf/](https://strixhalo-homelab.d7.wtf/)
|
||||
* Most comprehesive repostiry of test builds for Strix Halo by lhl -> [https://github.com/lhl/strix-halo-testing/tree/main](https://github.com/lhl/strix-halo-testing/tree/main)
|
||||
* Ubunru 24.04 configuration
|
||||
[https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-(Ubuntu-24.04)](https://github.com/technigmaai/technigmaai-wiki/wiki/AMD-Ryzen-AI-Max--395:-GTT--Memory-Step%E2%80%90by%E2%80%90Step-Instructions-%28Ubuntu-24.04%29)
|
||||
|
||||
Reference in New Issue
Block a user