Fixed ToC and added AMDVLK bug to track loading issues

Donato Capitella
2025-08-03 13:40:13 +01:00
parent e7e27e6cf3
commit 792dc9621f
+32 -15
@@ -10,21 +10,24 @@ This project provides pre-built containers (“toolboxes”) for running LLMs on
## Table of Contents
1. [Llama.cpp Compiled for Every Backend](#1-llamacpp-compiled-for-every-backend)
1.1 [Supported Container Images](#11-supported-container-images)
2. [Quickest Usage Example](#2-quickest-usage-example)
2.1 [Creating the toolboxes with GPU access](#21-creating-the-toolboxes-with-gpu-access)
2.2 [Running models inside the toolboxes](#22-running-models-inside-the-toolboxes)
2.3 [Downloading GGUF Models from HuggingFace](#23-downloading-gguf-models-from-huggingface)
3. [Performance Benchmarks (Key Results)](#3-performance-benchmarks-key-results)
4. [Memory Planning & VRAM Estimator](#4-memory-planning--vram-estimator)
5. [Building Containers Locally](#5-building-containers-locally)
6. [Host Configuration](#6-host-configuration)
6.1 [Test Configuration](#61-test-configuration)
6.2 [Kernel Parameters (tested on Fedora 42)](#62-kernel-parameters-tested-on-fedora-42)
6.3 [Ubuntu 24.04](#63-ubuntu-2404)
7. [More Documentation](#7-more-documentation)
8. [References](#8-references)
## 1. Llama.cpp Compiled for Every Backend
This project uses [Llama.cpp](https://github.com/ggerganov/llama.cpp), a high-performance inference engine for running local LLMs (large language models) on CPUs and GPUs. Llama.cpp is open source, extremely fast, and is the only engine supporting all key backends for AMD Strix Halo: Vulkan (RADV, AMDVLK) and ROCm/HIP.
@@ -96,6 +99,20 @@ Once inside, the following commands show how to run local LLMs:
* `llama-cli --no-mmap --ngl 999 -fa -m <model>`
*Runs inference on the specified model, with all layers on GPU and flash attention enabled (replace `<model>` with your model path).*
## 2.3 Downloading GGUF Models from HuggingFace
Most Llama.cpp-compatible models are available on [Hugging Face](https://huggingface.co/models?format=gguf). Filter for the **GGUF** format and prefer Unsloth quantizations where available, since they work well and are actively updated: https://huggingface.co/unsloth.
Download using the Hugging Face CLI. For example, to get the first shard of Qwen3 Coder 30B BF16 (https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
--local-dir models/qwen3-coder-30B-A3B/
```
`HF_HUB_ENABLE_HF_TRANSFER=1` enables `hf-transfer`, a Rust-based package that speeds up downloads. It must be installed separately (from [PyPI](https://pypi.org/project/hf-transfer/)).
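The example above fetches only the first of two shards. As a minimal sketch, the helper below builds a command that fetches every file in a quantization folder using the CLI's `--include` pattern option. The function name `fetch_quant` and the `RUN` switch are hypothetical, introduced here for illustration; the repository, folder, and target directory are the ones from the example above. It assumes `huggingface-cli` and `hf-transfer` are already installed.

```bash
# Hypothetical helper: build (and optionally run) a download command for
# every file under one quantization folder of a Hugging Face repository.
fetch_quant() {
  repo="$1"        # e.g. unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  quant_dir="$2"   # folder inside the repo, e.g. BF16
  local_dir="$3"   # where the files should be stored locally

  # Reuse the positional parameters to hold the command, word by word.
  set -- huggingface-cli download "$repo" \
    --include "$quant_dir/*" --local-dir "$local_dir"

  if [ "${RUN:-0}" = "1" ]; then
    # Actually download (needs network access and enough disk space).
    HF_HUB_ENABLE_HF_TRANSFER=1 "$@"
  else
    # Dry run: only print the command that would be executed.
    echo "$*"
  fi
}

# Dry run by default; prefix the call with RUN=1 to perform the download.
fetch_quant unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF BF16 models/qwen3-coder-30B-A3B
```

Printing the command first makes it easy to check the pattern before committing to a multi-gigabyte download.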
## 3. Performance Benchmarks (Key Results)
Below are some results from real `llama-bench` runs on Strix Halo hardware. For full tables and model-by-model breakdowns (including both prompt processing and token generation speeds), see docs/benchmarks.md.
@@ -116,9 +133,9 @@ Below are some results from real runs on Strix Halo hardware of `llama-bench`. F
**Takeaways:**
* **Vulkan AMDVLK** is the fastest when it works. A memory allocation issue currently causes some models to fail to load ([GitHub Issue 15054](https://github.com/ggml-org/llama.cpp/issues/15054)).
* **Vulkan RADV** is the most stable and compatible (recommended for most usage).
* **ROCm** is typically only faster on BF16 models; otherwise it is less stable and may crash or hang.
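The takeaways above amount to a simple rule of thumb, which can be sketched as a small shell function. This is only an illustration: `pick_backend` is a hypothetical name, the returned labels are not actual toolbox or image names, and the check relies on the model file name containing `BF16`, which is an assumption about naming rather than a property of the file.

```bash
# Hypothetical helper: suggest a backend from a model file name,
# following the takeaways (ROCm for BF16, Vulkan RADV otherwise).
pick_backend() {
  case "$1" in
    *BF16*|*bf16*) echo "rocm" ;;         # ROCm is typically only ahead on BF16
    *)             echo "vulkan-radv" ;;  # most stable and compatible default
  esac
}

pick_backend "Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf"
```

AMDVLK is left out of the default path because of the loading issue noted above; it may still be worth trying by hand when raw speed matters.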
## 4. Memory Planning & VRAM Estimator