Added gguf-vram-estimator.py

This commit is contained in:
Donato Capitella
2025-07-31 12:39:17 +01:00
parent c6678c53d5
commit a193f367d4
5 changed files with 276 additions and 26 deletions
+3
@@ -31,5 +31,8 @@ RUN git clean -xdf \
RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig
COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
# Default to interactive shell
CMD ["/bin/bash"]
+3
@@ -55,5 +55,8 @@ RUN git clean -xdf \
RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig
COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
# Default to interactive shell
CMD ["/bin/bash"]
+3
@@ -27,4 +27,7 @@ RUN git clean -xdf \
 && cmake --build build --config Release \
 && cmake --install build --config Release
COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
CMD ["/bin/bash"]
+105 -26
@@ -157,6 +157,111 @@ All benchmarks performed on HP Z2 Mini G1a with 128GB RAM, using `llama-bench` w
- For large quantized models under 64GB, either backend performs similarly
- Avoid ROCm 7.0 beta for production workloads
## VRAM Planning with `gguf-vram-estimator.py`
### Why Model File Size Isn't the Whole Story
A model's VRAM footprint has three main components:
1. **Model Weights:** The static size of the model on disk.
2. **Context Memory (KV Cache):** A dynamic buffer that grows linearly with the number of tokens in your context. For every token processed, a Key/Value state is stored in VRAM. This is often the largest variable consumer of memory.
3. **Overhead:** A semi-fixed amount of memory for compute buffers, drivers, and other scratchpads. This can be 1-3 GiB or more.
The total VRAM required is `Model Size + Context Memory + Overhead`. To run a model, you must have enough memory for all three.
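As a rough illustration of the formula above, the sketch below computes the context memory for a model in which every layer uses full attention, using the same per-token arithmetic as the estimator script: KV heads × (key length + value length) × 2 bytes, which assumes a 16-bit KV cache. The function names and the model dimensions in the example are hypothetical and are not read from a real GGUF file.

```python
GIB = 1024**3

def kv_cache_bytes(n_ctx, n_layers, n_head_kv, key_len, value_len, bytes_per_elem=2):
    """KV cache size when every layer stores keys and values for all n_ctx tokens."""
    per_token_per_layer = n_head_kv * (key_len + value_len) * bytes_per_elem
    return n_ctx * n_layers * per_token_per_layer

def total_vram_bytes(model_bytes, context_bytes, overhead_bytes):
    """Model Size + Context Memory + Overhead."""
    return model_bytes + context_bytes + overhead_bytes

# Hypothetical model: 32 layers, 8 KV heads, 128-dim keys and values, 4 GiB file.
ctx = kv_cache_bytes(n_ctx=8192, n_layers=32, n_head_kv=8, key_len=128, value_len=128)
total = total_vram_bytes(4 * GIB, ctx, 2 * GIB)
print(f"context: {ctx / GIB:.2f} GiB, total: {total / GIB:.2f} GiB")
# prints: context: 1.00 GiB, total: 7.00 GiB
```

Doubling the context in this sketch doubles the context memory, while the model size and overhead stay fixed.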
### The `gguf-vram-estimator.py` Utility
To help plan your workload, this repository includes the `gguf-vram-estimator.py` utility. It inspects a GGUF file and calculates the total VRAM needed for different context lengths, allowing you to make informed decisions about which model quantization and context size will fit in your system's memory.
#### How to Use
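The script is included in the container. Point it at a GGUF file; for a multi-part model, use the first part (`-00001-of-0000N.gguf`), and the sizes of the remaining parts found on disk are added automatically: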
```bash
# Syntax
gguf-vram-estimator.py <path-to-gguf-file> [options]
```
**Key Options:**
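- `-c`, `--contexts`: A space-separated list of context sizes to calculate (e.g., `--contexts 4096 16384`). Sizes above the model's maximum context are dropped, and the maximum context itself is always included.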
- `--overhead`: A float value to set the estimated overhead in GiB (default: `2.0`).
### Practical Examples: Planning for a 128GB Strix Halo System
The key to using a unified memory system is balancing model quality (quantization) against context length.
#### Scenario 1: High Quality, Short Context (Coding & Chat)
You need the highest precision for tasks that don't require massive context windows.
**Goal:** Run the highest quality `Llama-4-Scout` model (`Q8_0`) with a standard 8k-16k context.
```bash
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf
```
```
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
Max Context: 10,485,760 tokens
Model Size: 106.67 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
--- Memory Footprint Estimation ---
Context Size | Context Memory | Est. Total VRAM
---------------------------------------------------
4,096 | 768.00 MiB | 109.42 GiB
8,192 | 1.50 GiB | 110.17 GiB
16,384 | 1.88 GiB | 110.54 GiB
```
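**Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds **~1.9 GiB** and the default overhead another 2 GiB, for a total of **~110.5 GiB**. Context memory grows by less than 0.4 GiB between 8k and 16k because the estimator caps 36 of Scout's 48 layers at an 8,192-token sliding window. The total fits within a 128GB system.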
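The context-memory column for Scout can be reproduced by hand. The sketch below uses the layer split that the estimator hard-codes for Scout (12 full-attention layers, 36 sliding-window layers, 8,192-token window). The per-layer cost of 4,096 bytes per token (8 KV heads × (128 + 128) × 2 bytes) is an assumption inferred from the table above, not a value printed by the tool, and `scout_kv_cache_bytes` is a hypothetical helper name.

```python
MIB = 1024**2

# From the estimator's Scout special case: 12 full-attention layers,
# 36 sliding-window layers, 8192-token window.
N_LAYERS_FULL, N_LAYERS_SWA, SWA_WINDOW = 12, 36, 8192
# Assumption: 8 KV heads * (128 + 128) * 2 bytes, inferred from the table.
BYTES_PER_TOKEN_PER_LAYER = 8 * (128 + 128) * 2

def scout_kv_cache_bytes(n_ctx):
    # Full-attention layers keep every token; windowed layers stop growing at the window size.
    full = n_ctx * N_LAYERS_FULL * BYTES_PER_TOKEN_PER_LAYER
    swa = min(n_ctx, SWA_WINDOW) * N_LAYERS_SWA * BYTES_PER_TOKEN_PER_LAYER
    return full + swa

for n_ctx in (4096, 8192, 16384):
    print(f"{n_ctx:>6,}: {scout_kv_cache_bytes(n_ctx) / MIB:,.0f} MiB")
```

Under these assumptions the three results are 768 MiB, 1,536 MiB (1.50 GiB), and 1,920 MiB (1.88 GiB), matching the table.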
#### Scenario 2: Massive Context, Lower Precision (RAG & Document Analysis)
You need to process a huge amount of text and are willing to trade some precision for a massive context window.
**Goal:** Run `Llama-4-Scout` with a 1,000,000 token context. This requires a much smaller model quantization (`Q4_K_XL`).
```bash
gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
```
```
--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
Max Context: 10,485,760 tokens
Model Size: 57.74 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
--- Memory Footprint Estimation ---
Context Size | Context Memory | Est. Total VRAM
---------------------------------------------------
524,288 | 25.12 GiB | 84.87 GiB
1,048,576 | 49.12 GiB | 108.87 GiB
```
**Analysis:** To enable this, we use a `Q4_K_XL` model that is only **57.7 GiB**. The 1M token context adds a massive **49.1 GiB** of memory. Including the 2 GiB overhead, the total is **~109 GiB**, which still fits on a 128GB system.
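To see why the smaller quantization is needed, the following sketch compares both Scout files at a 1,048,576-token context. It reuses the estimator's Scout layer split (12 full-attention and 36 windowed layers, 8,192-token window). The 4,096 bytes per token per layer and the 128 GiB budget are assumptions: the first is inferred from the tables, and the second treats all of the system's memory as available to the GPU, which is optimistic. The helper names are hypothetical.

```python
GIB = 1024**3

def scout_kv_cache_gib(n_ctx):
    # Layer split from the estimator's Scout special case; the per-layer
    # cost of 4096 bytes per token is an inferred assumption.
    per_layer = 4096
    return (n_ctx * 12 + min(n_ctx, 8192) * 36) * per_layer / GIB

def fits(model_gib, n_ctx, overhead_gib=2.0, budget_gib=128.0):
    """Return (estimated total in GiB, whether it stays within the budget)."""
    total = model_gib + scout_kv_cache_gib(n_ctx) + overhead_gib
    return total, total <= budget_gib

for name, size_gib in (("Q8_0", 106.67), ("UD-Q4_K_XL", 57.74)):
    total, ok = fits(size_gib, 1_048_576)
    print(f"{name}: {total:.2f} GiB -> {'fits' if ok else 'does not fit'}")
```

With the model sizes reported above, the `Q8_0` file lands well over the budget at this context length, while the `Q4_K_XL` file stays under it.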
#### Scenario 3: Fitting a Very Large Model
**Goal:** Determine the maximum viable context for the huge `Qwen3-235B` model.
```bash
gguf-vram-estimator.py models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf
```
```
--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
Max Context: 262,144 tokens
Model Size: 97.00 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
--- Memory Footprint Estimation ---
Context Size | Context Memory | Est. Total VRAM
---------------------------------------------------
65,536 | 11.75 GiB | 110.75 GiB
131,072 | 23.50 GiB | 122.50 GiB
262,144 | 47.00 GiB | 146.00 GiB
```
**Analysis:** The base model takes **97 GiB**. After the 2 GiB overhead, a 128GB system has approximately **29 GiB** of headroom for context. The 131,072-token context fits at `122.50 GiB`, while the full 262,144-token context would require `146 GiB` and fail.
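The same question can be turned around: given a memory budget, how many tokens fit? The sketch below is a back-of-the-envelope calculation, not part of the tool. It assumes context memory grows linearly (as the table rows suggest for this model), derives the per-token cost from the 65,536-token row, uses the rounded 97.00 GiB model size, and treats the full 128 GiB as usable. `max_context_tokens` is a hypothetical helper.

```python
GIB = 1024**3

def max_context_tokens(budget_bytes, model_bytes, overhead_bytes, bytes_per_token):
    """Largest token count whose KV cache fits in what is left of the budget."""
    headroom = budget_bytes - model_bytes - overhead_bytes
    return max(0, headroom // bytes_per_token)

# Per-token cost inferred from the table: 11.75 GiB for 65,536 tokens.
bytes_per_token = int(11.75 * GIB) // 65_536
limit = max_context_tokens(128 * GIB, 97 * GIB, 2 * GIB, bytes_per_token)
print(f"{bytes_per_token:,} bytes/token, about {limit:,} tokens fit")
```

Under these assumptions the limit is roughly 160k tokens, which sits between the 131,072 and 262,144 rows. In practice, leave some margin for the operating system and other processes.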
> **Key Takeaway:** This tool is essential for balancing **Model Quality (Quantization)** vs. **Context Length** to fit your specific task within your system's VRAM limits.
## Building Containers Locally (Optional)
If you prefer to build the containers yourself:
@@ -221,29 +326,3 @@ amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=335544321
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
## Troubleshooting
### Common Issues
| Issue | Solution |
|-------|----------|
| GPU not detected | Verify `/dev/dri` and `/dev/kfd` devices exist on host |
| Memory errors | Check that kernel parameters are properly applied |
| Permission denied | Ensure your user is in the `video` group |
| ROCm crashes | Try Vulkan backend instead |
| Slow loading (>64GB models) | Use Vulkan instead of ROCm for large models |
### Verify GPU Access
```bash
# Check devices
ls -la /dev/dri /dev/kfd
# Check ROCm (in ROCm containers)
rocm-smi
# Check Vulkan (in Vulkan container)
vulkaninfo --summary
```
+162
@@ -0,0 +1,162 @@
#!/usr/bin/env python3
import sys
import os
import re
import struct
import argparse
import math
from typing import Dict, Any, List
# GGUF constants
GGUF_MAGIC = 0x46554747
GGUF_VALUE_TYPE = {
    0: "UINT8", 1: "INT8", 2: "UINT16", 3: "INT16", 4: "UINT32",
    5: "INT32", 6: "FLOAT32", 7: "BOOL", 8: "STRING", 9: "ARRAY",
}
class GGUFMetadataReader:
    """A minimal reader to get only the necessary KV metadata for cache calculation."""

    def __init__(self, path: str):
        self.path = path
        self.metadata: Dict[str, Any] = {}

    def read(self):
        with open(self.path, "rb") as f:
            self.f = f
            # Header: magic (u32), version (u32), tensor count (u64), metadata KV count (u64)
            magic, _, _, metadata_kv_count = struct.unpack("<IIQQ", self.f.read(24))
            if magic != GGUF_MAGIC: raise ValueError("Invalid GGUF magic number")
            self._read_metadata(metadata_kv_count)
        return self

    def _read_string(self) -> str:
        (length,) = struct.unpack("<Q", self.f.read(8))
        return self.f.read(length).decode("utf-8", errors="replace")

    def _read_value(self, value_type_idx: int):
        value_type = GGUF_VALUE_TYPE.get(value_type_idx)
        if not value_type: raise ValueError(f"Unknown GGUF value type: {value_type_idx}")
        if value_type == "STRING": return self._read_string()
        if value_type == "UINT32": return struct.unpack("<I", self.f.read(4))[0]
        if value_type == "INT32": return struct.unpack("<i", self.f.read(4))[0]
        # Other types are not needed for the estimate: skip them (returns None).
        self._skip_value(value_type_idx)

    def _skip_value(self, value_type_idx: int):
        # 64-bit scalars (UINT64=10, INT64=11, FLOAT64=12) are not listed in
        # GGUF_VALUE_TYPE; skip their 8 bytes here so the parser stays aligned.
        if value_type_idx in (10, 11, 12):
            self.f.seek(8, 1)
            return
        value_type = GGUF_VALUE_TYPE.get(value_type_idx)
        if not value_type: return
        if value_type in ("UINT8", "INT8", "BOOL"): self.f.seek(1, 1)
        elif value_type in ("UINT16", "INT16"): self.f.seek(2, 1)
        elif value_type in ("UINT32", "INT32", "FLOAT32"): self.f.seek(4, 1)
        elif value_type == "STRING":
            (length,) = struct.unpack("<Q", self.f.read(8))
            self.f.seek(length, 1)
        elif value_type == "ARRAY":
            (array_type_idx, count) = struct.unpack("<IQ", self.f.read(12))
            # Fixed-width element sizes in bytes, keyed by GGUF type index.
            type_map = {0:1, 1:1, 2:2, 3:2, 4:4, 5:4, 6:4, 7:1, 10:8, 11:8, 12:8}
            element_size = type_map.get(array_type_idx)
            if element_size: self.f.seek(count * element_size, 1)
            else:
                # Variable-width elements (strings, nested arrays): skip one at a time.
                for _ in range(count): self._skip_value(array_type_idx)

    def _read_metadata(self, count: int):
        keys_to_read = {"general.architecture", "general.name"}
        arch_specific_keys_added = False
        for _ in range(count):
            key = self._read_string()
            (value_type_idx,) = struct.unpack("<I", self.f.read(4))
            # Once the architecture is known, start collecting its prefixed keys.
            if not arch_specific_keys_added and "general.architecture" in self.metadata:
                prefix = self.metadata["general.architecture"]
                keys_to_read.update({
                    f"{prefix}.block_count", f"{prefix}.context_length",
                    f"{prefix}.attention.head_count_kv", f"{prefix}.attention.key_length",
                    f"{prefix}.attention.value_length", f"{prefix}.attention.sliding_window_size"
                })
                arch_specific_keys_added = True
            if key in keys_to_read:
                self.metadata[key] = self._read_value(value_type_idx)
            else:
                self._skip_value(value_type_idx)
def get_total_model_size_from_disk(gguf_file_path: str) -> int:
    """Calculates the total model size by finding all parts on disk."""
    match = re.search(r'-(\d{5})-of-(\d{5})\.gguf$', gguf_file_path, re.IGNORECASE)
    if not match:
        return os.path.getsize(gguf_file_path)
    base_path = gguf_file_path[:match.start()]
    total_parts_str = match.group(2)
    total_parts = int(total_parts_str)
    total_size, found_parts = 0, 0
    for i in range(1, total_parts + 1):
        part_file_name = f"{base_path}-{i:05d}-of-{total_parts_str}.gguf"
        if os.path.exists(part_file_name):
            total_size += os.path.getsize(part_file_name)
            found_parts += 1
    if found_parts != total_parts:
        print(f"WARNING: Expected {total_parts} parts, found {found_parts}. Size calculation may be incomplete.", file=sys.stderr)
    return total_size
def format_mem(size_bytes):
    mib = size_bytes / (1024 * 1024)
    if mib < 1024: return f"{mib:8.2f} MiB"
    return f"{mib / 1024:8.2f} GiB"
def run_estimator(gguf_file: str, context_sizes: List[int], overhead_gib: float):
    try:
        reader = GGUFMetadataReader(gguf_file).read()
        metadata = reader.metadata
        prefix = metadata.get("general.architecture")
        if not prefix: raise KeyError("Could not read 'general.architecture' from model metadata.")
        model_size_bytes = get_total_model_size_from_disk(gguf_file)
        overhead_bytes = int(overhead_gib * 1024**3)
        n_layers = metadata[f"{prefix}.block_count"]
        n_head_kv = metadata[f"{prefix}.attention.head_count_kv"]
        training_context = metadata.get(f"{prefix}.context_length", 0)
        n_embd_head_k = metadata[f"{prefix}.attention.key_length"]
        n_embd_head_v = metadata[f"{prefix}.attention.value_length"]
        swa_window_size = metadata.get(f"{prefix}.attention.sliding_window_size", 0)
        # Scout special case: hard-coded split of windowed and full-attention layers.
        is_scout_model = "scout" in metadata.get("general.name", "").lower()
        if is_scout_model and swa_window_size == 0: n_layers_swa, n_layers_full, swa_window_size = 36, 12, 8192
        elif swa_window_size > 0: n_layers_swa, n_layers_full = n_layers, 0
        else: n_layers_swa, n_layers_full = 0, n_layers
        print(f"\n--- Model '{metadata.get('general.name', 'N/A')}' ---")
        if training_context > 0: print(f"Max Context: {training_context:,} tokens")
        print(f"Model Size: {format_mem(model_size_bytes).strip()} (from file size)")
        print(f"Incl. Overhead: {overhead_gib:.2f} GiB (for compute buffer, etc. adjustable via --overhead)")
        # Keep requested sizes up to the trained maximum and always include the maximum itself.
        if training_context > 0:
            context_sizes = sorted({c for c in context_sizes if c <= training_context} | {training_context})
        else: context_sizes = sorted(context_sizes)
        # Keys plus values for every KV head, 2 bytes per element (16-bit cache).
        bytes_per_token_per_layer = n_head_kv * (n_embd_head_k + n_embd_head_v) * 2
        print("\n--- Memory Footprint Estimation ---")
        print(f"{'Context Size':>15s} | {'Context Memory':>15s} | {'Est. Total VRAM':>15s}")
        print("-" * 51)
        for n_ctx in context_sizes:
            mem_full = n_ctx * n_layers_full * bytes_per_token_per_layer
            # Sliding-window layers never hold more than the window size.
            mem_swa = min(n_ctx, swa_window_size) * n_layers_swa * bytes_per_token_per_layer
            kv_cache_bytes = mem_full + mem_swa
            total_bytes = model_size_bytes + kv_cache_bytes + overhead_bytes
            print(f"{n_ctx:>15,} | {format_mem(kv_cache_bytes):>15s} | {format_mem(total_bytes):>15s}")
    except (FileNotFoundError, ValueError, struct.error, NotImplementedError, KeyError) as e:
        print(f"\nError: {e}", file=sys.stderr)
        sys.exit(1)
def main():
    parser = argparse.ArgumentParser(
        description="Calculate VRAM requirements for a GGUF model, including a configurable overhead for compute buffers.",
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("gguf_file", help="Path to the GGUF model file (for a multi-part model, the first part).")
    parser.add_argument("-c", "--contexts", nargs='+', type=int, default=[4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576], help="Space-separated list of context sizes to calculate.")
    parser.add_argument("--overhead", type=float, default=2.0, help="Estimated overhead in GiB for compute buffers, drivers, etc. (default: 2.0)")
    args = parser.parse_args()
    run_estimator(args.gguf_file, args.contexts, args.overhead)

if __name__ == "__main__":
    main()