Added gguf-vram-estimator.py

2025-07-31 12:39:17 +01:00
parent c6678c53d5
commit a193f367d4
5 changed files with 276 additions and 26 deletions
@@ -31,5 +31,8 @@ RUN git clean -xdf \
 RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig

+COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
+RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
+
 # Default to interactive shell
 CMD ["/bin/bash"]
@@ -55,5 +55,8 @@ RUN git clean -xdf \
 RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig

+COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
+RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
+
 # Default to interactive shell
 CMD ["/bin/bash"]
@@ -27,4 +27,7 @@ RUN git clean -xdf \
 && cmake --build build --config Release \
 && cmake --install build --config Release

+COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
+RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
+
 CMD ["/bin/bash"]
@@ -157,6 +157,111 @@ All benchmarks performed on HP Z2 Mini G1a with 128GB RAM, using `llama-bench` w
 - For large quantized models under 64GB, either backend performs similarly
 - Avoid ROCm 7.0 beta for production workloads

+
+## VRAM Planning with `gguf-vram-estimator.py`
+
+### Why Model File Size Isn't the Whole Story
+
+A model's VRAM footprint has three main components:
+
+1.  **Model Weights:** The static size of the model on disk.
+2.  **Context Memory (KV Cache):** A buffer that grows with the number of tokens in your context. Each token's Key/Value state is stored in VRAM for every layer, so growth is linear for full-attention layers and capped at the window size for sliding-window layers. This is often the largest variable consumer of memory.
+3.  **Overhead:** A semi-fixed amount of memory for compute buffers, drivers, and other scratchpads. This can be 1-3 GiB or more.
+
+The total VRAM required is `Model Size + Context Memory + Overhead`. To run a model, you must have enough memory for all three.
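
The sum above can be sketched in a few lines of Python. This is an illustration only: the weight size, layer count, KV head count, and head dimensions are made-up values rather than numbers read from a real model, and the sketch assumes full attention on every layer with a 16-bit cache.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_ctx, n_layers, n_head_kv, head_dim_k, head_dim_v, bytes_per_elem=2):
    """KV cache size, assuming full attention on every layer and a 16-bit cache."""
    return n_ctx * n_layers * n_head_kv * (head_dim_k + head_dim_v) * bytes_per_elem

def total_vram_bytes(model_bytes, n_ctx, overhead_gib=2.0, **shape):
    """Model Size + Context Memory + Overhead, all in bytes."""
    return model_bytes + kv_cache_bytes(n_ctx, **shape) + int(overhead_gib * GIB)

# Hypothetical model: 40 GiB of weights, 32 layers, 8 KV heads, 128-dim heads.
shape = dict(n_layers=32, n_head_kv=8, head_dim_k=128, head_dim_v=128)
total = total_vram_bytes(40 * GIB, n_ctx=8192, **shape)
print(f"{total / GIB:.2f} GiB")  # 40 GiB weights + 1 GiB cache + 2 GiB overhead
```

With these assumed numbers the cache costs 128 KiB per token, so an 8,192-token context adds exactly 1 GiB on top of the weights and overhead.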
+
+### The `gguf-vram-estimator.py` Utility
+
+To help plan your workload, this repository includes the `gguf-vram-estimator.py` utility. It inspects a GGUF file and calculates the total VRAM needed for different context lengths, allowing you to make informed decisions about which model quantization and context size will fit in your system's memory.
+
+#### How to Use
+
+The script is included in the container. Point it at a GGUF file; for a multi-part model, use the first part (`-00001-of-...`), and the sizes of the remaining parts are added automatically:
+
+```bash
+# Syntax
+gguf-vram-estimator.py <path-to-gguf-file> [options]
+```
+**Key Options:**
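+- `--contexts` (or `-c`): A space-separated list of context sizes to calculate (e.g., `--contexts 4096 16384`). Sizes above the model's maximum context are skipped, and the maximum itself is always listed.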
+- `--overhead`: A float value to set the estimated overhead in GiB (default: `2.0`).
+
+### Practical Examples: Planning for a 128GB Strix Halo System
+
+The key to using a unified memory system is balancing model quality (quantization) against context length.
+
+#### Scenario 1: High Quality, Short Context (Coding & Chat)
+
+You need the highest precision for tasks that don't require massive context windows.
+
+**Goal:** Run the highest quality `Llama-4-Scout` model (`Q8_0`) with a standard 8k-16k context.
+
+```bash
+gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf
+```
+```
+--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
+Max Context: 10,485,760 tokens
+Model Size: 106.67 GiB (from file size)
+Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
+
+--- Memory Footprint Estimation ---
+   Context Size |  Context Memory | Est. Total VRAM
+---------------------------------------------------
+          4,096 |      768.00 MiB |      109.42 GiB
+          8,192 |        1.50 GiB |      110.17 GiB
+         16,384 |        1.88 GiB |      110.54 GiB
+```
+**Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds **~1.9 GiB** of KV cache, and with the 2 GiB overhead the total comes to **~110.5 GiB**. This fits within a 128GB system with about 17 GiB to spare.
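
The context memory in the table does not double between 8k and 16k. A minimal sketch of why, using the layer split that the estimator hard-codes for Scout models (36 windowed layers, 12 full-attention layers, an 8192-token window). The per-layer cost of 4096 bytes per token is inferred from the 4,096-token row (768 MiB across 48 layers), not read from the model file:

```python
MIB = 1024 ** 2

def hybrid_kv_bytes(n_ctx, bytes_per_token_per_layer, n_layers_full, n_layers_swa, window):
    """KV cache for a mix of full-attention and sliding-window layers."""
    full = n_ctx * n_layers_full * bytes_per_token_per_layer
    # Windowed layers never hold more than `window` tokens.
    swa = min(n_ctx, window) * n_layers_swa * bytes_per_token_per_layer
    return full + swa

# Inferred from the table: 768 MiB / (4096 tokens * 48 layers) = 4096 bytes.
per_layer = 768 * MIB // (4096 * 48)

for n_ctx in (4096, 8192, 16384):
    mib = hybrid_kv_bytes(n_ctx, per_layer, 12, 36, 8192) / MIB
    print(f"{n_ctx:>6} tokens: {mib:.0f} MiB")
```

Past 8,192 tokens only the 12 full-attention layers keep growing, which is why the 16k row lands at 1920 MiB (1.88 GiB) rather than 3 GiB.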
+
+#### Scenario 2: Massive Context, Lower Precision (RAG & Document Analysis)
+
+You need to process a huge amount of text and are willing to trade some precision for a massive context window.
+
+**Goal:** Run `Llama-4-Scout` with a 1,000,000 token context. This requires a much smaller model quantization (`Q4_K_XL`).
+
+```bash
+gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
+```
+```
+--- Model 'Llama-4-Scout-17B-16E-Instruct' ---
+Max Context: 10,485,760 tokens
+Model Size: 57.74 GiB (from file size)
+Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
+
+--- Memory Footprint Estimation ---
+   Context Size |  Context Memory | Est. Total VRAM
+---------------------------------------------------
+        524,288 |       25.12 GiB |       84.87 GiB
+      1,048,576 |       49.12 GiB |      108.87 GiB
+```
+**Analysis:** The `Q4_K_XL` quantization shrinks the model to **57.7 GiB**, which frees room for the cache. The 1M token context adds **49.1 GiB**, bringing the total to **~109 GiB** including overhead. That still fits on a 128GB system, with a margin similar to Scenario 1.
+
+#### Scenario 3: Fitting a Very Large Model
+
+**Goal:** Determine the maximum viable context for the huge `Qwen3-235B` model.
+
+```bash
+gguf-vram-estimator.py models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf
+```
+```
+--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
+Max Context: 262,144 tokens
+Model Size: 97.00 GiB (from file size)
+Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
+
+--- Memory Footprint Estimation ---
+   Context Size |  Context Memory | Est. Total VRAM
+---------------------------------------------------
+         65,536 |       11.75 GiB |      110.75 GiB
+        131,072 |       23.50 GiB |      122.50 GiB
+        262,144 |       47.00 GiB |      146.00 GiB
+```
+**Analysis:** The base model takes **97 GiB**, leaving roughly **29 GiB** on a 128GB system after the 2 GiB overhead. A context of **~131k tokens** (23.5 GiB of KV cache, 122.5 GiB total) still fits. The full 262k context would require `146 GiB` and will not fit.
+
+> **Key Takeaway:** This tool is essential for balancing **Model Quality (Quantization)** vs. **Context Length** to fit your specific task within your system's VRAM limits.
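
The same arithmetic can be run in reverse to ask how many tokens fit in a given budget. The sketch below is not part of the estimator. It assumes the cache grows linearly (full attention on every layer) and that the whole 128 GiB budget is available to the GPU, which depends on kernel settings and on what else is running. The per-token cost is derived from the Scenario 3 table:

```python
GIB = 1024 ** 3

def max_context_tokens(budget_gib, model_gib, overhead_gib, kv_bytes_per_token):
    """Largest context whose KV cache fits in what is left after weights and overhead."""
    free_bytes = int((budget_gib - model_gib - overhead_gib) * GIB)
    return max(free_bytes // kv_bytes_per_token, 0)

# Per-token KV cost implied by the Scenario 3 table: 11.75 GiB for 65,536 tokens.
kv_per_token = int(11.75 * GIB) // 65_536

print(kv_per_token)                                       # bytes of cache per token
print(max_context_tokens(128, 97.0, 2.0, kv_per_token))   # tokens that fit in 128 GiB
```

Under these assumptions the limit is about 161k tokens, between the 131,072 and 262,144 rows of the table. In practice, pick a value below the computed limit to leave room for the OS.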
+
+
 ## Building Containers Locally (Optional)

 If you prefer to build the containers yourself:
@@ -221,29 +326,3 @@ amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=335544321
 sudo grub2-mkconfig -o /boot/grub2/grub.cfg
 sudo reboot
 ```
-
-## Troubleshooting
-
-### Common Issues
-
-| Issue | Solution |
-|-------|----------|
-| GPU not detected | Verify `/dev/dri` and `/dev/kfd` devices exist on host |
-| Memory errors | Check that kernel parameters are properly applied |
-| Permission denied | Ensure your user is in the `video` group |
-| ROCm crashes | Try Vulkan backend instead |
-| Slow loading (>64GB models) | Use Vulkan instead of ROCm for large models |
-
-### Verify GPU Access
-
-```bash
-# Check devices
-ls -la /dev/dri /dev/kfd
-
-# Check ROCm (in ROCm containers)
-rocm-smi
-
-# Check Vulkan (in Vulkan container)
-vulkaninfo --summary
-```
-
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+import sys
+import os
+import re
+import struct
+import argparse
+import math
+from typing import Dict, Any, List
+
+# GGUF constants
+GGUF_MAGIC = 0x46554747
+GGUF_VALUE_TYPE = {
+    0: "UINT8", 1: "INT8", 2: "UINT16", 3: "INT16", 4: "UINT32",
+    5: "INT32", 6: "FLOAT32", 7: "BOOL", 8: "STRING", 9: "ARRAY",
+}
+
+class GGUFMetadataReader:
+    """A minimal reader to get only the necessary KV metadata for cache calculation."""
+    def __init__(self, path: str):
+        self.path = path
+        self.metadata: Dict[str, Any] = {}
+
+    def read(self):
+        with open(self.path, "rb") as f:
+            self.f = f
+            magic, _, _, metadata_kv_count = struct.unpack("<IIQQ", self.f.read(24))
+            if magic != GGUF_MAGIC: raise ValueError("Invalid GGUF magic number")
+            self._read_metadata(metadata_kv_count)
+        return self
+
+    def _read_string(self) -> str:
+        (length,) = struct.unpack("<Q", self.f.read(8))
+        return self.f.read(length).decode("utf-8", errors="replace")
+
+    def _read_value(self, value_type_idx: int):
+        value_type = GGUF_VALUE_TYPE.get(value_type_idx)
+        if not value_type: raise ValueError(f"Unknown GGUF value type: {value_type_idx}")
+        if value_type == "STRING": return self._read_string()
+        if value_type == "UINT32": return struct.unpack("<I", self.f.read(4))[0]
+        if value_type == "INT32": return struct.unpack("<i", self.f.read(4))[0]
+        self._skip_value(value_type_idx)
+
+    def _skip_value(self, value_type_idx: int):
+        # Byte widths of fixed-size types, keyed by GGUF type index.
+        # Indices 10-12 are the 64-bit types (UINT64, INT64, FLOAT64).
+        fixed_sizes = {0:1, 1:1, 2:2, 3:2, 4:4, 5:4, 6:4, 7:1, 10:8, 11:8, 12:8}
+        if value_type_idx in fixed_sizes:
+            self.f.seek(fixed_sizes[value_type_idx], 1)
+        elif value_type_idx == 8:  # STRING: uint64 length, then the bytes
+            (length,) = struct.unpack("<Q", self.f.read(8))
+            self.f.seek(length, 1)
+        elif value_type_idx == 9:  # ARRAY: element type, count, then the elements
+            (array_type_idx, count) = struct.unpack("<IQ", self.f.read(12))
+            if array_type_idx in fixed_sizes:
+                self.f.seek(count * fixed_sizes[array_type_idx], 1)
+            else:
+                # Variable-size elements (strings, nested arrays) are skipped one by one.
+                for _ in range(count): self._skip_value(array_type_idx)
+        else:
+            # Failing loudly beats silently misaligning every later read.
+            raise ValueError(f"Unknown GGUF value type: {value_type_idx}")
+
+    def _read_metadata(self, count: int):
+        keys_to_read = {"general.architecture", "general.name"}
+        arch_specific_keys_added = False
+        for _ in range(count):
+            key = self._read_string()
+            (value_type_idx,) = struct.unpack("<I", self.f.read(4))
+            if not arch_specific_keys_added and "general.architecture" in self.metadata:
+                prefix = self.metadata["general.architecture"]
+                keys_to_read.update({
+                    f"{prefix}.block_count", f"{prefix}.context_length",
+                    f"{prefix}.attention.head_count_kv", f"{prefix}.attention.key_length",
+                    f"{prefix}.attention.value_length", f"{prefix}.attention.sliding_window_size"
+                })
+                arch_specific_keys_added = True
+            if key in keys_to_read:
+                self.metadata[key] = self._read_value(value_type_idx)
+            else:
+                self._skip_value(value_type_idx)
+
+def get_total_model_size_from_disk(gguf_file_path: str) -> int:
+    """Calculates the total model size by finding all parts on disk."""
+    match = re.search(r'-(\d{5})-of-(\d{5})\.gguf$', gguf_file_path, re.IGNORECASE)
+    if not match:
+        return os.path.getsize(gguf_file_path)
+
+    base_path = gguf_file_path[:match.start()]
+    total_parts_str = match.group(2)
+    total_parts = int(total_parts_str)
+    total_size, found_parts = 0, 0
+    for i in range(1, total_parts + 1):
+        part_file_name = f"{base_path}-{i:05d}-of-{total_parts_str}.gguf"
+        if os.path.exists(part_file_name):
+            total_size += os.path.getsize(part_file_name)
+            found_parts += 1
+    if found_parts != total_parts:
+        print(f"WARNING: Expected {total_parts} parts, found {found_parts}. Size calculation may be incomplete.", file=sys.stderr)
+    return total_size
+
+def format_mem(size_bytes):
+    mib = size_bytes / (1024 * 1024)
+    if mib < 1024: return f"{mib:8.2f} MiB"
+    return f"{mib / 1024:8.2f} GiB"
+
+def run_estimator(gguf_file: str, context_sizes: List[int], overhead_gib: float):
+    try:
+        reader = GGUFMetadataReader(gguf_file).read()
+        metadata = reader.metadata
+        prefix = metadata.get("general.architecture")
+        if not prefix: raise KeyError("Could not read 'general.architecture' from model metadata.")
+        
+        model_size_bytes = get_total_model_size_from_disk(gguf_file)
+        overhead_bytes = int(overhead_gib * 1024**3)
+
+        n_layers = metadata[f"{prefix}.block_count"]
+        n_head_kv = metadata[f"{prefix}.attention.head_count_kv"]
+        training_context = metadata.get(f"{prefix}.context_length", 0)
+        n_embd_head_k = metadata[f"{prefix}.attention.key_length"]
+        n_embd_head_v = metadata[f"{prefix}.attention.value_length"]
+        swa_window_size = metadata.get(f"{prefix}.attention.sliding_window_size", 0)
+        
+        is_scout_model = "scout" in metadata.get("general.name", "").lower()
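+        # Scout models without a sliding-window key get a hard-coded layout:
+        # 36 windowed layers, 12 full-attention layers, 8192-token window.
+        if is_scout_model and swa_window_size == 0:
+            n_layers_swa, n_layers_full, swa_window_size = 36, 12, 8192
+        elif swa_window_size > 0:
+            # A declared window is assumed to apply to every layer.
+            n_layers_swa, n_layers_full = n_layers, 0
+        else:
+            n_layers_swa, n_layers_full = 0, n_layers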
+
+        print(f"\n--- Model '{metadata.get('general.name', 'N/A')}' ---")
+        if training_context > 0: print(f"Max Context: {training_context:,} tokens")
+        print(f"Model Size: {format_mem(model_size_bytes).strip()} (from file size)")
+        print(f"Incl. Overhead: {overhead_gib:.2f} GiB (for compute buffer, etc. adjustable via --overhead)")
+        
+        if training_context > 0:
+            # Drop sizes beyond the model's maximum context and always include the maximum itself.
+            context_sizes = sorted({c for c in context_sizes if c <= training_context} | {training_context})
+        else: context_sizes = sorted(context_sizes)
+        
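+        # K and V entries per token per layer; the factor 2 assumes a 16-bit (f16) cache, i.e. 2 bytes per element.
+        bytes_per_token_per_layer = n_head_kv * (n_embd_head_k + n_embd_head_v) * 2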
+        
+        print("\n--- Memory Footprint Estimation ---")
+        print(f"{'Context Size':>15s} | {'Context Memory':>15s} | {'Est. Total VRAM':>15s}")
+        print("-" * 51)
+        for n_ctx in context_sizes:
+            mem_full = n_ctx * n_layers_full * bytes_per_token_per_layer
+            mem_swa = min(n_ctx, swa_window_size) * n_layers_swa * bytes_per_token_per_layer
+            kv_cache_bytes = mem_full + mem_swa
+            total_bytes = model_size_bytes + kv_cache_bytes + overhead_bytes
+            print(f"{n_ctx:>15,} | {format_mem(kv_cache_bytes):>15s} | {format_mem(total_bytes):>15s}")
+            
+    except (FileNotFoundError, ValueError, struct.error, NotImplementedError, KeyError) as e:
+        print(f"\nError: {e}", file=sys.stderr)
+        sys.exit(1)
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Calculate VRAM requirements for a GGUF model, including a configurable overhead for compute buffers.",
+        formatter_class=argparse.RawTextHelpFormatter
+    )
+    parser.add_argument("gguf_file", help="Path to the GGUF model file (for a multi-part model, the first part, which carries the metadata).")
+    parser.add_argument("-c", "--contexts", nargs='+', type=int, default=[4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576], help="Space-separated list of context sizes to calculate.")
+    parser.add_argument("--overhead", type=float, default=2.0, help="Estimated overhead in GiB for compute buffers, drivers, etc. (default: 2.0)")
+    args = parser.parse_args()
+    run_estimator(args.gguf_file, args.contexts, args.overhead)
+
+if __name__ == "__main__":
+    main()