Added gguf-vram-estimator.py

2025-07-31 12:39:17 +01:00
parent c6678c53d5
commit a193f367d4
5 changed files with 276 additions and 26 deletions
@@ -31,5 +31,8 @@ RUN git clean -xdf \
 RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig
 COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
 RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
 # Default to interactive shell
 CMD ["/bin/bash"]
@@ -55,5 +55,8 @@ RUN git clean -xdf \
 RUN find /opt/llama.cpp/build -type f -name 'lib*.so*' -exec cp {} /usr/lib64/ \; \
 && ldconfig
 COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
 RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
 # Default to interactive shell
 CMD ["/bin/bash"]
@@ -27,4 +27,7 @@ RUN git clean -xdf \
 && cmake --build build --config Release \
 && cmake --install build --config Release
 COPY gguf-vram-estimator.py /usr/local/bin/gguf-vram-estimator.py
 RUN chmod +x /usr/local/bin/gguf-vram-estimator.py
 CMD ["/bin/bash"]
@@ -157,6 +157,111 @@ All benchmarks performed on HP Z2 Mini G1a with 128GB RAM, using `llama-bench` w
 - For large quantized models under 64GB, either backend performs similarly
 - Avoid ROCm 7.0 beta for production workloads
 ## VRAM Planning with `gguf-vram-estimator.py`
 ### Why Model File Size Isn't the Whole Story
 A model's VRAM footprint has three main components:
 1.  **Model Weights:** The static size of the model on disk.
 2.  **Context Memory (KV Cache):** A dynamic buffer that grows linearly with the number of tokens in your context. For every token processed, a Key/Value state is stored in VRAM. This is often the largest variable consumer of memory.
 3.  **Overhead:** A semi-fixed amount of memory for compute buffers, drivers, and other scratchpads. This can be 1-3 GiB or more.
 The total VRAM required is `Model Size + Context Memory + Overhead`. To run a model, you must have enough memory for all three.
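As a rough sketch of the sum above, the snippet below adds the three components in bytes and checks the result against a memory budget. The names `estimate_total_vram` and `fits` are illustrative and not part of the utility, and the 40 GiB / 4 GiB / 64 GiB figures are made-up sample values.

```python
GIB = 1024 ** 3

def estimate_total_vram(model_bytes: int, kv_cache_bytes: int, overhead_gib: float = 2.0) -> int:
    """Return model weights + KV cache + overhead, in bytes."""
    return model_bytes + kv_cache_bytes + int(overhead_gib * GIB)

def fits(total_bytes: int, capacity_gib: float) -> bool:
    """True when the estimated total stays within the given capacity."""
    return total_bytes <= capacity_gib * GIB

# Hypothetical sample values: 40 GiB of weights and a 4 GiB KV cache
total = estimate_total_vram(40 * GIB, 4 * GIB)
print(f"{total / GIB:.2f} GiB, fits in 64 GiB: {fits(total, 64)}")
```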
 ### The `gguf-vram-estimator.py` Utility
 To help plan your workload, this repository includes the `gguf-vram-estimator.py` utility. It inspects a GGUF file and calculates the total VRAM needed for different context lengths, allowing you to make informed decisions about which model quantization and context size will fit in your system's memory.
 #### How to Use
 The script is included in the container. Run it by pointing it at a GGUF file; for a multi-part model, use the first part (`-00001-of-...`):
 ```bash
 # Syntax
 gguf-vram-estimator.py <path-to-gguf-file> [options]
 ```
 **Key Options:**
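 - `-c`, `--contexts`: A space-separated list of context sizes to calculate (e.g., `--contexts 4096 16384`). Sizes above the model's reported maximum context are skipped, and the maximum itself is always included.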
 - `--overhead`: A float value to set the estimated overhead in GiB (default: `2.0`).
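For split models, the reported model size is the sum of all parts found on disk. A minimal sketch of how the sibling file names can be derived from one part is shown below; it uses the same `-NNNNN-of-NNNNN.gguf` naming pattern as the script, while the helper name `sibling_shards` and the example path are hypothetical.

```python
import re
from typing import List

# Matches the "-00001-of-00003.gguf" suffix used by split GGUF models
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$", re.IGNORECASE)

def sibling_shards(path: str) -> List[str]:
    """Return the expected path of every part, or just `path` for a single-file model."""
    match = SHARD_RE.search(path)
    if not match:
        return [path]
    base, total = path[:match.start()], match.group(2)
    return [f"{base}-{i:05d}-of-{total}.gguf" for i in range(1, int(total) + 1)]

# Hypothetical path, for illustration only
for part in sibling_shards("models/example-Q8_0-00001-of-00003.gguf"):
    print(part)
```

Summing `os.path.getsize` over the paths that exist then gives the total weight size.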
 ### Practical Examples: Planning for a 128GB Strix Halo System
 The key to using a unified memory system is balancing model quality (quantization) against context length.
 #### Scenario 1: High Quality, Short Context (Coding & Chat)
 You need the highest precision for tasks that don't require massive context windows.
 **Goal:** Run the highest quality `Llama-4-Scout` model (`Q8_0`) with a standard 8k-16k context.
 ```bash
 gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q8_0/Llama-4-Scout-17B-16E-Instruct-Q8_0-00001-of-00003.gguf
 ```
 ```
 --- Model 'Llama-4-Scout-17B-16E-Instruct' ---
 Max Context: 10,485,760 tokens
 Model Size: 106.67 GiB (from file size)
 Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
 --- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
 ---------------------------------------------------
          4,096 |      768.00 MiB |      109.42 GiB
          8,192 |        1.50 GiB |      110.17 GiB
         16,384 |        1.88 GiB |      110.54 GiB
 ```
 **Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds another **~1.9 GiB**, and with the 2 GiB overhead the total comes to **~110.5 GiB**. This fits within a 128GB system with roughly 17 GiB to spare.
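The Context Memory column above does not double between 8,192 and 16,384 tokens because the script treats Scout as a mix of sliding-window and full-attention layers. A minimal sketch of that arithmetic follows. The layer split (36 sliding-window, 12 full, 8,192-token window) mirrors the values hard-coded in the script for Scout; the 4,096 bytes per token per layer is an assumption (8 KV heads, 128-wide keys and values, 2 bytes per element) chosen because it reproduces the table rows.

```python
MIB = 1024 ** 2

def kv_cache_bytes(n_ctx, n_layers_full, n_layers_swa, swa_window, bytes_per_token_per_layer):
    """KV cache size when sliding-window layers stop growing past their window."""
    full = n_ctx * n_layers_full * bytes_per_token_per_layer
    swa = min(n_ctx, swa_window) * n_layers_swa * bytes_per_token_per_layer
    return full + swa

# Assumed: 8 KV heads * (128 key + 128 value dims) * 2 bytes per element
BYTES_PER_TOKEN_PER_LAYER = 8 * (128 + 128) * 2

for n_ctx in (4096, 8192, 16384):
    size = kv_cache_bytes(n_ctx, 12, 36, 8192, BYTES_PER_TOKEN_PER_LAYER)
    print(f"{n_ctx:>7,} tokens -> {size / MIB:,.0f} MiB")
```

Past the window size, only the 12 full-attention layers keep growing, which is why long contexts on this model cost less than a purely linear estimate would suggest.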
 #### Scenario 2: Massive Context, Lower Precision (RAG & Document Analysis)
 You need to process a huge amount of text and are willing to trade some precision for a massive context window.
 **Goal:** Run `Llama-4-Scout` with a 1,000,000 token context. This requires a much smaller model quantization (`Q4_K_XL`).
 ```bash
 gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
 ```
 ```
 --- Model 'Llama-4-Scout-17B-16E-Instruct' ---
 Max Context: 10,485,760 tokens
 Model Size: 57.74 GiB (from file size)
 Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
 --- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
 ---------------------------------------------------
        524,288 |       25.12 GiB |       84.87 GiB
      1,048,576 |       49.12 GiB |      108.87 GiB
 ```
 **Analysis:** The `Q4_K_XL` quantization brings the model down to **57.7 GiB**. The 1M token context adds **49.1 GiB**, nearly as much as the weights themselves. With the 2 GiB overhead, the total of **~109 GiB** fits on a 128GB system with roughly 19 GiB to spare.
 #### Scenario 3: Fitting a Very Large Model
 **Goal:** Determine the maximum viable context for the huge `Qwen3-235B` model.
 ```bash
 gguf-vram-estimator.py models/qwen-3-235B-Q3_K-XL/UD-Q3_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q3_K_XL-00001-of-00003.gguf
 ```
 ```
 --- Model 'Qwen3-235B-A22B-Instruct-2507' ---
 Max Context: 262,144 tokens
 Model Size: 97.00 GiB (from file size)
 Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
 --- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
 ---------------------------------------------------
         65,536 |       11.75 GiB |      110.75 GiB
        131,072 |       23.50 GiB |      122.50 GiB
        262,144 |       47.00 GiB |      146.00 GiB
 ```
 **Analysis:** The base model takes **97 GiB**. On a 128GB system that leaves about **29 GiB** for context once the 2 GiB overhead is set aside. That is enough for a **131,072-token** context (`122.50 GiB` total). The full 262,144-token context would require `146 GiB` and will not fit.
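The headroom reasoning can also be inverted to estimate the largest context that fits. The sketch below derives the KV cost per token from the 65,536-token row of the table (11.75 GiB) and assumes the full 128 GiB is available to the GPU, which may be optimistic on a real system. `max_context_tokens` is a hypothetical helper, not an option of the utility, and it only applies to models whose cache grows linearly with context.

```python
GIB = 1024 ** 3

def max_context_tokens(capacity_gib, model_gib, overhead_gib, kv_bytes_per_token):
    """Largest token count whose KV cache fits in the memory left after weights and overhead."""
    budget = int((capacity_gib - model_gib - overhead_gib) * GIB)
    return max(budget, 0) // kv_bytes_per_token

# Per-token KV cost taken from the table: 11.75 GiB for 65,536 tokens
kv_bytes_per_token = int(11.75 * GIB) // 65536

tokens = max_context_tokens(128, 97.0, 2.0, kv_bytes_per_token)
print(f"~{tokens:,} tokens fit")
```

The result lands between the 131,072 and 262,144 rows, consistent with the table.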
 > **Key Takeaway:** This tool is essential for balancing **Model Quality (Quantization)** vs. **Context Length** to fit your specific task within your system's VRAM limits.
 ## Building Containers Locally (Optional)
 If you prefer to build the containers yourself:
@@ -221,29 +326,3 @@ amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=335544321
 sudo grub2-mkconfig -o /boot/grub2/grub.cfg
 sudo reboot
 ```
 ## Troubleshooting
 ### Common Issues
 | Issue | Solution |
 |-------|----------|
 | GPU not detected | Verify `/dev/dri` and `/dev/kfd` devices exist on host |
 | Memory errors | Check that kernel parameters are properly applied |
 | Permission denied | Ensure your user is in the `video` group |
 | ROCm crashes | Try Vulkan backend instead |
 | Slow loading (>64GB models) | Use Vulkan instead of ROCm for large models |
 ### Verify GPU Access
 ```bash
 # Check devices
 ls -la /dev/dri /dev/kfd
 # Check ROCm (in ROCm containers)
 rocm-smi
 # Check Vulkan (in Vulkan container)
 vulkaninfo --summary
 ```
@@ -0,0 +1,162 @@
 #!/usr/bin/env python3
 import sys
 import os
 import re
 import struct
 import argparse
 import math
 from typing import Dict, Any, List
 # GGUF constants
 GGUF_MAGIC = 0x46554747  # b"GGUF" read as a little-endian uint32
 GGUF_VALUE_TYPE = {
    0: "UINT8", 1: "INT8", 2: "UINT16", 3: "INT16", 4: "UINT32",
    5: "INT32", 6: "FLOAT32", 7: "BOOL", 8: "STRING", 9: "ARRAY",
    10: "UINT64", 11: "INT64", 12: "FLOAT64",
 }
 # Byte width of every fixed-size value type, keyed by GGUF type index
 GGUF_SCALAR_SIZE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
 class GGUFMetadataReader:
    """A minimal reader to get only the necessary KV metadata for cache calculation."""
    def __init__(self, path: str):
        self.path = path
        self.metadata: Dict[str, Any] = {}
    def read(self):
        with open(self.path, "rb") as f:
            self.f = f
            # Header: magic (u32), version (u32), tensor count (u64), metadata KV count (u64)
            magic, _, _, metadata_kv_count = struct.unpack("<IIQQ", self.f.read(24))
            if magic != GGUF_MAGIC: raise ValueError("Invalid GGUF magic number")
            self._read_metadata(metadata_kv_count)
        return self
    def _read_string(self) -> str:
        (length,) = struct.unpack("<Q", self.f.read(8))
        return self.f.read(length).decode("utf-8", errors="replace")
    def _read_value(self, value_type_idx: int):
        value_type = GGUF_VALUE_TYPE.get(value_type_idx)
        if not value_type: raise ValueError(f"Unknown GGUF value type: {value_type_idx}")
        if value_type == "STRING": return self._read_string()
        if value_type == "UINT32": return struct.unpack("<I", self.f.read(4))[0]
        if value_type == "INT32": return struct.unpack("<i", self.f.read(4))[0]
        if value_type == "UINT64": return struct.unpack("<Q", self.f.read(8))[0]
        if value_type == "INT64": return struct.unpack("<q", self.f.read(8))[0]
        # Other types are not needed for the estimate: skip them and return None
        self._skip_value(value_type_idx)
    def _skip_value(self, value_type_idx: int):
        # Fail loudly on unknown types: returning without seeking would desynchronize the stream
        if value_type_idx not in GGUF_VALUE_TYPE: raise ValueError(f"Unknown GGUF value type: {value_type_idx}")
        if value_type_idx in GGUF_SCALAR_SIZE: self.f.seek(GGUF_SCALAR_SIZE[value_type_idx], 1)
        elif value_type_idx == 8:  # STRING: u64 length, then the bytes
            (length,) = struct.unpack("<Q", self.f.read(8))
            self.f.seek(length, 1)
        else:  # ARRAY: element type (u32), element count (u64), then the elements
            (array_type_idx, count) = struct.unpack("<IQ", self.f.read(12))
            element_size = GGUF_SCALAR_SIZE.get(array_type_idx)
            if element_size: self.f.seek(count * element_size, 1)
            else:
                # Variable-size elements (strings, nested arrays) are skipped one by one
                for _ in range(count): self._skip_value(array_type_idx)
    def _read_metadata(self, count: int):
        keys_to_read = {"general.architecture", "general.name"}
        arch_specific_keys_added = False
        for _ in range(count):
            key = self._read_string()
            (value_type_idx,) = struct.unpack("<I", self.f.read(4))
            if not arch_specific_keys_added and "general.architecture" in self.metadata:
                prefix = self.metadata["general.architecture"]
                keys_to_read.update({
                    f"{prefix}.block_count", f"{prefix}.context_length",
                    f"{prefix}.attention.head_count_kv", f"{prefix}.attention.key_length",
                    f"{prefix}.attention.value_length", f"{prefix}.attention.sliding_window_size"
                })
                arch_specific_keys_added = True
            if key in keys_to_read:
                self.metadata[key] = self._read_value(value_type_idx)
            else:
                self._skip_value(value_type_idx)
 def get_total_model_size_from_disk(gguf_file_path: str) -> int:
    """Calculates the total model size by finding all parts on disk."""
    match = re.search(r'-(\d{5})-of-(\d{5})\.gguf$', gguf_file_path, re.IGNORECASE)
    if not match:
        return os.path.getsize(gguf_file_path)
    base_path = gguf_file_path[:match.start()]
    total_parts_str = match.group(2)
    total_parts = int(total_parts_str)
    total_size, found_parts = 0, 0
    for i in range(1, total_parts + 1):
        part_file_name = f"{base_path}-{i:05d}-of-{total_parts_str}.gguf"
        if os.path.exists(part_file_name):
            total_size += os.path.getsize(part_file_name)
            found_parts += 1
    if found_parts != total_parts:
        print(f"WARNING: Expected {total_parts} parts, found {found_parts}. Size calculation may be incomplete.", file=sys.stderr)
    return total_size
 def format_mem(size_bytes):
    mib = size_bytes / (1024 * 1024)
    if mib < 1024: return f"{mib:8.2f} MiB"
    return f"{mib / 1024:8.2f} GiB"
 def run_estimator(gguf_file: str, context_sizes: List[int], overhead_gib: float):
    try:
        reader = GGUFMetadataReader(gguf_file).read()
        metadata = reader.metadata
        prefix = metadata.get("general.architecture")
        if not prefix: raise KeyError("Could not read 'general.architecture' from model metadata.")
        model_size_bytes = get_total_model_size_from_disk(gguf_file)
        overhead_bytes = int(overhead_gib * 1024**3)
        n_layers = metadata[f"{prefix}.block_count"]
        n_head_kv = metadata[f"{prefix}.attention.head_count_kv"]
        training_context = metadata.get(f"{prefix}.context_length", 0)
        n_embd_head_k = metadata[f"{prefix}.attention.key_length"]
        n_embd_head_v = metadata[f"{prefix}.attention.value_length"]
        swa_window_size = metadata.get(f"{prefix}.attention.sliding_window_size", 0)
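        # Special case: when the metadata carries no window size, assume Llama 4 Scout has
        # 36 sliding-window layers and 12 full-attention layers with an 8192-token window.
        is_scout_model = "scout" in metadata.get("general.name", "").lower()
        if is_scout_model and swa_window_size == 0: n_layers_swa, n_layers_full, swa_window_size = 36, 12, 8192
        # Otherwise a reported window size is applied to every layer
        elif swa_window_size > 0: n_layers_swa, n_layers_full = n_layers, 0
        else: n_layers_swa, n_layers_full = 0, n_layers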
        print(f"\n--- Model '{metadata.get('general.name', 'N/A')}' ---")
        if training_context > 0: print(f"Max Context: {training_context:,} tokens")
        print(f"Model Size: {format_mem(model_size_bytes).strip()} (from file size)")
        print(f"Incl. Overhead: {overhead_gib:.2f} GiB (for compute buffer, etc. adjustable via --overhead)")
        if training_context > 0:
            # Drop sizes beyond the model's maximum context and always include the maximum itself
            context_sizes = sorted({c for c in context_sizes if c <= training_context} | {training_context})
        else: context_sizes = sorted(context_sizes)
        # K and V elements per token per layer, at 2 bytes per element (assumes an f16 KV cache)
        bytes_per_token_per_layer = n_head_kv * (n_embd_head_k + n_embd_head_v) * 2
        print("\n--- Memory Footprint Estimation ---")
        print(f"{'Context Size':>15s} | {'Context Memory':>15s} | {'Est. Total VRAM':>15s}")
        print("-" * 51)
        for n_ctx in context_sizes:
            mem_full = n_ctx * n_layers_full * bytes_per_token_per_layer
            mem_swa = min(n_ctx, swa_window_size) * n_layers_swa * bytes_per_token_per_layer
            kv_cache_bytes = mem_full + mem_swa
            total_bytes = model_size_bytes + kv_cache_bytes + overhead_bytes
            print(f"{n_ctx:>15,} | {format_mem(kv_cache_bytes):>15s} | {format_mem(total_bytes):>15s}")
    except (OSError, ValueError, struct.error, KeyError) as e:
        print(f"\nError: {e}", file=sys.stderr)
        sys.exit(1)
 def main():
    parser = argparse.ArgumentParser(
        description="Calculate VRAM requirements for a GGUF model, including a configurable overhead for compute buffers.",
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument("gguf_file", help="Path to the GGUF model file (for a multi-part model, the first part).")
    parser.add_argument("-c", "--contexts", nargs='+', type=int, default=[4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576], help="Space-separated list of context sizes to calculate.")
    parser.add_argument("--overhead", type=float, default=2.0, help="Estimated overhead in GiB for compute buffers, drivers, etc. (default: 2.0)")
    args = parser.parse_args()
    run_estimator(args.gguf_file, args.contexts, args.overhead)
 if __name__ == "__main__":
    main()