From f3a4270aabc1429db2e6168d3e8bcdcd33f876a1 Mon Sep 17 00:00:00 2001
From: Donato Capitella
Date: Thu, 31 Jul 2025 19:59:17 +0100
Subject: [PATCH] Updated reason for long context

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 2794458..bd2c457 100644
--- a/README.md
+++ b/README.md
@@ -215,7 +215,7 @@ Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
 ```
 **Analysis:** The `Q8_0` model consumes **106.7 GiB**. A 16k context adds another **~1.9 GiB**, for a total of **~111 GiB**. This fits comfortably within a 128GB system.
 
-#### Scenario 2: Massive Context, Lower Precision (RAG & Document Analysis)
+#### Scenario 2: Large Context, Lower Precision (Long Document/Data/Code Analysis, Back-and-Forth Feedback)
 
 ```bash
 gguf-vram-estimator.py models/llama-4-scout-17b-16e/Q4_K_XL/Llama-4-Scout-17B-16E-Instruct-UD-Q4_K_XL-00001-of-00002.gguf
@@ -232,7 +232,7 @@ Incl. Overhead: 2.00 GiB (for compute buffer, etc. adjustable via --overhead)
 524,288 | 25.12 GiB | 84.87 GiB
 1,048,576 | 49.12 GiB | 108.87 GiB
 ```
-**Analysis:** To enable this, we use a `Q4_K_XL` model that is only **57.7 GiB**. The 1M token context adds a massive **49.1 GiB** of memory. The total, **~109 GiB**, is a tight but achievable fit on a 128GB system.
+**Analysis:** To enable this, we use the `Q4_K_XL` quantization of Llama-4-Scout that is only **57.7 GiB**. The 1M token context adds a massive **49.1 GiB** of memory. The total, **~109 GiB**, is a tight but achievable fit on a 128GB system.
 
 #### Scenario 3: Fitting a Very Large Model
 
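
Reviewer note: the totals quoted in the two **Analysis** paragraphs appear to be the sum of model size, context memory, and the 2.00 GiB overhead. The table rows support this reading, since 84.87 - 25.12 and 108.87 - 49.12 both give 59.75 GiB, roughly the 57.7 GiB model plus 2.00 GiB of overhead. The sketch below only re-adds the figures printed in the patch as a sanity check. The helper `total_memory_gib` is a hypothetical name introduced for illustration and is not part of `gguf-vram-estimator.py`.

```python
# Hypothetical helper for illustration only; not taken from gguf-vram-estimator.py.
def total_memory_gib(model_gib: float, context_gib: float, overhead_gib: float = 2.0) -> float:
    """Add model size, context memory, and fixed overhead (all in GiB)."""
    return model_gib + context_gib + overhead_gib


# Figures as quoted in the README text touched by this patch.
scenario_1 = total_memory_gib(106.7, 1.9)    # Q8_0 model, 16k context
scenario_2 = total_memory_gib(57.7, 49.12)   # Q4_K_XL model, 1M-token context

print(f"Scenario 1 total: {scenario_1:.1f} GiB")
print(f"Scenario 2 total: {scenario_2:.1f} GiB")
```

Both sums land within rounding distance of the README's "~111 GiB" and "~109 GiB", so the reworded Analysis line stays consistent with the estimator output it describes.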