Cleanup of legacy toolboxes, removal of rocwmma, and rename of rocm7-alpha to rocm-7nightlies. Added new benchmarks.

Donato Capitella
2026-01-10 10:31:04 +00:00
parent f0e9bc8865
commit 783998589e
1155 changed files with 20997 additions and 27513 deletions
+33 -20
@@ -27,9 +27,6 @@
<button id="rocwmma-modal-open" type="button" class="chip small legend-pill legend-pill-rocwmma">
rocWMMA
</button>
-<button id="rocwmma-impr-modal-open" type="button" class="chip small legend-pill legend-pill-rocwmma-improved">
-rocWMMA-improved
-</button>
</div>
</div>
</header>
@@ -89,13 +86,19 @@
<div class="modal-content">
<button id="hipblas-modal-close" class="modal-close" aria-label="Close dialog">×</button>
<h2 id="hipblas-title">hipBLASLt &amp; hblt0 explained</h2>
-<p>The ROCm toolboxes ship with <code>ROCBLAS_USE_HIPBLASLT=1</code> by default. This forces rocBLAS to prefer
-the hipBLASLt kernel library, which historically delivered the best throughput on gfx1151 (Strix Halo).</p>
-<p>Rows tagged with <code>__hblt0</code> were re-run with <code>ROCBLAS_USE_HIPBLASLT=0</code>, letting rocBLAS
-auto-select between hipBLASLt, Tensile, or other kernel providers. These runs show how performance shifts when
+<p>The ROCm toolboxes ship with <code>ROCBLAS_USE_HIPBLASLT=1</code> by default. This forces rocBLAS to
+prefer
+the hipBLASLt kernel library, which historically delivered the best throughput on gfx1151 (Strix Halo).
+</p>
+<p>Rows tagged with <code>__hblt0</code> were re-run with <code>ROCBLAS_USE_HIPBLASLT=0</code>, letting
+rocBLAS
+auto-select between hipBLASLt, Tensile, or other kernel providers. These runs show how performance
+shifts when
the tuned hipBLASLt path is disabled.</p>
-<p>hipBLASLt is AMD's LT (low-level tuned) matmul backend, optimized for transformer workloads. Disabling it can
-expose regressions or improvements depending on driver versions, so both configurations are published for
+<p>hipBLASLt is AMD's LT (low-level tuned) matmul backend, optimized for transformer workloads. Disabling it
+can
+expose regressions or improvements depending on driver versions, so both configurations are published
+for
comparison.</p>
</div>
</div>
@@ -104,11 +107,15 @@
<div class="modal-content">
<button id="rpc-modal-close" class="modal-close" aria-label="Close dialog">×</button>
<h2 id="rpc-title">RPC · dual server</h2>
-<p>These results were produced with two Strix Halo systems (Framework Desktop + HP G1a workstation, each 128&nbsp;GB)
+<p>These results were produced with two Strix Halo systems (Framework Desktop + HP G1a workstation, each
+128&nbsp;GB)
connected over 5&nbsp;Gbps Ethernet. One runs <code>rpc-server</code> from llama.cpp; the other runs
-<code>llama-bench --rpc</code>.</p>
-<p>This setup allows distributed inference, splitting large GGUF models across both machines. The metric shows what
-you can expect when latency is limited by the network and the workload is balanced between two RPC participants.</p>
+<code>llama-bench --rpc</code>.
+</p>
+<p>This setup allows distributed inference, splitting large GGUF models across both machines. The metric
+shows what
+you can expect when latency is limited by the network and the workload is balanced between two RPC
+participants.</p>
</div>
</div>
@@ -116,21 +123,27 @@
<div class="modal-content">
<button id="rocwmma-modal-close" class="modal-close" aria-label="Close dialog">×</button>
<h2 id="rocwmma-title">rocWMMA variants</h2>
-<p>Backends labeled <code>-rocwmma</code> are rebuilt with AMD's rocWMMA library, which unlocks matrix multiply
+<p>Backends labeled <code>-rocwmma</code> are rebuilt with AMD's rocWMMA library, which unlocks matrix
+multiply
pipelines accelerated via wave matrix multiply-accumulate (WMMA) instructions.</p>
-<p>rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3 but may trade stability or memory
+<p>rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3 but may trade stability or
+memory
usage; comparing plain toolboxes against <code>-rocwmma</code> ones highlights the benefit or cost.</p>
</div>
</div>
-<div id="rocwmma-impr-modal" class="modal hidden" role="dialog" aria-modal="true" aria-labelledby="rocwmma-impr-title">
+<div id="rocwmma-impr-modal" class="modal hidden" role="dialog" aria-modal="true"
+aria-labelledby="rocwmma-impr-title">
<div class="modal-content">
<button id="rocwmma-impr-modal-close" class="modal-close" aria-label="Close dialog">×</button>
<h2 id="rocwmma-impr-title">rocWMMA-improved builds</h2>
-<p>Toolboxes tagged <code>-rocwmma-improved</code> bake in an experimental llama.cpp patch that retunes rocWMMA
+<p>Toolboxes tagged <code>-rocwmma-improved</code> bake in an experimental llama.cpp patch that retunes
+rocWMMA
kernels for long-context throughput on Strix Halo.</p>
-<p>Patch reference: <a href="https://github.com/hjc4869/llama.cpp/commit/12bb5c371bd3c647ef75e8e13de9e311edba604d"
-target="_blank" rel="noreferrer">12bb5c371bd3</a>. These builds often run faster for 32k+ contexts, but
+<p>Patch reference: <a
+href="https://github.com/hjc4869/llama.cpp/commit/12bb5c371bd3c647ef75e8e13de9e311edba604d"
+target="_blank" rel="noreferrer">12bb5c371bd3</a>. These builds often run faster for 32k+ contexts,
+but
the changes are not upstream and may be unstable.</p>
</div>
</div>
@@ -138,4 +151,4 @@
<script src="assets/index2.js" type="module"></script>
</body>
-</html>
+</html>
+16282 -24347
File diff suppressed because it is too large