2025-08-17 12:32:08 +01:00

AMD Strix Halo — llama.cpp Toolboxes (Benchmarks)

Interactive results: https://kyuz0.github.io/amd-strix-halo-toolboxes/


Benchmark methodology

  • pp512 — prompt processing throughput (tokens/sec, prefill)
  • tg128 — token generation throughput (tokens/sec, interactive)
  • Each backend is tested twice per model: with Flash Attention off (-fa 0) and on (-fa 1)
  • Winners per model/test are margin-aware; multiple winners are possible when mean±σ overlap
  • All backends are built from the same llama.cpp commit for consistency

Backends in this dataset: ROCm 7 RC + ROCWMMA + hipBLASLt, ROCm 7 RC (hipBLASLt), ROCm 7 RC (hipBLASLt OFF), ROCm 7 RC + ROCWMMA (hipBLASLt OFF), ROCm 6.4.3 (hipBLASLt), ROCm 6.4.3 (hipBLASLt OFF), ROCm 6.4.3 + ROCWMMA (hipBLASLt), ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF), Vulkan AMDVLK, Vulkan RADV

ROCm 7 hipBLASLt policy: Toolboxes ship with hipBLASLt enabled by default (ROCBLAS_USE_HIPBLASLT=1). The benchmark script also runs hipBLASLt OFF variants (-hblt0) to measure its effect.
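
As a rough illustration of the test matrix described above, the sketch below builds the two `llama-bench` invocations for one model (Flash Attention off and on) together with the hipBLASLt environment setting. The function name `bench_invocations` is hypothetical, the real benchmark script may pass additional flags, and setting `ROCBLAS_USE_HIPBLASLT=0` for the OFF variants is an assumption; only the `=1` default is stated in this document.

```python
import os


def bench_invocations(model_path, hipblaslt=True):
    """Return (argv, env) pairs for one model: Flash Attention off, then on.

    Hypothetical helper; the actual benchmark script may differ.
    """
    env = dict(os.environ)
    # ROCBLAS_USE_HIPBLASLT=1 is the documented toolbox default.
    # Using "0" for the OFF variants is an assumption.
    env["ROCBLAS_USE_HIPBLASLT"] = "1" if hipblaslt else "0"
    runs = []
    for fa in ("0", "1"):
        # -p 512 and -n 128 correspond to the pp512 and tg128 tests.
        argv = ["llama-bench", "-m", model_path,
                "-p", "512", "-n", "128", "-fa", fa]
        runs.append((argv, env))
    return runs
```

Each pair could then be executed with `subprocess.run(argv, env=env)` inside the relevant toolbox.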


Summary of current dataset (Flash Attention ON)

Placement counts

Prompt Processing (pp512)

| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 9 | 5 | 0 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 3 | 3 | 8 |
| Vulkan AMDVLK | 3 | 0 | 2 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 1 | 8 | 4 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 0 | 0 | 1 |
| Vulkan RADV | 0 | 0 | 1 |

Token Generation (tg128)

| Backend | 1st | 2nd | 3rd |
|---|---|---|---|
| Vulkan RADV | 13 | 0 | 0 |
| ROCm 6.4.3 (hipBLASLt) | 3 | 0 | 1 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1 | 4 | 3 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 1 | 2 | 4 |
| ROCm 6.4.3 (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 1 | 1 | 1 |
| ROCm 7 RC (hipBLASLt) | 1 | 0 | 4 |
| Vulkan AMDVLK | 0 | 10 | 0 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 0 | 1 | 2 |

Pairwise head-to-head wins

For any model+quant where both backends succeeded, this counts who was faster (ties when equal).

| Comparison (A vs B) | Test | A wins | B wins | Ties | Total |
|---|---|---|---|---|---|
| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | pp512 | 11 | 4 | 0 | 15 |
| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan AMDVLK | tg128 | 4 | 10 | 1 | 15 |
| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | pp512 | 14 | 2 | 0 | 16 |
| ROCm 7 RC + ROCWMMA + hipBLASLt vs Vulkan RADV | tg128 | 3 | 13 | 0 | 16 |
| Vulkan AMDVLK vs Vulkan RADV | pp512 | 13 | 2 | 0 | 15 |
| Vulkan AMDVLK vs Vulkan RADV | tg128 | 2 | 13 | 0 | 15 |
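
The head-to-head counting rule described above can be sketched as follows. This is a minimal illustration, not the script that produced the table: the function name `head_to_head` and the input shape (a dict from `(model, quant)` to mean tokens/sec per backend) are assumptions.

```python
def head_to_head(a, b):
    """Count wins between two backends on the model+quant pairs both completed.

    a, b: dict mapping (model, quant) -> mean tokens/sec (assumed input shape).
    Returns (a_wins, b_wins, ties, total).
    """
    # Only compare entries where both backends produced a result.
    shared = set(a) & set(b)
    a_wins = sum(1 for key in shared if a[key] > b[key])
    b_wins = sum(1 for key in shared if b[key] > a[key])
    # Equal throughput counts as a tie.
    ties = len(shared) - a_wins - b_wins
    return a_wins, b_wins, ties, len(shared)
```

Entries that exist for only one backend (for example a model that failed to load) are skipped, which is why the Total column differs between comparisons.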

Average ranks

Prompt Processing (pp512)

| Backend | Avg Rank (↓ is better) |
|---|---|
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 1.36 |
| Vulkan AMDVLK | 1.8 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.23 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.36 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 3.0 |
| Vulkan RADV | 3.0 |

Token Generation (tg128)

| Backend | Avg Rank (↓ is better) |
|---|---|
| Vulkan RADV | 1.0 |
| ROCm 6.4.3 (hipBLASLt) | 1.5 |
| Vulkan AMDVLK | 2.0 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 2.0 |
| ROCm 7 RC (hipBLASLt OFF) | 2.0 |
| ROCm 6.4.3 (hipBLASLt OFF) | 2.0 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 2.25 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 2.43 |
| ROCm 7 RC (hipBLASLt) | 2.6 |
| ROCm 7 RC + ROCWMMA + hipBLASLt | 2.67 |
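
The average ranks listed here are numerically consistent with a weighted mean over the top-three placement counts only, so results outside the top three do not appear to contribute. The sketch below reproduces that arithmetic. The function name `average_rank` is hypothetical, and the formula is inferred from the published numbers rather than taken from the generating script.

```python
def average_rank(first, second, third):
    """Mean placement over top-three finishes only (inferred formula)."""
    placed = first + second + third
    if placed == 0:
        # A backend that never placed has no average rank under this scheme.
        return None
    return (1 * first + 2 * second + 3 * third) / placed


# pp512, ROCm 6.4.3 + ROCWMMA (hipBLASLt): 9 firsts, 5 seconds, 0 thirds
print(round(average_rank(9, 5, 0), 2))  # 1.36
# tg128, ROCm 7 RC + ROCWMMA + hipBLASLt: 0 firsts, 1 second, 2 thirds
print(round(average_rank(0, 1, 2), 2))  # 2.67
```

One consequence of this definition is that a backend with a single third-place finish (average rank 3.0) and a backend that placed often but rarely first can end up close together, so the averages are best read alongside the placement counts.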

Analyses by feature

Impact of Flash Attention

Median % change when Flash Attention ON vs OFF, paired by model+quant, per backend:

| Backend | pp512 Δ% (median, min..max, n) | tg128 Δ% (median, min..max, n) |
|---|---|---|
| ROCm 7 RC + ROCWMMA + hipBLASLt | 8.4% (3.6..65.6), n=14 | -1.1% (-8.2..-0.3), n=14 |
| ROCm 7 RC (hipBLASLt) | -20.2% (-27.8..6.5), n=10 | -1.4% (-8.5..3.0), n=10 |
| ROCm 7 RC (hipBLASLt OFF) | -20.4% (-28.2..-16.1), n=9 | -1.9% (-8.6..0.1), n=9 |
| ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 5.8% (1.3..24.1), n=16 | -1.1% (-7.4..15.1), n=16 |
| ROCm 6.4.3 (hipBLASLt) | -19.5% (-25.7..-11.9), n=12 | -1.2% (-6.9..0.8), n=12 |
| ROCm 6.4.3 (hipBLASLt OFF) | -10.3% (-22.3..3.6), n=9 | -1.6% (-11.1..0.0), n=9 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt) | 10.9% (3.9..25.7), n=15 | -0.4% (-7.5..3.0), n=15 |
| ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 6.4% (1.8..12.3), n=10 | -0.6% (-6.5..2.3), n=10 |
| Vulkan AMDVLK | 1.1% (-45.4..20.2), n=15 | -1.5% (-28.6..0.1), n=15 |
| Vulkan RADV | 3.4% (-2.6..12.5), n=16 | 0.0% (-5.8..2.4), n=16 |
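
A paired comparison of this kind can be sketched as below: for each model+quant that has results with Flash Attention both OFF and ON, compute the percentage change, then summarize with median, min, max, and the number of pairs. The function name `paired_delta_percent` and the input shape are assumptions, and the sign convention (positive means ON is faster) is assumed as well.

```python
from statistics import median


def paired_delta_percent(baseline, variant):
    """Summarize % change of variant vs baseline over shared model+quant keys.

    baseline, variant: dict mapping (model, quant) -> mean tokens/sec
    (assumed input shape). Returns (median, min, max, n), or None when
    the two runs share no keys.
    """
    shared = set(baseline) & set(variant)
    deltas = [100.0 * (variant[k] - baseline[k]) / baseline[k] for k in shared]
    if not deltas:
        return None
    return median(deltas), min(deltas), max(deltas), len(deltas)
```

The same pairing applies to the ROCWMMA and hipBLASLt comparisons, with the baseline and variant swapped for the relevant pair of environments.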

Impact of ROCWMMA

| Context | Test | Compared Envs | Pairs | Median Δ% |
|---|---|---|---|---|
| ROCm 7 RC (hipBLASLt) | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | 16.3% |
| ROCm 7 RC (hipBLASLt) | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC (hipBLASLt) | 16 | -0.7% |
| ROCm 7 RC (hipBLASLt OFF) | pp512 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | 14.6% |
| ROCm 7 RC (hipBLASLt OFF) | tg128 | ROCm 7 RC + ROCWMMA (hipBLASLt OFF) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.7% |
| ROCm 6.4.3 (hipBLASLt) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | 17.4% |
| ROCm 6.4.3 (hipBLASLt) | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt) | 15 | -0.3% |
| ROCm 6.4.3 (hipBLASLt OFF) | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 10.2% |
| ROCm 6.4.3 (hipBLASLt OFF) | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.3% |

Impact of hipBLASLt

| Context | Test | Compared Envs | Pairs | Median Δ% |
|---|---|---|---|---|
| ROCm 7 RC (no ROCWMMA) | pp512 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.2% |
| ROCm 7 RC (no ROCWMMA) | tg128 | ROCm 7 RC (hipBLASLt) vs ROCm 7 RC (hipBLASLt OFF) | 15 | -0.1% |
| ROCm 7 RC + ROCWMMA | pp512 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 1.4% |
| ROCm 7 RC + ROCWMMA | tg128 | ROCm 7 RC + ROCWMMA + hipBLASLt vs ROCm 7 RC + ROCWMMA (hipBLASLt OFF) | 16 | 0.0% |
| ROCm 6.4.3 (no ROCWMMA) | pp512 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 155.5% |
| ROCm 6.4.3 (no ROCWMMA) | tg128 | ROCm 6.4.3 (hipBLASLt) vs ROCm 6.4.3 (hipBLASLt OFF) | 9 | 0.0% |
| ROCm 6.4.3 + ROCWMMA | pp512 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | 116.9% |
| ROCm 6.4.3 + ROCWMMA | tg128 | ROCm 6.4.3 + ROCWMMA (hipBLASLt) vs ROCm 6.4.3 + ROCWMMA (hipBLASLt OFF) | 13 | -0.0% |

Vulkan: AMDVLK vs RADV

Head-to-head wins with Flash Attention ON:

| Test | AMDVLK wins | RADV wins | Ties | Total |
|---|---|---|---|---|
| pp512 | 13 | 2 | 0 | 15 |
| tg128 | 2 | 13 | 0 | 15 |

Recommendations

  • Fastest prompt processing: ROCm 6.4.3 + ROCWMMA (hipBLASLt) (most 1st-place finishes with Flash Attention ON: 9).
  • Fastest token generation: Vulkan RADV (most 1st-place finishes with Flash Attention ON: 13).
  • Balanced choice: ROCm 6.4.3 + ROCWMMA (hipBLASLt) (near the top on both tests: average rank 1.36 on pp512 and 2.25 on tg128).

Winner calculation

A backend is counted as a winner if its mean throughput is within the best backend's pooled ± error margin for that model/test type. This treats results within measurement noise as ties instead of false losses.
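
One way to read this rule is as an interval-overlap test: a backend wins if its mean plus its own σ reaches the best backend's mean minus that backend's σ. The sketch below implements that reading. It is an assumption about how the "pooled" margin is formed, not the exact calculation used for this dataset, and the function name `margin_aware_winners` and the example figures are hypothetical.

```python
def margin_aware_winners(results):
    """Return backends whose mean±σ interval overlaps the best backend's.

    results: dict mapping backend name -> (mean, sigma) in tokens/sec.
    Interval overlap is an assumed interpretation of the pooled margin.
    """
    best = max(results, key=lambda name: results[name][0])
    best_mean, best_sigma = results[best]
    # Lowest throughput still indistinguishable from the best result.
    floor = best_mean - best_sigma
    return sorted(
        name for name, (mean, sigma) in results.items()
        if mean + sigma >= floor
    )


# Hypothetical numbers: A and B overlap within noise, C does not.
example = {"A": (100.0, 2.0), "B": (97.0, 2.0), "C": (90.0, 1.0)}
print(margin_aware_winners(example))  # ['A', 'B']
```

Under this reading, several backends can share first place for the same model and test, which is why the 1st-place counts in the placement tables can sum to more than the number of models.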