llama-server Vulkan vs vLLM ROCm benchmark on AMD RX 9070 XT (RDNA4/gfx1201) — 62 vs 48 t/s https://digtvbg.com/

Shell 100%

Find a file

ivangotoy c8f7c45e40 update repo governance		2026-04-12 13:42:00 +03:00
llm-bench.sh	Add llm-bench.sh for benchmarking LLM services	2026-03-22 11:29:11 +02:00
README.md	update repo governance	2026-04-12 13:42:00 +03:00

README.md

⚠️ Mirror. Primary repository: git.digtvbg.com Development, issues, and PRs happen there. The GitHub repo is read-only.

llama-server vs vLLM on AMD RX 9070 XT (RDNA4)

Benchmark comparing llama-server (Vulkan backend) vs vLLM (ROCm 7.2) on AMD Radeon RX 9070 XT (gfx1201 / RDNA4 architecture).

TL;DR: llama-server Vulkan delivers 62 t/s vs vLLM ROCm 48 t/s on the same hardware — 29% faster — due to missing native RDNA4 kernel support in vLLM at time of testing.

Hardware & Software

Component	Details
GPU	AMD Radeon RX 9070 XT (gfx1201 / RDNA4)
Vulkan	1.4.341.0
Mesa	mesa-git 26.1.0_devel.219852.59fc5ae7c1b-1 (RADV)
ROCm	7.2.0
OS	CachyOS (Arch-based)

Model

Backend	Model
llama-server	`Qwen3.5-9B-UD-Q6_K_XL.gguf`
vLLM	`Qwen3.5-9B-FP8`

Note: llama-server runs GGUF Q6_K quantization via Vulkan. vLLM runs FP8 quantization via ROCm — theoretically the more optimized path for AI accelerators.

Launch Flags

llama-server (Vulkan)

llama-server \
  --model ~/models/qwen35-9b/Qwen3.5-9B-UD-Q6_K_XL.gguf \
  --alias "qwen3" \
  --n-gpu-layers 999 \
  -c 65536 \
  --batch-size 2048 \
  --ubatch-size 2048 \
  --parallel 1 \
  --prio 2 \
  --host 0.0.0.0 \
  --port 8081 \
  --reasoning-budget 0 \
  --presence-penalty 1.5 \
  --metrics \
  --cache-reuse 256

vLLM (ROCm 7.2)

vllm serve ~/models/Qwen3.5-9B-FP8 \
  --served-model-name qwen3 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-prefix-caching \
  --language-model-only \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --port 8081 \
  --host 0.0.0.0

Benchmark Methodology

Benchmark script: llm-bench.sh

Runs: 10 measured + 2 warmup (warmup discarded)
Max tokens: 512
Temperature: 0
Seed: 42
Prompt: Technical Linux kernel memory management question (consistent across both backends)
Metric: tokens/second (completion tokens / elapsed time)

Results

Backend	Avg t/s	Notes
llama-server (Vulkan)	62 t/s	GGUF Q6_K, Vulkan backend
vLLM (ROCm 7.2)	48 t/s	FP8, ROCm — missing gfx1201 kernel support

llama-server Vulkan is 29% faster on this hardware.

Root Cause

vLLM on RDNA4 (gfx1201) silently falls back to FP32 dequantization for all FP8 operations — completely bypassing the hardware's 128 AI accelerators. This happens because gfx1201 is not yet in vLLM's platform detection, so it never follows the optimized FP8 Triton kernel path.

The community workaround requires patching vllm/platforms/rocm.py to add gfx1201 to the on_mi3xx() detection and providing RDNA4-specific FP8 kernel config files.

Open upstream issue: vllm-project/vllm#28649

Conclusion

If you just bought an RX 9070 XT and your vLLM inference feels underperforming — it's not your hardware, it's the backend. Until RDNA4 support lands in vLLM mainline, llama-server with Vulkan is the better choice for local inference on this GPU.

ROCm/vLLM will catch up. The community is actively working on it.