[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs by yichiche · Pull Request #3974 · ROCm/aiter

yichiche · 2026-06-28T06:51:05Z

Motivation

Add a16w16 (bf16) GEMM tuning configs for Qwen3.5-397B-A17B-MoE-MXFP4 (TP2). In this MXFP4 MoE model the experts run on the fp4 fused-MoE path; the remaining dense bf16 GEMMs (attention q/k/v/o, router gate, KV proj, and the MTP/EAGLE draft-layer projections) go through the a16w16 path and benefit from per-shape kernel tuning.

Technical Details

Adds two files under aiter/configs/model_configs/:

qwen3_5_397b_untuned_gemm.csv — 188 shapes to tune
qwen3_5_397b_bf16_tuned_gemm.csv — 187 tuned shapes (gfx950, cu_num=256)

Shape selection (TP2):

7 target bf16 GEMM shapes: 10240×4096, 8704×4096, 4096×4096, 4096×512, 1024×4096, 512×4096, 64×4096.
MTP / EAGLE draft-layer additions: q_proj 8192×4096 (output-gated), fc fusion / o_proj 4096×8192 (K=8192), sharded kv 256×4096.
Two 4096×512 shapes at M=1024/4096 were omitted because they already exist in kimi_bf16_tuned_gemm.csv (the repo enforces one owner per shape across bf16 configs; qwen reuses kimi's entry at runtime on the same gfx950).
M grid sized for decode + EAGLE verify (draft runs at 1×bs, target verify at num_draft_tokens×bs): 1,2,4,8,16,32,48,64,96,128,192,256,384,512,768,1024 (multiples of the runtime's 16/32 getPaddedM lookup granularity), plus a prefill tail 2048,4096,8192. 16384/32768 intentionally dropped (the gl=1 fallback caps padded_M at 8192 for N>4096).

Tuned with csrc/gemm_a16w16/gemm_a16w16_tune.py, --libtype asm,opus,flydsl,triton,torch,skinny --shape_grouped. hipBLASLt is excluded (opt-in --with-hipblaslt not used). flydsl wins ~68% of shapes.

Test Plan

End-to-end before/after serving sweep on Qwen3.5-397B-A17B-MoE-MXFP4 (TP2, --attention-backend aiter, AITER_FLYDSL_FORCE=1) via sglang.bench_serving, dataset=random, IL=8192 / OL=1024, concurrency 4→256 (num_prompts = conc×10). before = run without these tuned configs (default kernels); after = with the tuned configs.

Test Result

Before → After (tuned), IL=8192 / OL=1024:

conc	total tok/s	Δ	median e2e (ms)	Δ	p99 e2e	TPOT (ms)	Δ
4	3380 → 3481	+3.0%	9684 → 9405	−2.9%	−2.9%	10.2 → 9.8	−3.1%
8	5160 → 5302	+2.8%	12601 → 12249	−2.8%	−2.5%	13.3 → 12.9	−2.8%
16	7697 → 7935	+3.1%	17082 → 16587	−2.9%	−2.0%	17.6 → 17.1	−3.0%
32	10131 → 10333	+2.0%	25918 → 25506	−1.6%	−1.7%	27.2 → 26.7	−2.0%
64	12781 → 13068	+2.2%	41286 → 40390	−2.2%	−2.4%	43.1 → 42.2	−2.2%
128	15929 → 16141	+1.3%	66502 → 65600	−1.4%	−0.9%	69.4 → 68.5	−1.3%
256	19284 → 19306	+0.1%	110363 → 110517	+0.1%	−0.5%	114.3 → 114.2	−0.1%

Mean: +2.08% total throughput, −1.94% median e2e latency. Gains are ~2–3% at conc 4–64 (decode-time bf16 GEMMs at small M), tapering to ~0 at conc=256 (memory/KV-bound saturation). No regressions.

A prefill-bound run (IL=70000 / OL=300) showed ~0–1% as expected — that workload is attention/MoE/bandwidth-bound and the heavy expert compute is on the (untuned) MXFP4 path.

Accuracy: pure config addition — no model or runtime logic changed. During tuning each candidate kernel is validated against the reference GEMM within err_ratio ≤ 0.05, so selected kernels are numerically equivalent; no accuracy impact.

Submission Checklist

Tuning configs generated on gfx950 (MI35x), cu_num=256.
End-to-end before/after validated (IL=8192/OL=1024).
Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-06-28T06:51:19Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3974 --add-label <label>

github-actions · 2026-06-28T07:12:31Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3974 --add-label <label>

yichiche requested a review from a team June 28, 2026 06:51

[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs

6f8e6f2

yichiche force-pushed the add-qwen3_5-397b-bf16-gemm-tuning branch from 1342ee1 to 6f8e6f2 Compare June 28, 2026 07:03

yichiche closed this Jun 28, 2026

yichiche reopened this Jun 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs#3974

[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs#3974
yichiche wants to merge 1 commit into
ROCm:mainfrom
yichiche:add-qwen3_5-397b-bf16-gemm-tuning

yichiche commented Jun 28, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yichiche commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

github-actions Bot commented Jun 28, 2026

🏷️ CI Guide

Uh oh!

github-actions Bot commented Jun 28, 2026

🏷️ CI Guide

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yichiche commented Jun 28, 2026 •

edited

Loading