Skip to content

[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs#3974

Open
yichiche wants to merge 1 commit into
ROCm:mainfrom
yichiche:add-qwen3_5-397b-bf16-gemm-tuning
Open

[Tune] Add qwen3.5-397B MXFP4 a16w16 GEMM tuning configs#3974
yichiche wants to merge 1 commit into
ROCm:mainfrom
yichiche:add-qwen3_5-397b-bf16-gemm-tuning

Conversation

@yichiche

@yichiche yichiche commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Motivation

Add a16w16 (bf16) GEMM tuning configs for Qwen3.5-397B-A17B-MoE-MXFP4 (TP2). In this MXFP4 MoE model the experts run on the fp4 fused-MoE path; the remaining dense bf16 GEMMs (attention q/k/v/o, router gate, KV proj, and the MTP/EAGLE draft-layer projections) go through the a16w16 path and benefit from per-shape kernel tuning.

Technical Details

Adds two files under aiter/configs/model_configs/:

  • qwen3_5_397b_untuned_gemm.csv — 188 shapes to tune
  • qwen3_5_397b_bf16_tuned_gemm.csv — 187 tuned shapes (gfx950, cu_num=256)

Shape selection (TP2):

  • 7 target bf16 GEMM shapes: 10240×4096, 8704×4096, 4096×4096, 4096×512, 1024×4096, 512×4096, 64×4096.
  • MTP / EAGLE draft-layer additions: q_proj 8192×4096 (output-gated), fc fusion / o_proj 4096×8192 (K=8192), sharded kv 256×4096.
  • Two 4096×512 shapes at M=1024/4096 were omitted because they already exist in kimi_bf16_tuned_gemm.csv (the repo enforces one owner per shape across bf16 configs; qwen reuses kimi's entry at runtime on the same gfx950).
  • M grid sized for decode + EAGLE verify (draft runs at 1×bs, target verify at num_draft_tokens×bs): 1,2,4,8,16,32,48,64,96,128,192,256,384,512,768,1024 (multiples of the runtime's 16/32 getPaddedM lookup granularity), plus a prefill tail 2048,4096,8192. 16384/32768 intentionally dropped (the gl=1 fallback caps padded_M at 8192 for N>4096).

Tuned with csrc/gemm_a16w16/gemm_a16w16_tune.py, --libtype asm,opus,flydsl,triton,torch,skinny --shape_grouped. hipBLASLt is excluded (opt-in --with-hipblaslt not used). flydsl wins ~68% of shapes.

Test Plan

End-to-end before/after serving sweep on Qwen3.5-397B-A17B-MoE-MXFP4 (TP2, --attention-backend aiter, AITER_FLYDSL_FORCE=1) via sglang.bench_serving, dataset=random, IL=8192 / OL=1024, concurrency 4→256 (num_prompts = conc×10). before = run without these tuned configs (default kernels); after = with the tuned configs.

Test Result

Before → After (tuned), IL=8192 / OL=1024:

conc total tok/s Δ median e2e (ms) Δ p99 e2e Δ TPOT (ms) Δ
4 3380 → 3481 +3.0% 9684 → 9405 −2.9% −2.9% 10.2 → 9.8 −3.1%
8 5160 → 5302 +2.8% 12601 → 12249 −2.8% −2.5% 13.3 → 12.9 −2.8%
16 7697 → 7935 +3.1% 17082 → 16587 −2.9% −2.0% 17.6 → 17.1 −3.0%
32 10131 → 10333 +2.0% 25918 → 25506 −1.6% −1.7% 27.2 → 26.7 −2.0%
64 12781 → 13068 +2.2% 41286 → 40390 −2.2% −2.4% 43.1 → 42.2 −2.2%
128 15929 → 16141 +1.3% 66502 → 65600 −1.4% −0.9% 69.4 → 68.5 −1.3%
256 19284 → 19306 +0.1% 110363 → 110517 +0.1% −0.5% 114.3 → 114.2 −0.1%

Mean: +2.08% total throughput, −1.94% median e2e latency. Gains are ~2–3% at conc 4–64 (decode-time bf16 GEMMs at small M), tapering to ~0 at conc=256 (memory/KV-bound saturation). No regressions.

A prefill-bound run (IL=70000 / OL=300) showed ~0–1% as expected — that workload is attention/MoE/bandwidth-bound and the heavy expert compute is on the (untuned) MXFP4 path.

Accuracy: pure config addition — no model or runtime logic changed. During tuning each candidate kernel is validated against the reference GEMM within err_ratio ≤ 0.05, so selected kernels are numerically equivalent; no accuracy impact.

Submission Checklist

@yichiche yichiche requested a review from a team June 28, 2026 06:51
@github-actions

Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3974 --add-label <label>

@yichiche yichiche force-pushed the add-qwen3_5-397b-bf16-gemm-tuning branch from 1342ee1 to 6f8e6f2 Compare June 28, 2026 07:03
@yichiche yichiche closed this Jun 28, 2026
@yichiche yichiche reopened this Jun 28, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3974 --add-label <label>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant