[Configs] DSv3.2 gfx942 (MI325X): tuned a8w8 blockscale GEMM + FMoE configs (TP8) by frida-andersson · Pull Request #3951 · ROCm/aiter

frida-andersson · 2026-06-26T07:05:02Z

Summary

Tuned a8w8 blockscale GEMM + FMoE configs for DeepSeek-V3.2 on gfx942 (MI325X), TP8. Config data only — no kernel/code changes.

Changes

aiter/configs/a8w8_blockscale_tuned_gemm.csv (+525): new per-M autotuned rows for the DSv3.2 TP8 GEMM shapes (attention-projection + MLA/dense). All 525 are new (M,N,K,cu_num,gfx) shapes — none overwrite existing rows.
aiter/configs/tuned_fmoe.csv (+46): new autotuned FMoE rows for the DSv3.2 expert shapes (all new shapes).
aiter/configs/model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv (−62): removes stale, coarse override rows for shapes now covered by the main CSV — 512×7168, 256×7168, 2112×7168, 3072×1536, 4096×512, 7168×2048.

Why remove the ds_v3 overrides?

The loader (aiter/jit/core.py) merges a8w8_blockscale_tuned_gemm.csv with every model_configs/*a8w8_blockscale_tuned_gemm*.csv and dedups on (M,N,K,cu_num,gfx) against the untuned key file. The old ds_v3 rows for these shapes pinned a single coarse kernel across all M; keeping them alongside the new per-M autotuned rows would (a) trip the loader's duplicate-shape guard at load time and (b) shadow the better configs. The retained main-CSV rows are strictly faster (lower us) for every one of the 62 affected points.
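The collision described above can be illustrated with a small sketch. This is not the code in aiter/jit/core.py: the helper name `merge_tuned_configs`, the inline CSV text, the abbreviated column list, the kernel names, and the timings are all made up for illustration. Only the key tuple (M,N,K,cu_num,gfx) comes from the description.

```python
import csv
import io

# Key columns taken from the PR description; everything else is illustrative.
KEY_COLUMNS = ("M", "N", "K", "cu_num", "gfx")


def merge_tuned_configs(files):
    """Merge CSV texts keyed on KEY_COLUMNS and raise if a key repeats.

    `files` maps a file name to its CSV text (hypothetical helper, not aiter's API).
    """
    merged = {}
    owner = {}
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = tuple(row[c] for c in KEY_COLUMNS)
            if key in merged:
                raise ValueError(f"duplicate shape {key} in {owner[key]} and {name}")
            merged[key] = row
            owner[key] = name
    return merged


# Made-up rows: the same shape appears in the main CSV and in a model override.
MAIN = "gfx,cu_num,M,N,K,kernelName,us\ngfx942,304,2,256,7168,per_m_kernel,9.9\n"
OVERRIDE = "gfx,cu_num,M,N,K,kernelName,us\ngfx942,304,2,256,7168,coarse_kernel,12.0\n"

try:
    merge_tuned_configs({"main.csv": MAIN, "ds_v3.csv": OVERRIDE})
    collided = False
except ValueError as exc:
    collided = True
    print(exc)
```

With the override row removed, the same call returns a single merged entry for the shape, which is the state this PR aims for.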

Validation

0 within-file duplicate keys in all three files.
0 cross-file duplicate shapes after merge for both the full a8w8_blockscale GEMM set and the full tuned_fmoe set (replicating the loader's dedup).
Schema matches the current upstream CSV columns.
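A minimal sketch of how the within-file and schema checks could be reproduced, assuming the CSVs carry a header row that includes the key columns. The helper names, the sample rows, and the abbreviated `UPSTREAM_HEADER` are hypothetical, not the script used for this PR.

```python
import csv
import io
from collections import Counter

KEY_COLUMNS = ("M", "N", "K", "cu_num", "gfx")


def within_file_duplicates(text):
    """Return the keys that appear more than once in one CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(tuple(row[c] for c in KEY_COLUMNS) for row in reader)
    return [key for key, n in counts.items() if n > 1]


def header_of(text):
    """Return the column names from the first line of a CSV text."""
    return next(csv.reader(io.StringIO(text)))


# Made-up sample with an abbreviated column list.
SAMPLE = (
    "gfx,cu_num,M,N,K,us\n"
    "gfx942,304,2,256,7168,9.9\n"
    "gfx942,304,4,256,7168,9.2\n"
)
UPSTREAM_HEADER = ["gfx", "cu_num", "M", "N", "K", "us"]  # assumed, abbreviated

print("duplicate keys:", within_file_duplicates(SAMPLE))
print("schema matches:", header_of(SAMPLE) == UPSTREAM_HEADER)
```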

Add gfx942/304-CU tuned configs for DeepSeek-V3.2 TP8/EP8 (topk=9), appended to the existing config schema (no full-file rewrite):

- a8w8_blockscale_tuned_gemm.csv: +138 decode/prefill rows incl. split-K winners for the large-K shape (N=4608/K=7168) at M<=64.
- tuned_fmoe.csv: +46 rows for the DSv3.2 MoE shape (7168/2048, topk=9), CK 2-stage; small-token rows kept on CK 2-stage to avoid the asm 1-stage decode regression.

Measured: TTFT -6..-42% across workloads; decode TPOT ~flat at short context (bandwidth-bound), long-context decode gains from CU-count fix.

Signed-off-by: Frida Andersson <fanderss@amd.com>

Extend the gfx942/304-CU DSv3.2 tuning with autotuned configs for four per-step a8w8 blockscale GEMM shapes that previously fell back to a generic kernel:

- a8w8_blockscale_tuned_gemm.csv: +181 rows (M=1..16384) for N,K = 576,7168 (MLA kv_a_proj_with_mqa), 1536,7168, 512,7168, 7168,256.
- model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv: drop 11 stale 512,7168 overrides that hardcoded one generic kernel for all M, so the newly autotuned main-CSV rows take effect.

Signed-off-by: Frida Andersson <fanderss@amd.com>

… TP8 decode

Adds full fine-M sweep (CU=304) for the five a8w8 blockscale GEMM shapes that previously had only coarse/default M coverage and were falling back to default tiles at decode time: 2112x7168, 7168x2048, 3072x1536, 4096x512, 256x7168 (q_a/q_b, kv_a/kv_b, o_proj). 206 net-new rows.

github-actions · 2026-06-26T07:05:25Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3951 --add-label <label>

…nfigs

Remove 51 stale rows from a8w8_blockscale_tuned_gemm_ds_v3.csv for shapes 256x7168, 2112x7168, 3072x1536, 4096x512, and 7168x2048. The previous commits added per-M autotuned rows for these shapes to the main a8w8_blockscale_tuned_gemm.csv.

The loader merges the main config with every model_configs/*a8w8_blockscale_tuned_gemm*.csv and dedups on (M,N,K,cu_num,gfx) against the untuned key file, so keeping both copies makes the merge flag duplicate shapes and raise at load time. The retained main-CSV rows are strictly faster (lower us) for all 51 points; the dropped override rows pinned one coarse kernel across all M.

This mirrors the earlier 512x7168 cleanup and leaves zero cross-file shape collisions.

yzhou103 · 2026-06-26T07:25:46Z

+gfx942,304,2,256,7168,ck,6,3,9.931,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,0.74,186.32,0.0
+gfx942,304,4,256,7168,ck,6,3,9.2475,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,1.59,201.75,0.0
+gfx942,304,8,256,7168,ck,6,3,10.5904,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,2.77,179.07,0.0015
+gfx942,304,16,256,7168,ck,6,3,10.8193,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,5.43,180.96,0.0
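The trailing columns of these rows can be sanity-checked with a little arithmetic. The sketch below assumes the column order gfx, cu_num, M, N, K, library, kernel id, split-K, time in microseconds, kernel name, TFLOPS, bandwidth in GB/s, error ratio. That order is inferred from the values and is not stated in the PR. It also assumes 1-byte inputs and a 2-byte output when counting bytes moved.

```python
# First row of the quoted diff, without the leading "+".
ROW = (
    "gfx942,304,2,256,7168,ck,6,3,9.931,"
    "a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,"
    "0.74,186.32,0.0"
)


def gemm_metrics(m, n, k, us):
    """Return (TFLOPS, GB/s) for an M x N x K GEMM that takes `us` microseconds."""
    seconds = us * 1e-6
    flops = 2 * m * n * k                    # one multiply and one add per MAC
    bytes_moved = m * k + n * k + 2 * m * n  # assumed: 1-byte A and B, 2-byte C
    return round(flops / seconds / 1e12, 2), round(bytes_moved / seconds / 1e9, 2)


fields = ROW.split(",")
m, n, k = (int(x) for x in fields[2:5])    # assumed positions of M, N, K
us = float(fields[8])                      # assumed position of the timing
reported = (float(fields[10]), float(fields[11]))
computed = gemm_metrics(m, n, k, us)
print(computed, reported)
```

Under these assumptions the computed pair agrees with the reported one for all four quoted rows, which supports reading the ninth field as the latency that the "lower us" comparison refers to.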


Shapes specific to a model should be written to the configs under configs/model_configs.

Also, it is not necessary to tune every shape: M is padded when looking up the configs.
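To make the padded-M point concrete, here is a minimal sketch of a lookup that falls back to a padded M. The padding rule used here (round up to the next power of two) is an assumption chosen for illustration and is not confirmed to be the rule aiter applies. `pad_m`, `lookup_kernel`, and the table contents are hypothetical names.

```python
def pad_m(m):
    """Round m up to the next power of two (assumed rule, for illustration only)."""
    if m < 1:
        raise ValueError("m must be positive")
    return 1 << (m - 1).bit_length()


def lookup_kernel(table, m, n, k):
    """Look up a tuned kernel by exact M first, then by padded M; None if absent."""
    for candidate in (m, pad_m(m)):
        kernel = table.get((candidate, n, k))
        if kernel is not None:
            return kernel
    return None


# Hypothetical table tuned only at power-of-two M values.
TABLE = {(4, 256, 7168): "kernel_m4", (8, 256, 7168): "kernel_m8"}

print(lookup_kernel(TABLE, 3, 256, 7168))  # M=3 has no row of its own, pads to 4
print(lookup_kernel(TABLE, 9, 256, 7168))  # pads to 16, which is not in the table
```

If the lookup works along these lines, a row tuned at one M also serves every smaller M that pads to it, which is why a sweep over every individual M may add little.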

frida-andersson added 3 commits June 26, 2026 06:59

frida-andersson requested a review from a team June 26, 2026 07:05

yzhou103 reviewed Jun 26, 2026


zufayu requested a review from yifehuan June 26, 2026 08:53

frida-andersson wants to merge 4 commits into ROCm:main from frida-andersson:dsv32-gfx942-tuned-configs
