
[Configs] DSv3.2 gfx942 (MI325X): tuned a8w8 blockscale GEMM + FMoE configs (TP8) #3951

Open
frida-andersson wants to merge 4 commits into ROCm:main from frida-andersson:dsv32-gfx942-tuned-configs

frida-andersson commented Jun 26, 2026

Summary

Tuned a8w8 blockscale GEMM + FMoE configs for DeepSeek-V3.2 on gfx942 (MI325X), TP8. Config data only — no kernel/code changes.

Changes

  • aiter/configs/a8w8_blockscale_tuned_gemm.csv (+525): new per-M autotuned rows for the DSv3.2 TP8 GEMM shapes (attention-projection + MLA/dense). All 525 are new (M,N,K,cu_num,gfx) shapes — none overwrite existing rows.
  • aiter/configs/tuned_fmoe.csv (+46): new autotuned FMoE rows for the DSv3.2 expert shapes (all new shapes).
  • aiter/configs/model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv (−62): removes stale, coarse override rows for shapes now covered by the main CSV — 512×7168, 256×7168, 2112×7168, 3072×1536, 4096×512, 7168×2048.

Why remove the ds_v3 overrides?

The loader (aiter/jit/core.py) merges a8w8_blockscale_tuned_gemm.csv with every model_configs/*a8w8_blockscale_tuned_gemm*.csv and dedups on (M,N,K,cu_num,gfx) against the untuned key file. The old ds_v3 rows for these shapes pinned a single coarse kernel across all M; keeping them alongside the new per-M autotuned rows would (a) trip the loader's duplicate-shape guard at load time and (b) shadow the better configs. The retained main-CSV rows are strictly faster (lower us) for every one of the 62 affected points.

Validation

  • 0 within-file duplicate keys in all three files.
  • 0 cross-file duplicate shapes after merge for both the full a8w8_blockscale GEMM set and the full tuned_fmoe set (replicating the loader's dedup).
  • Schema matches the current upstream CSV columns.
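The duplicate checks above can be sketched roughly as follows. This is a minimal illustration, not the loader code in aiter/jit/core.py: the helper names are hypothetical, the column names are assumed to match the (M,N,K,cu_num,gfx) key, and the rows are toy data containing only the key columns.

```python
import csv
import io
from collections import Counter

# Assumed key columns, taken from the (M,N,K,cu_num,gfx) tuple named in this PR.
KEY = ("M", "N", "K", "cu_num", "gfx")


def shape_keys(csv_text):
    """Return the shape key of every row in a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [tuple(row[col] for col in KEY) for row in reader]


def within_file_duplicates(csv_text):
    """Keys that appear more than once inside a single file."""
    counts = Counter(shape_keys(csv_text))
    return {key: n for key, n in counts.items() if n > 1}


def cross_file_duplicates(main_text, override_text):
    """Keys present in both the main CSV and a model_configs override CSV."""
    return set(shape_keys(main_text)) & set(shape_keys(override_text))


# Toy rows (key columns only), standing in for the real config files.
MAIN = """gfx,cu_num,M,N,K
gfx942,304,2,256,7168
gfx942,304,4,256,7168
"""
OVERRIDE = """gfx,cu_num,M,N,K
gfx942,304,4,256,7168
"""

print(within_file_duplicates(MAIN))
print(cross_file_duplicates(MAIN, OVERRIDE))
```

In this toy case the override row for M=4 collides with the main CSV, which is the situation the removal of the ds_v3 override rows is meant to avoid.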

Add gfx942/304-CU tuned configs for DeepSeek-V3.2 TP8/EP8 (topk=9),
appended to the existing config schema (no full-file rewrite):
- a8w8_blockscale_tuned_gemm.csv: +138 decode/prefill rows incl. split-K
  winners for the large-K shape (N=4608/K=7168) at M<=64.
- tuned_fmoe.csv: +46 rows for the DSv3.2 MoE shape (7168/2048, topk=9),
  CK 2-stage; small-token rows kept on CK 2-stage to avoid the asm
  1-stage decode regression.

Measured: TTFT -6..-42% across workloads; decode TPOT ~flat at short
context (bandwidth-bound), long-context decode gains from CU-count fix.

Signed-off-by: Frida Andersson <fanderss@amd.com>

Extend the gfx942/304-CU DSv3.2 tuning with autotuned configs for four
per-step a8w8 blockscale GEMM shapes that previously fell back to a
generic kernel:
- a8w8_blockscale_tuned_gemm.csv: +181 rows (M=1..16384) for
  N,K = 576,7168 (MLA kv_a_proj_with_mqa), 1536,7168, 512,7168, 7168,256.
- model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv: drop 11 stale
  512,7168 overrides that hardcoded one generic kernel for all M, so the
  newly autotuned main-CSV rows take effect.

Signed-off-by: Frida Andersson <fanderss@amd.com>
… TP8 decode

Adds full fine-M sweep (CU=304) for the five a8w8 blockscale GEMM shapes
that previously had only coarse/default M coverage and were falling back to
default tiles at decode time: 2112x7168, 7168x2048, 3072x1536, 4096x512,
256x7168 (q_a/q_b, kv_a/kv_b, o_proj). 206 net-new rows.

frida-andersson requested a review from a team on June 26, 2026 07:05

@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3951 --add-label <label>

…nfigs

Remove 51 stale rows from a8w8_blockscale_tuned_gemm_ds_v3.csv for shapes
256x7168, 2112x7168, 3072x1536, 4096x512, and 7168x2048. The previous
commits added per-M autotuned rows for these shapes to the main
a8w8_blockscale_tuned_gemm.csv. The loader merges the main config with
every model_configs/*a8w8_blockscale_tuned_gemm*.csv and dedups on
(M,N,K,cu_num,gfx) against the untuned key file, so keeping both copies
makes the merge flag duplicate shapes and raise at load time.

The retained main-CSV rows are strictly faster (lower us) for all 51
points; the dropped override rows pinned one coarse kernel across all M.
This mirrors the earlier 512x7168 cleanup and leaves zero cross-file
shape collisions.

gfx942,304,2,256,7168,ck,6,3,9.931,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,0.74,186.32,0.0
gfx942,304,4,256,7168,ck,6,3,9.2475,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,1.59,201.75,0.0
gfx942,304,8,256,7168,ck,6,3,10.5904,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,2.77,179.07,0.0015
gfx942,304,16,256,7168,ck,6,3,10.8193,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,5.43,180.96,0.0
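
As a sanity check on rows like the four above, the throughput and bandwidth figures can be recomputed from the shape and the latency. The sketch below assumes the columns are laid out as gfx, cu_num, M, N, K, ..., us, kernel name, TFLOPS, GB/s, error ratio, and that bandwidth counts one byte per input element and two bytes per output element. Both are assumptions inferred from the numbers, not taken from the aiter source; `gemm_metrics` is a hypothetical helper.

```python
def gemm_metrics(M, N, K, us):
    """Return (TFLOPS, GB/s) for an MxNxK GEMM that took `us` microseconds.

    Assumes 1-byte (8-bit) A and B operands and a 2-byte output,
    and ignores the block-scale tensors.
    """
    flops = 2 * M * N * K                    # one multiply + one add per MAC
    bytes_moved = M * K + N * K + 2 * M * N  # A + B + output
    seconds = us * 1e-6
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# (M, N, K, us, reported TFLOPS, reported GB/s) copied from the rows above.
ROWS = [
    (2, 256, 7168, 9.931, 0.74, 186.32),
    (4, 256, 7168, 9.2475, 1.59, 201.75),
    (8, 256, 7168, 10.5904, 2.77, 179.07),
    (16, 256, 7168, 10.8193, 5.43, 180.96),
]

for M, N, K, us, reported_tflops, reported_bw in ROWS:
    tflops, bw = gemm_metrics(M, N, K, us)
    print(f"M={M}: {tflops:.2f} TFLOPS (csv {reported_tflops}), "
          f"{bw:.2f} GB/s (csv {reported_bw})")
```

Under these assumptions the recomputed values agree with the CSV columns to within rounding for all four rows.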

Shapes specific to a model should be written to the configs under configs/model_configs.

Also, it is not necessary to tune every shape: M is padded when looking up the configs.
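
To illustrate the padded-M point, here is a minimal sketch of a lookup that rounds M up before consulting the tuned table. The rounding rule (next power of two), the helper names, and the kernel labels are all hypothetical; the actual padding scheme in aiter may differ.

```python
def next_power_of_two(m):
    """Smallest power of two that is >= m (assumed padding rule)."""
    return 1 if m <= 1 else 1 << (m - 1).bit_length()


def lookup_config(table, M, N, K):
    """Look up a tuned config by padded M; None means fall back to a default."""
    return table.get((next_power_of_two(M), N, K))


# Toy table keyed by (padded M, N, K); the kernel labels are placeholders.
TABLE = {
    (2, 256, 7168): "kernel_for_m2",
    (4, 256, 7168): "kernel_for_m4",
    (8, 256, 7168): "kernel_for_m8",
    (16, 256, 7168): "kernel_for_m16",
}

# M=5, 6, 7 and 8 all resolve to the M=8 row, so only M=8 needs tuning.
for m in (5, 6, 7, 8):
    print(m, lookup_config(TABLE, m, 256, 7168))
```

With a scheme like this, only the padded values of M need tuned rows; intermediate values reuse the row of the bucket they round up to.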

zufayu requested a review from yifehuan on June 26, 2026 08:53