[Configs] DSv3.2 gfx942 (MI325X): tuned a8w8 blockscale GEMM + FMoE configs (TP8) #3951
Open
frida-andersson wants to merge 4 commits into
Add gfx942/304-CU tuned configs for DeepSeek-V3.2 TP8/EP8 (topk=9), appended to the existing config schema (no full-file rewrite):
- a8w8_blockscale_tuned_gemm.csv: +138 decode/prefill rows, incl. split-K winners for the large-K shape (N=4608/K=7168) at M<=64.
- tuned_fmoe.csv: +46 rows for the DSv3.2 MoE shape (7168/2048, topk=9), CK 2-stage; small-token rows kept on CK 2-stage to avoid the asm 1-stage decode regression.

Measured: TTFT -6..-42% across workloads; decode TPOT ~flat at short context (bandwidth-bound), long-context decode gains from CU-count fix.

Signed-off-by: Frida Andersson <fanderss@amd.com>
Extend the gfx942/304-CU DSv3.2 tuning with autotuned configs for four per-step a8w8 blockscale GEMM shapes that previously fell back to a generic kernel:
- a8w8_blockscale_tuned_gemm.csv: +181 rows (M=1..16384) for N,K = 576,7168 (MLA kv_a_proj_with_mqa), 1536,7168, 512,7168, 7168,256.
- model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv: drop 11 stale 512,7168 overrides that hardcoded one generic kernel for all M, so the newly autotuned main-CSV rows take effect.

Signed-off-by: Frida Andersson <fanderss@amd.com>
… TP8 decode

Adds a full fine-M sweep (CU=304) for the five a8w8 blockscale GEMM shapes that previously had only coarse/default M coverage and were falling back to default tiles at decode time: 2112x7168, 7168x2048, 3072x1536, 4096x512, 256x7168 (q_a/q_b, kv_a/kv_b, o_proj). 206 net-new rows.
…nfigs

Remove 51 stale rows from a8w8_blockscale_tuned_gemm_ds_v3.csv for shapes 256x7168, 2112x7168, 3072x1536, 4096x512, and 7168x2048.

The previous commits added per-M autotuned rows for these shapes to the main a8w8_blockscale_tuned_gemm.csv. The loader merges the main config with every model_configs/*a8w8_blockscale_tuned_gemm*.csv and dedups on (M,N,K,cu_num,gfx) against the untuned key file, so keeping both copies makes the merge flag duplicate shapes and raise at load time.

The retained main-CSV rows are strictly faster (lower us) for all 51 points; the dropped override rows pinned one coarse kernel across all M. This mirrors the earlier 512x7168 cleanup and leaves zero cross-file shape collisions.
yzhou103 reviewed Jun 26, 2026
gfx942,304,2,256,7168,ck,6,3,9.931,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,0.74,186.32,0.0
gfx942,304,4,256,7168,ck,6,3,9.2475,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,1.59,201.75,0.0
gfx942,304,8,256,7168,ck,6,3,10.5904,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,2.77,179.07,0.0015
gfx942,304,16,256,7168,ck,6,3,10.8193,a8w8_blockscale_1x128x128_256x16x64x128_8x16_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4_1x1_intrawave_v1,5.43,180.96,0.0
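As a sanity check on the rows above, the two throughput columns can be re-derived from the shape and the measured time. The sketch below is not part of the PR. It assumes the fields are ordered gfx, cu_num, M, N, K, libtype, kernelId, splitK, us, kernelName, tflops, bw, errRatio (the CSV header is not shown here, so these names are a guess), that the a8w8 inputs take 1 byte per element and the output 2 bytes per element, and that the block scales are ignored in the byte count. Under those assumptions the recomputed values agree with the recorded ones to two decimals.

```python
# (M, N, K, us, recorded tflops, recorded bw) copied from the four rows above.
ROWS = [
    (2, 256, 7168, 9.931, 0.74, 186.32),
    (4, 256, 7168, 9.2475, 1.59, 201.75),
    (8, 256, 7168, 10.5904, 2.77, 179.07),
    (16, 256, 7168, 10.8193, 5.43, 180.96),
]


def gemm_tflops(m, n, k, us):
    """2*M*N*K floating-point operations divided by the measured time."""
    return 2 * m * n * k / (us * 1e-6) / 1e12


def gemm_bw_gbs(m, n, k, us, in_bytes=1, out_bytes=2):
    """Bytes read for A (MxK) and B (NxK) plus bytes written for C (MxN), in GB/s.

    in_bytes=1 and out_bytes=2 are assumptions (8-bit inputs, 16-bit output).
    """
    moved = (m * k + n * k) * in_bytes + m * n * out_bytes
    return moved / (us * 1e-6) / 1e9


for m, n, k, us, tflops, bw in ROWS:
    # Print recomputed vs. recorded values side by side.
    print(m, round(gemm_tflops(m, n, k, us), 2), tflops,
          round(gemm_bw_gbs(m, n, k, us), 2), bw)
```

The bandwidth stays near 180 to 200 GB/s while TFLOPS scales roughly with M, which is what one would expect when the weight read (N×K bytes) dominates at these small M values.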
Shapes that belong to a specific model should go in the config files under configs/model_configs.
Also, it is not necessary to tune every shape: M is padded when the configs are looked up.
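To illustrate the reviewer's point about padded M, here is a minimal sketch of a lookup that rounds a runtime M up to the nearest tuned M for the same (N, K). The padding rule used here (next tuned M at or above the requested one) and the names `build_index` and `lookup_padded` are assumptions made for illustration; the actual padding scheme in the loader is not shown in this PR.

```python
import bisect


def build_index(rows):
    """Group tuned entries by (N, K); rows are (M, N, K, kernel_name) tuples."""
    index = {}
    for m, n, k, kernel in rows:
        index.setdefault((n, k), {})[m] = kernel
    return index


def lookup_padded(index, m, n, k):
    """Return (padded_m, kernel) for the smallest tuned M >= m, or None."""
    by_m = index.get((n, k))
    if not by_m:
        return None  # shape was never tuned
    tuned = sorted(by_m)
    pos = bisect.bisect_left(tuned, m)
    if pos == len(tuned):
        return None  # m is larger than every tuned M
    padded_m = tuned[pos]
    return padded_m, by_m[padded_m]


# Hypothetical table with only two tuned M values for the 256x7168 shape.
index = build_index([
    (16, 256, 7168, "kernel_small"),
    (64, 256, 7168, "kernel_large"),
])
print(lookup_padded(index, 3, 256, 7168))
print(lookup_padded(index, 17, 256, 7168))
```

With a rule like this, a sparse set of tuned M values already covers every M in between, which is why a row per individual M may be redundant.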
Summary
Tuned a8w8 blockscale GEMM + FMoE configs for DeepSeek-V3.2 on gfx942 (MI325X), TP8. Config data only — no kernel/code changes.
Changes
- `aiter/configs/a8w8_blockscale_tuned_gemm.csv` (+525): new per-M autotuned rows for the DSv3.2 TP8 GEMM shapes (attention-projection + MLA/dense). All 525 are new `(M,N,K,cu_num,gfx)` shapes; none overwrite existing rows.
- `aiter/configs/tuned_fmoe.csv` (+46): new autotuned FMoE rows for the DSv3.2 expert shapes (all new shapes).
- `aiter/configs/model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv` (−62): removes stale, coarse override rows for shapes now covered by the main CSV: `512×7168`, `256×7168`, `2112×7168`, `3072×1536`, `4096×512`, `7168×2048`.
Why remove the ds_v3 overrides?
The loader (`aiter/jit/core.py`) merges `a8w8_blockscale_tuned_gemm.csv` with every `model_configs/*a8w8_blockscale_tuned_gemm*.csv` and dedups on `(M,N,K,cu_num,gfx)` against the untuned key file. The old `ds_v3` rows for these shapes pinned a single coarse kernel across all M; keeping them alongside the new per-M autotuned rows would (a) trip the loader's duplicate-shape guard at load time and (b) shadow the better configs. The retained main-CSV rows are strictly faster (lower `us`) for every one of the 62 affected points.
Validation
… `a8w8_blockscale` GEMM set and the full `tuned_fmoe` set (replicating the loader's dedup).
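The cross-file duplicate check described above can be approximated with a short standalone script. This is a rough sketch, not the loader's code from `aiter/jit/core.py`: the function name `find_collisions`, the reduced header `gfx,cu_num,M,N,K,us`, and the override row's `us` value are all made up for illustration. It only shows the idea of keying every row on `(M,N,K,cu_num,gfx)` and reporting keys that appear in more than one place.

```python
import csv
import io
from collections import defaultdict

KEY = ("M", "N", "K", "cu_num", "gfx")


def find_collisions(named_csvs):
    """Map each duplicated key to the (file, us) pairs that define it.

    named_csvs is {file name: CSV text}; keys seen only once are omitted.
    """
    seen = defaultdict(list)
    for name, text in named_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = tuple(row[col] for col in KEY)
            seen[key].append((name, float(row["us"])))
    return {key: hits for key, hits in seen.items() if len(hits) > 1}


# Illustrative data: the override row and its timing are hypothetical.
main_csv = "gfx,cu_num,M,N,K,us\ngfx942,304,2,256,7168,9.931\n"
override_csv = "gfx,cu_num,M,N,K,us\ngfx942,304,2,256,7168,12.0\n"

collisions = find_collisions({"main": main_csv, "ds_v3": override_csv})
for key, hits in collisions.items():
    fastest = min(hits, key=lambda hit: hit[1])
    print(key, "defined in", [name for name, _ in hits], "fastest:", fastest[0])
```

Running the same check after the override rows are removed should report no collisions, which is the state this PR aims for.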