Dev/fly pa reduce jit build #3944

Open
Bernard-Liu wants to merge 4 commits into main from dev/fly_pa_reduce_jit_build
@Bernard-Liu

Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@Bernard-Liu Bernard-Liu requested review from a team and Copilot June 26, 2026 02:52
@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3944 --add-label <label>

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Gluon paged-attention PS-reduce path to shrink the runtime JIT “build wall”. It eagerly compiles the FlyDSL partition-count variants and tightens compile-time specialization in the FlyDSL kernel.

Changes:

  • Introduces eager precompilation of FlyDSL PS-reduce sibling variants (partition counts 1..8) on first use per config, so later calls do not trigger JIT compilation mid-run.
  • Uses flydsl.expr.const_expr for several kernel conditionals to force compile-time branching, and adjusts one arith.bitcast call to unwrap the source value.
  • Refactors PS reduce wrapper selection to try FlyDSL first (when available), then C++ PS-reduce, then Triton.

# variants so a later call landing on a different get_recommended_splits value
# hits the cache instead of JIT-compiling mid-run. Thread-safe: no global env
# toggling, and the test-and-add is locked so only one thread does the work.
sig = tuple(sorted(compile_kwargs.items()))
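
The locked test-and-add described in the comment above might look roughly like the sketch below. Only `sig` and `compile_kwargs` come from the quoted snippet; `precompile_siblings`, `compile_variant`, the `num_partitions` keyword, and the module-level set and lock are assumptions made for illustration, not the PR's actual code.

```python
import threading
from typing import Callable, Dict, Hashable, Iterable, Set, Tuple

# Hypothetical module-level state: config signatures already handled.
_precompiled: Set[Tuple[Hashable, ...]] = set()
_precompiled_lock = threading.Lock()


def precompile_siblings(
    compile_kwargs: Dict[str, Hashable],
    compile_variant: Callable[..., None],
    partition_counts: Iterable[int] = range(1, 9),
) -> bool:
    """Compile every partition-count variant once per config signature.

    Returns True if this call did the compilation, False if the config
    had already been claimed by an earlier call or another thread.
    """
    # Order-independent, hashable key for this config.
    sig = tuple(sorted(compile_kwargs.items()))

    # Only the membership test and insert are locked, so exactly one
    # caller claims the signature; compilation itself runs unlocked.
    with _precompiled_lock:
        if sig in _precompiled:
            return False
        _precompiled.add(sig)

    # Assumes compile_kwargs does not itself contain "num_partitions".
    for n in partition_counts:
        compile_variant(num_partitions=n, **compile_kwargs)
    return True
```

Because the signature is added before compilation finishes, a concurrent caller with the same config returns immediately rather than waiting; whether that caller should block until the variants are ready is a design choice this sketch does not settle.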