Skip to content

[test] test_topk_plain: parametrize sweep to fix collection-time OOM#3934

Open
JohnQinAMD wants to merge 1 commit into
mainfrom
fix/topk-plain-collection-oom
Open

[test] test_topk_plain: parametrize sweep to fix collection-time OOM#3934
JohnQinAMD wants to merge 1 commit into
mainfrom
fix/topk-plain-collection-oom

Conversation

@JohnQinAMD

@JohnQinAMD JohnQinAMD commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

CI runs each op test as a script (python3 <file> in .github/scripts/aiter_test.sh), not via pytest. test_topk_plain.py sweeps 84 cases up to 3072 × 131072 fp32 and never frees tensors between cases, so on a memory-pressured gfx950 runner it OOMs mid-sweep and the whole file exits non-zero — producing intermittent "Standard Tests (MI35X, 8, 5)" failures unrelated to the code under test. On a clean GPU it passes; the failure is purely residual-memory dependent.

Fix (keeps the python3 <file> contract)

  • Free each case's tensors (del + gc.collect() + torch.cuda.empty_cache()) so peak memory is one case, not the whole sweep — the actual OOM fix.
  • Guard the sweep under if __name__ == "__main__": so import is side-effect-free.
  • num_iters 1000 → 100 — this is a correctness check, not a perf gate; the high iteration count made it the slowest file in its shard for no benefit.
  • Vectorize the per-row permutation (drops a batch_size-long Python loop).
  • Assert on checkAllclose's returned error ratio — it does not raise, so the test previously would have passed even if topk_plain were incorrect.
  • Summary table falls back to df.to_string when tabulate is absent, so the informational summary can never fail the run.

Validation

python3 op_tests/test_topk_plain.py on MI355X (gfx950): all 84 cases run, all passed~ (err 0), summary prints, exit 0.

Addresses both Copilot review comments (script-style no-op under python3; assert on the error ratio).

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@JohnQinAMD JohnQinAMD requested review from a team and Copilot June 25, 2026 16:04
@github-actions

Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3934 --add-label <label>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors op_tests/test_topk_plain.py to avoid collection/import-time GPU OOMs caused by a module-level sweep that allocates tensors for all cases up front. It converts the sweep into isolated, lazy-per-case execution using pytest parameterization, adds teardown to free GPU memory between cases, and preserves the end-of-run performance summary.

Changes:

  • Replace the module-level sweep loop with @pytest.mark.parametrize test cases to prevent collection-time allocations and speed up collection.
  • Add an autouse fixture to GC + torch.cuda.empty_cache() between cases and a session-scoped fixture to emit a single markdown perf summary (with a safe fallback when tabulate is missing).
  • Vectorize the input permutation generation and reduce iteration counts (1000 → 20) to keep this as a correctness-focused test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread op_tests/test_topk_plain.py Outdated
Comment thread op_tests/test_topk_plain.py Outdated
@JohnQinAMD JohnQinAMD force-pushed the fix/topk-plain-collection-oom branch 4 times, most recently from b427fd5 to 7c7c7f6 Compare June 26, 2026 05:00
@zufayu zufayu self-requested a review June 26, 2026 08:53
CI runs each op test as a script (`python3 <file>` in aiter_test.sh), not via
pytest. test_topk_plain.py sweeps 84 cases up to 3072x131072 fp32; with no
cleanup between cases, a memory-pressured gfx950 runner OOMs mid-sweep and the
whole file exits non-zero -> intermittent shard failures unrelated to the code
under test.

Fix (keeps the `python3 <file>` contract):
- free each case's tensors (del + gc + torch.cuda.empty_cache()) so peak memory
  is one case, not the whole sweep
- guard the sweep under `if __name__ == "__main__":` (clean import)
- num_iters 1000 -> 100 (correctness check, not a perf gate; was the slowest
  file in its shard)
- vectorize the per-row permutation (drops a batch_size-long Python loop)
- assert on checkAllclose's returned error ratio (it does NOT raise) so an
  incorrect topk_plain actually fails CI instead of silently passing
- summary table falls back to df.to_string when `tabulate` is absent

Validated: `python3 op_tests/test_topk_plain.py` runs all 84 cases, all pass
(err 0), summary prints, exit 0 on MI355X (gfx950).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@JohnQinAMD JohnQinAMD force-pushed the fix/topk-plain-collection-oom branch from 7c7c7f6 to 9a2655e Compare June 27, 2026 05:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants