[hotswap][clr] Restrict HotSwap forwarding to validated source/target pairs #7715

Merged
xintin merged 2 commits into ROCm:develop from xintin:users/xintin/clr-prefer-native-over-hotswap
Jun 25, 2026
@xintin xintin commented Jun 23, 2026

Motivation

HotSwap foreign-ISA forwarding in FatBinaryInfo::ExtractFatBinaryUsingCOMGR is gated by amd::hotswap::kSupportedPairs: a (source -> device target) pair must be listed for a foreign source bundle to be forwarded to the HSA loader for transpilation. The forwarding branch is intentionally evaluated before the native/generic branches, so any listed target is routed through the comgr hotswap tool whenever the tool is loaded (HSA_TOOLS_LIB names libamd_comgr_hotswap_tool.so).
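As a rough illustration of the allowlist gate described above, here is a minimal sketch. It is not the CLR source: `IsaPair`, `IsSupportedPair`, and the namespace are hypothetical names, and the listed entries are only an assumption about the pre-change contents inferred from this description. The real `amd::hotswap::kSupportedPairs` may be declared differently.

```cpp
#include <array>
#include <string_view>

namespace allowlist_sketch {

// Hypothetical shape of one allowlist entry: a source ISA and the device
// target it may be forwarded to for transpilation.
struct IsaPair {
  std::string_view source;
  std::string_view target;
};

// ASSUMPTION: pre-change contents inferred from the PR text; not copied
// from the actual amd::hotswap::kSupportedPairs definition.
inline constexpr std::array<IsaPair, 3> kSupportedPairs{{
    {"gfx1250", "gfx942"},
    {"gfx1250", "gfx950"},
    {"gfx1250", "gfx1250"},
}};

// Returns true when (source -> target) is listed, i.e. forwarding is allowed.
constexpr bool IsSupportedPair(std::string_view source,
                               std::string_view target) {
  for (const IsaPair& pair : kSupportedPairs) {
    if (pair.source == source && pair.target == target) return true;
  }
  return false;
}

// A gfx1250 bundle may be forwarded to a gfx942 device under this list,
// while the reverse direction is not listed.
static_assert(IsSupportedPair("gfx1250", "gfx942"));
static_assert(!IsSupportedPair("gfx942", "gfx1250"));

}  // namespace allowlist_sketch
```

The lookup is a plain linear scan, which seems reasonable for a list of this size.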

With gfx942 and gfx950 in the allowlist, fatbins built for those devices (which also contain a gfx1250 bundle) were force-transpiled from the gfx1250 source instead of using their already-correct native code object. That needlessly invokes the tool and crashes a large fraction of HIP workloads on gfx942.

Until per-target transpilation is validated, restrict kSupportedPairs to gfx1250 -> gfx1250 (B0 -> A0) only. gfx942 and gfx950 will be re-added as their transpilation paths are validated. No selection-order or HIP_FORCE_SPIRV_CODEOBJECT changes; the no-tool path is unchanged. Related: #7234.
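The effect of narrowing the allowlist can be sketched as follows. This is a simplified model, not the logic in FatBinaryInfo::ExtractFatBinaryUsingCOMGR: `Choice`, `Listed`, `Select`, and the sample fatbin are hypothetical, the previous allowlist contents are an assumption, generic code objects are omitted, and the model does not distinguish steppings such as B0 and A0.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

namespace selection_sketch {

struct IsaPair {
  std::string_view source;
  std::string_view target;
};

enum class Choice { kForward, kNative, kNone };

// ASSUMPTION: allowlist before this change, inferred from the PR text.
inline constexpr std::array<IsaPair, 3> kPrevious{{
    {"gfx1250", "gfx942"}, {"gfx1250", "gfx950"}, {"gfx1250", "gfx1250"}}};
// Allowlist after this change: gfx1250 -> gfx1250 only.
inline constexpr std::array<IsaPair, 1> kRestricted{{{"gfx1250", "gfx1250"}}};

template <std::size_t N>
constexpr bool Listed(const std::array<IsaPair, N>& allow,
                      std::string_view source, std::string_view target) {
  for (const IsaPair& pair : allow) {
    if (pair.source == source && pair.target == target) return true;
  }
  return false;
}

// Forwarding is checked first (only when the tool is loaded), then native.
template <std::size_t N, std::size_t M>
constexpr Choice Select(const std::array<IsaPair, N>& allow,
                        const std::array<std::string_view, M>& bundles,
                        std::string_view device, bool tool_loaded) {
  if (tool_loaded) {
    for (std::string_view isa : bundles) {
      if (Listed(allow, isa, device)) return Choice::kForward;
    }
  }
  for (std::string_view isa : bundles) {
    if (isa == device) return Choice::kNative;
  }
  return Choice::kNone;
}

// Hypothetical fatbin built for gfx942 that also carries a gfx1250 bundle.
inline constexpr std::array<std::string_view, 2> kFatbin{{"gfx942", "gfx1250"}};

// Previous allowlist: the gfx1250 bundle is forwarded despite a native match.
static_assert(Select(kPrevious, kFatbin, "gfx942", true) == Choice::kForward);
// Restricted allowlist: same order, but gfx942 now falls through to native.
static_assert(Select(kRestricted, kFatbin, "gfx942", true) == Choice::kNative);

}  // namespace selection_sketch
```

Note that the branch order is identical in both cases; only the allowlist contents differ, which matches the statement that no selection-order change is made.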

JIRA ID

NA

Test Plan

Built ROCm via TheRock with the hotswap tool on gfx942 (MI300X). Ran the hip-tests catch suite (full 4191 + a 1048-test shard) with and without HSA_TOOLS_LIB, comparing pass/fail.
Will also run through TheRock CI before merging.

Test Result

With the tool enabled, the suite shows zero new failures versus the no-tool baseline; the remaining failures are pre-existing and reproduce without the tool. No HotSwap: forwarding occurs on gfx942 (the allowlist no longer matches that target).
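The "zero new failures" comparison amounts to checking that every failure seen with the tool also appears in the no-tool baseline. A minimal sketch of that check follows; the helper names and the test names are hypothetical placeholders, not entries from the actual hip-tests run.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

namespace results_sketch {

template <std::size_t N>
constexpr bool Contains(const std::array<std::string_view, N>& names,
                        std::string_view name) {
  for (std::string_view candidate : names) {
    if (candidate == name) return true;
  }
  return false;
}

// Counts failures seen with the tool enabled that are absent from the
// no-tool baseline. Zero means the tool introduced no new failures.
template <std::size_t A, std::size_t B>
constexpr std::size_t CountNewFailures(
    const std::array<std::string_view, A>& with_tool,
    const std::array<std::string_view, B>& baseline) {
  std::size_t count = 0;
  for (std::string_view name : with_tool) {
    if (!Contains(baseline, name)) ++count;
  }
  return count;
}

// PLACEHOLDER test names for illustration only.
inline constexpr std::array<std::string_view, 2> kBaseline{{"Test_A", "Test_B"}};
inline constexpr std::array<std::string_view, 2> kWithTool{{"Test_B", "Test_A"}};
inline constexpr std::array<std::string_view, 2> kRegressed{{"Test_A", "Test_C"}};

// Same failure set in a different order: no new failures.
static_assert(CountNewFailures(kWithTool, kBaseline) == 0);
// Test_C fails only with the tool: one new failure.
static_assert(CountNewFailures(kRegressed, kBaseline) == 1);

}  // namespace results_sketch
```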

Submission Checklist

Copilot AI review requested due to automatic review settings June 23, 2026 22:46
@xintin xintin requested a review from a team as a code owner June 23, 2026 22:46

Copilot AI left a comment

Pull request overview

Reorders fatbinary bundle selection in FatBinaryInfo::ExtractFatBinaryUsingCOMGR so that when the COMGR HotSwap tool is enabled, CLR loads an available native code object first (then generic), and only forwards a foreign-ISA source bundle for HotSwap transpilation if neither native nor generic matches the current device. This avoids unnecessary forwarding/transpilation that can crash workloads when a correct native code object is already present.

Changes:

  • Prefer native code object selection over HotSwap forwarding.
  • Fall back to generic code object selection next.
  • Only forward a foreign-ISA source bundle for HotSwap transpilation as a last resort (when no native/generic match).

@xintin xintin marked this pull request as draft June 23, 2026 22:55
@xintin xintin changed the title from "[hotswap][clr] Prefer native/generic code objects over HotSwap forwarding" to "[hotswap][clr] [hotswap][clr] Restrict HotSwap forwarding to validated source/target pairs" Jun 23, 2026
@xintin xintin changed the title from "[hotswap][clr] [hotswap][clr] Restrict HotSwap forwarding to validated source/target pairs" to "[hotswap][clr] Restrict HotSwap forwarding to validated source/target pairs" Jun 23, 2026
@xintin xintin force-pushed the users/xintin/clr-prefer-native-over-hotswap branch from 84595de to bc72eff Compare June 23, 2026 23:41
@xintin xintin self-assigned this Jun 23, 2026
@xintin xintin marked this pull request as ready for review June 23, 2026 23:41
@xintin xintin requested review from Copilot, lamb-j and nirmie June 23, 2026 23:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

lamb-j commented Jun 24, 2026

We need to update the tool name to libhsa-hotswap.so.

nirmie commented Jun 24, 2026

The libhsa-hotswap.so rename that @lamb-j mentions will be done in #7629.

… pairs

HotSwap foreign-ISA forwarding in FatBinaryInfo::ExtractFatBinaryUsingCOMGR is gated by
amd::hotswap::kSupportedPairs: a (source -> device target) pair must be listed for a
foreign source bundle to be forwarded to the HSA loader for transpilation. The forwarding
branch is intentionally evaluated before the native/generic branches, so any listed target
is routed through the comgr hotswap tool when the tool is loaded (HSA_TOOLS_LIB names
libamd_comgr_hotswap_tool.so).

Until per-target transpilation is validated, restrict kSupportedPairs to gfx1250 -> gfx1250
(B0 -> A0) only. With gfx942 and gfx950 removed from the allowlist, fatbins built for those
devices keep using their native code objects instead of being force-transpiled from a
gfx1250 source bundle -- which was crashing a large fraction of HIP workloads on gfx942.
gfx942 and gfx950 will be re-added as their transpilation paths are validated.

No selection-order or HIP_FORCE_SPIRV_CODEOBJECT changes; the no-tool path is unchanged.

Validated on gfx942 (MI300X): with the tool enabled, the full hip-tests catch suite shows
zero new failures versus the no-tool baseline; remaining failures are pre-existing and
reproduce without the tool.

ISSUE ID: ROCm#7234
@xintin xintin force-pushed the users/xintin/clr-prefer-native-over-hotswap branch from 3ccdb41 to 40fb7da Compare June 24, 2026 20:42
@xintin xintin merged commit 36e937b into ROCm:develop Jun 25, 2026
35 checks passed

5 participants