[hotswap][clr] Restrict HotSwap forwarding to validated source/target pairs #7715
Merged
xintin merged 2 commits on Jun 25, 2026
Pull request overview
Reorders fatbinary bundle selection in FatBinaryInfo::ExtractFatBinaryUsingCOMGR so that when the COMGR HotSwap tool is enabled, CLR loads an available native code object first (then generic), and only forwards a foreign-ISA source bundle for HotSwap transpilation if neither native nor generic matches the current device. This avoids unnecessary forwarding/transpilation that can crash workloads when a correct native code object is already present.
Changes:
- Prefer native code object selection over HotSwap forwarding.
- Fall back to generic code object selection next.
- Only forward a foreign-ISA source bundle for HotSwap transpilation as a last resort (when no native/generic match).
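The ordering in this overview can be sketched as follows. This is a minimal illustration only: `Bundle`, `BundleKind`, `Selection`, `FindKind`, and `SelectBundle` are hypothetical names invented here, not CLR's actual types or the code inside FatBinaryInfo::ExtractFatBinaryUsingCOMGR.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not CLR's actual data structures.
enum class BundleKind { Native, Generic, ForeignSource };

struct Bundle {
  std::string isa;  // e.g. "gfx942"
  BundleKind kind;  // how this bundle relates to the current device
};

struct Selection {
  Bundle bundle;
  bool forwarded;  // true if the bundle is handed on for HotSwap transpilation
};

// Returns the first bundle of the requested kind, if any.
static std::optional<Bundle> FindKind(const std::vector<Bundle>& bundles,
                                      BundleKind kind) {
  for (const auto& b : bundles) {
    if (b.kind == kind) return b;
  }
  return std::nullopt;
}

// Native first, then generic; a foreign-ISA source is forwarded only when
// the tool is enabled and neither of the other two matched.
std::optional<Selection> SelectBundle(const std::vector<Bundle>& bundles,
                                      bool hotswap_tool_enabled) {
  if (auto b = FindKind(bundles, BundleKind::Native)) return Selection{*b, false};
  if (auto b = FindKind(bundles, BundleKind::Generic)) return Selection{*b, false};
  if (hotswap_tool_enabled) {
    if (auto b = FindKind(bundles, BundleKind::ForeignSource)) {
      return Selection{*b, true};
    }
  }
  return std::nullopt;
}
```

Under these assumptions, a fatbin holding both a native and a foreign-ISA bundle always resolves to the native one, regardless of whether the tool is enabled.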
Force-pushed from 84595de to bc72eff
We need to update the tool name to libhsa-hotswap.so
This was referenced Jun 24, 2026
lamb-j approved these changes Jun 24, 2026
gandryey approved these changes Jun 24, 2026
nirmie approved these changes Jun 24, 2026
… pairs
HotSwap foreign-ISA forwarding in FatBinaryInfo::ExtractFatBinaryUsingCOMGR is gated by amd::hotswap::kSupportedPairs: a (source -> device target) pair must be listed for a foreign source bundle to be forwarded to the HSA loader for transpilation. The forwarding branch is intentionally evaluated before the native/generic branches, so any listed target is routed through the comgr hotswap tool when the tool is loaded (HSA_TOOLS_LIB names libamd_comgr_hotswap_tool.so).
Until per-target transpilation is validated, restrict kSupportedPairs to gfx1250 -> gfx1250 (B0 -> A0) only. With gfx942 and gfx950 removed from the allowlist, fatbins built for those devices keep using their native code objects instead of being force-transpiled from a gfx1250 source bundle -- which was crashing a large fraction of HIP workloads on gfx942. gfx942 and gfx950 will be re-added as their transpilation paths are validated.
No selection-order or HIP_FORCE_SPIRV_CODEOBJECT changes; the no-tool path is unchanged.
Validated on gfx942 (MI300X): with the tool enabled, the full hip-tests catch suite shows zero new failures versus the no-tool baseline; remaining failures are pre-existing and reproduce without the tool.
ISSUE ID: ROCm#7234
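The allowlist gate in this commit message might look roughly like the sketch below. Everything here is an assumption made for illustration: `IsaPair`, `IsSupportedPair`, `Choice`, and `Choose` are hypothetical names, the tables are stand-ins for amd::hotswap::kSupportedPairs rather than its real definition, and `kPreviousPairs` is reconstructed from the description of the old behavior.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical (source -> device target) entry; the real table's element
// type is not shown in this PR.
struct IsaPair {
  std::string_view source;
  std::string_view target;
};

// Restricted allowlist as described: gfx1250 -> gfx1250 only.
constexpr IsaPair kSupportedPairs[] = {{"gfx1250", "gfx1250"}};

// Assumed shape of the allowlist before this change (reconstruction).
constexpr IsaPair kPreviousPairs[] = {
    {"gfx1250", "gfx1250"}, {"gfx1250", "gfx942"}, {"gfx1250", "gfx950"}};

// True if (source -> target) is listed in the given table.
template <std::size_t N>
constexpr bool IsSupportedPair(const IsaPair (&table)[N],
                               std::string_view source,
                               std::string_view target) {
  for (const auto& p : table) {
    if (p.source == source && p.target == target) return true;
  }
  return false;
}

enum class Choice { ForwardForHotSwap, UseNative };

// Forwarding is checked before the native branch, so a listed pair wins
// whenever the tool is loaded. Assumes the fatbin also holds a native
// code object for the device.
template <std::size_t N>
constexpr Choice Choose(const IsaPair (&table)[N], bool tool_loaded,
                        std::string_view foreign_source,
                        std::string_view device) {
  if (tool_loaded && IsSupportedPair(table, foreign_source, device)) {
    return Choice::ForwardForHotSwap;
  }
  return Choice::UseNative;
}

// With the old table, a gfx942 device is forced through the tool;
// with the restricted table it keeps its native code object.
static_assert(Choose(kPreviousPairs, true, "gfx1250", "gfx942") ==
              Choice::ForwardForHotSwap);
static_assert(Choose(kSupportedPairs, true, "gfx1250", "gfx942") ==
              Choice::UseNative);
```

Because the forwarding check runs first, trimming the table is enough to change the outcome on gfx942 without touching the selection order.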
Force-pushed from 3ccdb41 to 40fb7da
Motivation
HotSwap foreign-ISA forwarding in FatBinaryInfo::ExtractFatBinaryUsingCOMGR is gated by amd::hotswap::kSupportedPairs: a (source -> device target) pair must be listed for a foreign source bundle to be forwarded to the HSA loader for transpilation. The forwarding branch is intentionally evaluated before the native/generic branches, so any listed target is routed through the comgr hotswap tool whenever the tool is loaded (HSA_TOOLS_LIB names libamd_comgr_hotswap_tool.so).
With gfx942 and gfx950 in the allowlist, fatbins built for those devices (which also contain a gfx1250 bundle) were force-transpiled from the gfx1250 source instead of using their already-correct native code object. That needlessly invokes the tool and crashes a large fraction of HIP workloads on gfx942.
Until per-target transpilation is validated, restrict kSupportedPairs to gfx1250 -> gfx1250 (B0 -> A0) only. gfx942 and gfx950 will be re-added as their transpilation paths are validated. No selection-order or HIP_FORCE_SPIRV_CODEOBJECT changes; the no-tool path is unchanged. Related: #7234.
JIRA ID
NA
Test Plan
Built ROCm via TheRock with the hotswap tool on gfx942 (MI300X). Ran the hip-tests catch suite (the full 4191 tests plus a 1048-test shard) with and without HSA_TOOLS_LIB set, comparing pass/fail results.
Will also run through TheRock CI before merging.
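The pass/fail comparison in this plan amounts to a set difference: a failure counts as new only if it appears with the tool enabled but not in the no-tool baseline. A minimal sketch, assuming the failing test names from each run have already been collected into sets (`NewFailures` is a hypothetical helper, not part of hip-tests or TheRock):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Returns the tests that fail with the tool enabled but pass (or are absent)
// in the no-tool baseline. std::set keeps names sorted, which is what
// std::set_difference requires of its inputs.
std::vector<std::string> NewFailures(const std::set<std::string>& with_tool,
                                     const std::set<std::string>& baseline) {
  std::vector<std::string> out;
  std::set_difference(with_tool.begin(), with_tool.end(),
                      baseline.begin(), baseline.end(),
                      std::back_inserter(out));
  return out;
}
```

An empty result corresponds to "zero new failures versus the no-tool baseline"; failures present in both sets are the pre-existing ones.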
Test Result
With the tool enabled, the suite shows zero new failures versus the no-tool baseline; the remaining failures are pre-existing and reproduce without the tool. No "HotSwap: forwarding" occurs on gfx942 (the allowlist no longer matches that target).
Submission Checklist