warpspeed run_to_run deterministic scan for SM90 using atomic global counter#9565
warpspeed run_to_run deterministic scan for SM90 using atomic global counter#9565srinivasyadav18 wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (6)
OverviewThis pull request extends the CUB warpspeed run-to-run deterministic implementation to support SM90 GPUs by introducing an atomic global counter mechanism for tile scheduling. The changes enable deterministic scan operations on SM90 while preserving the existing clusterlaunch-based scheduling optimizations for SM100+. Key ChangesLook-Ahead Tile State Bounds (
|
| Layer / File(s) | Summary |
|---|---|
Policy and dispatch wiring cub/cub/device/dispatch/tuning/tuning_scan.cuh, cub/cub/device/dispatch/dispatch_scan.cuh, c/parallel/src/scan.cu |
Stable-order lookahead gating shifts to SM90+, dispatch allocates tile-state plus atomic-counter storage, and init/scan kernel arguments are rewired to carry both pointers with adjusted grid sizing. |
Kernel argument structure and init kernel cub/cub/device/dispatch/kernels/kernel_scan.cuh |
Lookahead kernel arguments bundle tile-state and atomic-counter pointers in lookahead_tile_state_arg_t, and init kernel initializes tile states plus resets the atomic counter. |
Kernel scheduling and tile iteration cub/cub/device/dispatch/kernels/kernel_scan_lookahead.cuh |
Lookahead kernel params carry an atomic counter, non-SM100 scheduling claims tiles with atomicAdd, and squad dispatch now runs with explicit numTiles bounds and updated per-iteration tile-index handling. |
Warpspeed bounds and masked-store fallback cub/cub/detail/warpspeed/look_ahead.cuh, cub/cub/detail/warpspeed/squad/load_store.cuh |
Lookahead helper calls propagate num_tiles into aggregate load/store bounds checks, and masked bulk-store paths use target-selected SM100 PTX instruction or byte-wise fallback copies on other targets. |
Assessment against linked issues
| Objective | Addressed | Explanation |
|---|---|---|
Extend warpspeed run-to-run deterministic implementation to SM90 by enabling the lookahead path on SM90 hardware and adjusting stable-order gating [#9334] |
✅ | |
Provide SM90-compatible lookahead scheduling and runtime support for scan kernels [#9334] |
✅ | |
Preserve correct tile-bound handling in lookahead aggregate load/store paths [#9334] |
✅ |
Possibly related PRs
- NVIDIA/cccl#9471: Also modifies
cub/cub/detail/warpspeed/look_ahead.cuhin the incremental lookahead path—the main PR threads explicitnum_tilesfor bounds checks, while this PR updates aggregate-lane mask construction.
Suggested reviewers
- miscco
- davebayer
- fbusato
Comment @coderabbitai help to get the list of available commands.
This comment has been minimized.
This comment has been minimized.
1a42062 to
e2bb592
Compare
This comment has been minimized.
This comment has been minimized.
miscco
left a comment
There was a problem hiding this comment.
I am really not sure whether this is the right approach as it greatly affects the performance of warpspeed for all architectures.
I
Performance ReportH100_80GB_HBM3bench.exclusive_scan.deterministic (avg %Diff: +61.27%) mainly on large problem sizesH200bench.exclusive_scan.deterministic (avg %Diff: +40.93%) mainly on large problem sizesRTX_PRO_6000_Blackwell_Server_Editionbench.exclusive_scan.deterministic (avg %Diff: -0.36%)bench.exclusive_scan.sum (avg %Diff: +0.17%)bench.exclusive_scan.look_ahead.sum (avg %Diff: +0.68%)bench.exclusive_scan.custom (avg %Diff: +0.12%)B200bench.exclusive_scan.deterministic (avg %Diff: -0.18%)bench.exclusive_scan.sum (avg %Diff: +0.14%)bench.exclusive_scan.look_ahead.sum (avg %Diff: +0.32%)bench.exclusive_scan.custom (avg %Diff: +0.24%) |
🥳 CI Workflow Results🟩 Finished in 2d 15h: Pass: 100%/291 | Total: 12d 00h | Max: 2h 23m | Hits: 19%/976660See results here. |
| return error; | ||
| } | ||
|
|
||
| const int scan_grid_dim = atomic_scheduling ? ::cuda::std::min(sm_count, num_tiles) : num_tiles; |
There was a problem hiding this comment.
suggestion: try using occupancy or higher CTA count per SM if it improves performance.
| } | ||
|
|
||
| void* d_tile_state = allocations[0]; | ||
| ::cuda::std::uint32_t* d_atomic_counter = static_cast<::cuda::std::uint32_t*>(allocations[1]); |
There was a problem hiding this comment.
critical: add a test checking alignment, as previous implementation doesn't enforce alignment on temporary allocations.
There was a problem hiding this comment.
I think an assertion should be fine. Each temporary storage allocation is 256 byte aligned.
| @@ -1129,7 +1141,7 @@ CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t invoke_lookahead( | |||
| } | |||
|
|
|||
| // TODO(bgruber): we probably need to ensure alignment of d_temp_storage | |||
There was a problem hiding this comment.
suggestion: Drop once a test is added, as we already enforce aligned allocation of temporaries above.
There was a problem hiding this comment.
Can be dropped today.
| /*size*/ 16, | ||
| byteMaskStart); | ||
| } | ||
| NV_IF_ELSE_TARGET( |
There was a problem hiding this comment.
suggestion: scalar copy is duplicated in 3 cases here, consider a helper function reducing code duplication.
There was a problem hiding this comment.
agreed. I guess with "scalar" you mean the 16-byte masked copy?
| (const int rank = squad.threadRank(); if (rank < 16 && ((byteMaskStart >> rank) & 1u)) { | ||
| cpAsyncOobInfo.ptrGmemStartAlignDown[rank] = srcSmem[rank]; | ||
| })); |
There was a problem hiding this comment.
Important: Please put additional braces around the macro arguments:
| (const int rank = squad.threadRank(); if (rank < 16 && ((byteMaskStart >> rank) & 1u)) { | |
| cpAsyncOobInfo.ptrGmemStartAlignDown[rank] = srcSmem[rank]; | |
| })); | |
| ({const int rank = squad.threadRank(); if (rank < 16 && ((byteMaskStart >> rank) & 1u)) { | |
| cpAsyncOobInfo.ptrGmemStartAlignDown[rank] = srcSmem[rank]; | |
| }})); |
This helps clang-format.
bernhardmgruber
left a comment
There was a problem hiding this comment.
This PR looks like it is not only adding a deterministic scan for SM90, but also the non-deterministic version. Please point this out clearly in the PR description and update the PR title.
Otherwise, this is looking pretty good already! Please make sure that the Blackwell kernels are not impacts (no SASS changes on the scan benchmarks for SM100)
| NV_IF_ELSE_TARGET( | ||
| NV_PROVIDES_SM_100, | ||
| (if (::cuda::ptx::elect_sync(~0)) { | ||
| ::cuda::ptx::cp_async_bulk_cp_mask( | ||
| ::cuda::ptx::space_global, | ||
| ::cuda::ptx::space_shared, | ||
| cpAsyncOobInfo.ptrGmemStartAlignDown, | ||
| srcSmem, | ||
| /*size*/ 16, | ||
| byteMaskStart); | ||
| }), | ||
| (const int rank = squad.threadRank(); if (rank < 16 && ((byteMaskStart >> rank) & 1u)) { | ||
| cpAsyncOobInfo.ptrGmemStartAlignDown[rank] = srcSmem[rank]; | ||
| })); |
There was a problem hiding this comment.
Important question: Since you introduced code paths that access SMEM using regular loads and stores, is the line above:
::cuda::ptx::fence_proxy_async(::cuda::ptx::space_shared);
still valid? We need to acquire SMEM in the async proxy for the bulk copy, but now we are mixing bulk copies and regular loads from SMEM. Is this still legal?
@ahendriksen may be able to help here.
| (const int rank = squad.threadRank(); | ||
| const ::cuda::std::byte* tail_smem_source = ptrSmemMiddle + cpAsyncOobInfo.underCopySizeBytes; | ||
| if (rank < 16 && ((byteMaskEnd >> rank) & 1u)) { | ||
| cpAsyncOobInfo.ptrGmemEndAlignDown[rank] = tail_smem_source[rank]; | ||
| })); |
There was a problem hiding this comment.
Important: we compute the copy masks based on offsets and counts, so it seems counter intuitive to use them for the fallback loops here. We should just use the information we used to create the masks in the first place.
Here, this would be (I think):
| (const int rank = squad.threadRank(); | |
| const ::cuda::std::byte* tail_smem_source = ptrSmemMiddle + cpAsyncOobInfo.underCopySizeBytes; | |
| if (rank < 16 && ((byteMaskEnd >> rank) & 1u)) { | |
| cpAsyncOobInfo.ptrGmemEndAlignDown[rank] = tail_smem_source[rank]; | |
| })); | |
| (const int rank = squad.threadRank(); | |
| const ::cuda::std::byte* tail_smem_source = ptrSmemMiddle + cpAsyncOobInfo.underCopySizeBytes; | |
| if (rank < cpAsyncOobInfo.smemEndBytesAfter16BBoundary) { | |
| cpAsyncOobInfo.ptrGmemEndAlignDown[rank] = tail_smem_source[rank]; | |
| })); |
Applies to the other occurrences of this logic as well.
|
|
||
| # pragma unroll 1 | ||
| while (true) | ||
| while (idxTile < numTiles) |
There was a problem hiding this comment.
Question: Are we 100% certain this does not change SASS for any kernel on SM100? Why do we even need this, since we have nextIdxTileValid = static_cast<int>(regNextBlockIdx.x) < numTiles; later to exit the loop?
| refNextBlockIdxR.setFenceLdsToAsyncProxy(); | ||
| } | ||
| bool nextIdxTileValid = ::cuda::ptx::clusterlaunchcontrol_query_cancel_is_canceled(regNextBlockIdx); | ||
| bool nextIdxTileValid = false; |
There was a problem hiding this comment.
| bool nextIdxTileValid = false; | |
| bool nextIdxTileValid; |
| int idxTile; | ||
| NV_IF_ELSE_TARGET(NV_PROVIDES_SM_100, (idxTile = specialRegisters.blockIdxX;), ({ | ||
| __shared__ int s_first_tile; | ||
| if (specialRegisters.threadIdxX == 0) | ||
| { | ||
| s_first_tile = static_cast<int>(::atomicAdd(params.atomicCounter, 1u)); | ||
| } | ||
| __syncthreads(); | ||
| idxTile = s_first_tile; | ||
| })); |
There was a problem hiding this comment.
Suggestion: Consider using an IILE (immediately invoked lambda expression) here.
| bool nextIdxTileValid = false; | ||
| NV_IF_ELSE_TARGET( | ||
| NV_PROVIDES_SM_100, | ||
| (nextIdxTileValid = ::cuda::ptx::clusterlaunchcontrol_query_cancel_is_canceled(regNextBlockIdx);), | ||
| (nextIdxTileValid = static_cast<int>(regNextBlockIdx.x) < numTiles;)); |
There was a problem hiding this comment.
Suggestion: Consider an IILE here.
| } | ||
| if (cc >= ::cuda::compute_capability{9, 0} && require_stable_reduction_order) | ||
| { | ||
| return get_sm100_fallback_lookahead_policy(); |
There was a problem hiding this comment.
Important: Please add a TODO comment here that we should tune for Hopper:
| return get_sm100_fallback_lookahead_policy(); | |
| // TODO(srinivas): tune for Hopper, using Blackwell default tunings for now. | |
| return get_sm100_fallback_lookahead_policy(); |
| } | ||
|
|
||
| void* d_tile_state = allocations[0]; | ||
| ::cuda::std::uint32_t* d_atomic_counter = static_cast<::cuda::std::uint32_t*>(allocations[1]); |
There was a problem hiding this comment.
I think an assertion should be fine. Each temporary storage allocation is 256 byte aligned.
| @@ -1129,7 +1141,7 @@ CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t invoke_lookahead( | |||
| } | |||
|
|
|||
| // TODO(bgruber): we probably need to ensure alignment of d_temp_storage | |||
There was a problem hiding this comment.
Can be dropped today.
| if (d_temp_storage == nullptr) | ||
| { | ||
| return cudaSuccess; | ||
| } |
There was a problem hiding this comment.
Important: The check here is correct, but where did the num_items == 0 and early return go? This should be retained.
Description
closes #9334
Checklist