Dry Run Protocol#2961
Conversation
…mory Introduce a dry-run execution framework that replaces device and host memory resources with lightweight fake allocators to measure peak memory usage without holding real memory. New files: - dry_run_memory_resource.hpp: dry_run_allocator (lock-free bump allocator), dry_run_device_memory_resource, dry_run_host_memory_resource, dry_run_resource_manager (RAII), and dry_run_execute() helper. - dry_run_flag.hpp: boolean dry-run flag as a raft resource, allowing algorithms to skip kernel execution during profiling. - tests/util/dry_run_memory_resource.cpp: unit tests. The dry_run_allocator probes the upstream once to obtain a base address, then atomically bumps a pointer for each allocation — no mutex, no map, no real memory held after the initial probe.
…pinned_memory_resource Add pinned and managed resources to the raft::resources handle to make it possible to customize / temporarily replace these resources
…aking change due to transitive includes in downstream libraries
Merges Remove deprecated headers (NVIDIA#2939). Conflict resolutions: - rsvd.cuh: Use new mdspan-based raft::matrix::sqrt and reciprocal APIs (they have internal dry-run guards); kept cudaMemsetAsync guard - svd.cuh: Use raft::matrix::weighted_sqrt (has internal dry-run guard) - matrix.cuh: Accept deletion (deprecated, removed in main) Co-authored-by: Cursor <cursoragent@cursor.com>
|
|
||
| // Dry-run guard: raft::copy is a pure data-movement utility with no | ||
| // allocations that callers would need tracked. | ||
| if (resource::get_dry_run_flag(res)) { return; } |
There was a problem hiding this comment.
Oh, I didn't realize we were adding this as a new resource. This would make it hard to use the dry-run for pre-initializing resources.
There was a problem hiding this comment.
Yes, but that's fine! We can still push the initialized resources back to the original resources handle on destruction of the dry run resources
There was a problem hiding this comment.
NB: with #3052 , resources initialized in dry run mode will be automatically shared back with the input resources.
…substituted and recover afterwards
|
@coderabbitai review |
✅ Action performedReview finished.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughIntroduces a dry-run memory profiling protocol to RAFT. A new ChangesDry-Run Memory Profiling Protocol
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
cpp/include/raft/matrix/slice.cuh (1)
52-60: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Keep the public input validation before the dry-run exit.
Line 52 returns before the layout and bounds checks, so dry-run silently accepts slices that the real call would reject. That makes dry-run behavior diverge at the public API boundary.
Move the dry-run return after the
RAFT_EXPECTS(...)block and beforedetail::sliceMatrix(...).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/matrix/slice.cuh` around lines 52 - 60, The public validation in sliceMatrix is being skipped during dry-run because the early resource::get_dry_run_flag(handle) return comes before the RAFT_EXPECTS checks. Move the dry-run exit to after the existing layout and bounds validation in sliceMatrix, so is_row_or_column_major and the row/col extent checks always run before any return, and only detail::sliceMatrix is bypassed in dry-run mode.cpp/include/raft/sparse/convert/detail/bitset_to_csr.cuh (1)
123-147: 🩺 Stability & Availability | 🔴 CriticalCRITICAL: Wrap both CUB scan calls in
RAFT_CUDA_TRY.
cub::DeviceScan::ExclusiveSumreturnscudaError_t; leaving either call unchecked can hide a failed workspace query or scan until later stream work.Suggested fix
- cub::DeviceScan::ExclusiveSum( - nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream); + RAFT_CUDA_TRY(cub::DeviceScan::ExclusiveSum( + nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream)); @@ - cub::DeviceScan::ExclusiveSum( - scan_ws.data(), scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream); + RAFT_CUDA_TRY(cub::DeviceScan::ExclusiveSum( + scan_ws.data(), scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/sparse/convert/detail/bitset_to_csr.cuh` around lines 123 - 147, Both cub::DeviceScan::ExclusiveSum calls in bitset_to_csr need to be checked with RAFT_CUDA_TRY so CUDA failures are surfaced immediately. Wrap the workspace-size query call and the actual scan call in RAFT_CUDA_TRY, keeping the existing scan_ws_bytes and scan_ws usage intact, and make sure the fix is applied in the bitset-to-CSR conversion path near calc_nnz_by_rows and csr initialization.Sources: Coding guidelines, Path instructions
cpp/include/raft/spectral/detail/matrix_wrappers.hpp (1)
106-118: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftHIGH:
nrm1()now undercounts dry-run memory.Returning
0beforethrust::reduceskips any temporary storage that the allocator-backed Thrust policy would request, so dry-run can underestimate memory duringmodularity_matrix_tconstruction. Have you considered keeping a temp-storage sizing/probe path here instead of short-circuiting before the reduce setup? As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/spectral/detail/matrix_wrappers.hpp` around lines 106 - 118, `nrm1()` is short-circuiting on dry-run before Thrust sets up its reduction, which skips allocator-backed temporary storage accounting and underestimates memory during `modularity_matrix_t` construction. Update `raft::spectral::detail::matrix_wrappers::nrm1()` to preserve the Thrust reduce/setup path for dry-run memory probing, and only avoid the actual computation if needed after temp-storage sizing has been recorded; keep the `thrust::reduce`/`thrust_policy` path reachable so rmm/raft allocations are still tracked.Source: Path instructions
cpp/include/raft/random/detail/rmat_rectangular_generator.cuh (1)
209-224: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run bypasses
thetashape validationThis
returnnow happens before the host-sidetheta.extent(0)check, so dry-run accepts inputs that the real call would reject. That makes dry-run mask integration bugs instead of preserving the API contract.As per coding guidelines, public APIs should “validate their metadata wherever possible,” and the dry-run guard should only skip the CUDA work.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/random/detail/rmat_rectangular_generator.cuh` around lines 209 - 224, The dry-run early return in rmat_rectangular_gen is bypassing the host-side theta shape validation, so invalid metadata can slip through. Move the theta.extent(0) check so it runs before the resource::get_dry_run_flag(handle) return, while still keeping the CUDA work skipped in dry-run. Keep the existing validation logic in the rmat_rectangular_gen path and ensure dry-run preserves the same API contract as a real call.Sources: Coding guidelines, Path instructions
cpp/include/raft/stats/contingency_matrix.cuh (1)
115-128: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy liftHIGH:
get_input_class_cardinality()leaves host outputs indeterminate in dry-runThis early return skips both pointer validation and the writes to
minLabel/maxLabel. Callers use this helper to sizeout_matbeforecontingency_matrix(), so the documented workflow is no longer safe under dry-run.As per path instructions, dry-run support should keep
raft::resourcesentry points safe to call, and the developer guide still expects metadata/host-side contract checks where possible.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/stats/contingency_matrix.cuh` around lines 115 - 128, `get_input_class_cardinality()` currently returns immediately on dry-run, leaving `minLabel` and `maxLabel` unset and skipping their validation. Update this helper so the host outputs are always initialized to safe values (or otherwise deterministically populated) even when `resource::get_dry_run_flag(handle)` is true, while keeping the existing `RAFT_EXPECTS` checks and the `detail::getInputClassCardinality` path intact for normal execution. Reference the `get_input_class_cardinality`, `resource::get_dry_run_flag`, and `detail::getInputClassCardinality` symbols when adjusting the control flow.Sources: Coding guidelines, Path instructions
cpp/include/raft/stats/detail/homogeneity_score.cuh (1)
46-56: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run now reports perfect homogeneity for non-empty inputs.
When
dry_runis true,mutual_info_score(...)andentropy(...)both short-circuit to0, so the existingcomputedEntropy == 0 ? 1.0 : ...fallback returns a perfect score without doing any work. That makes dry-run observably wrong for any caller that reads the result during profiling. Consider short-circuitingdry_runin this wrapper before the entropy-zero special case and returning a neutral sentinel such as0.0.Suggested fix
double homogeneity_score(bool dry_run, const T* truthClusterArray, const T* predClusterArray, int size, T lowerLabelRange, T upperLabelRange, cudaStream_t stream) { if (size == 0) return 1.0; + if (dry_run) return 0.0; double computedMI, computedEntropy; computedMI = mutual_info_score( dry_run, truthClusterArray, predClusterArray, size, lowerLabelRange, upperLabelRange, stream); computedEntropy = entropy(dry_run, truthClusterArray, size, lowerLabelRange, upperLabelRange, stream);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/stats/detail/homogeneity_score.cuh` around lines 46 - 56, The homogeneity wrapper is returning a perfect score during dry-run because both `mutual_info_score(...)` and `entropy(...)` can short-circuit to zero, which then triggers the `computedEntropy == 0` fallback in `homogeneity_score(...)`. Update this logic so `dry_run` is handled explicitly before the entropy-zero branch, and return a neutral sentinel such as `0.0` instead of `1.0` when no real computation is performed. Keep the change localized to the `homogeneity_score` path in this header and preserve the existing non-dry-run behavior for normal inputs.cpp/include/raft/stats/detail/meanvar.cuh (1)
203-207: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winHIGH: Dry-run still queries CUDA occupancy in the row-major path.
Line 204 and Line 207 execute CUDA/device queries before the
if (!dry_run)guard, someanvar(..., dry_run=true, rowMajor=true)still touches the CUDA runtime. That breaks the dry-run contract and can fail on setups using dry-run specifically to avoid GPU work.Suggested fix
- int occupancy; - RAFT_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor( - &occupancy, meanvar_kernel_rowmajor<T, I, BlockSize>, BlockSize, 0)); - gs.y = - std::min(gs.y, raft::ceildiv<decltype(gs.y)>(occupancy * getMultiProcessorCount(), gs.x)); + if (!dry_run) { + int occupancy; + RAFT_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor( + &occupancy, meanvar_kernel_rowmajor<T, I, BlockSize>, BlockSize, 0)); + gs.y = std::min( + gs.y, raft::ceildiv<decltype(gs.y)>(occupancy * getMultiProcessorCount(), gs.x)); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/stats/detail/meanvar.cuh` around lines 203 - 207, The row-major dry-run path in meanvar still performs CUDA runtime queries before the dry_run check, so move the occupancy and multiprocessor-count logic behind the existing dry_run guard. Update the meanvar row-major branch in meanvar.cuh so meanvar(..., dry_run=true, rowMajor=true) skips cudaOccupancyMaxActiveBlocksPerMultiprocessor and getMultiProcessorCount entirely, only computing gs.y when not in dry-run mode.Source: Path instructions
cpp/include/raft/core/host_mdarray.hpp (1)
219-228: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run should not suppress host-scalar initialization.
This write is host-side only, so skipping it leaves
make_host_scalar(res, v)uninitialized under dry-run and can make host-side branching or workspace sizing diverge from real execution. Consider keeping the assignment unconditional here.Suggested fix
- if (!resource::get_dry_run_flag(res)) { scalar(0) = v; } + scalar(0) = v;As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking
raft::resourcesshould remain safe to call.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/host_mdarray.hpp` around lines 219 - 228, The host-side initialization in make_host_scalar is being skipped when resource::get_dry_run_flag(res) is set, which leaves the returned host_scalar uninitialized during dry-run. Update make_host_scalar in host_mdarray.hpp so the assignment to scalar(0) = v happens unconditionally after construction, while keeping any CUDA or expensive work guarded elsewhere; this preserves correct host-side behavior without breaking dry-run safety.Source: Path instructions
cpp/include/raft/core/pinned_mdarray.hpp (1)
115-122: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run should not skip pinned-scalar initialization.
This is still a host-side write. Guarding it leaves
make_pinned_scalar(handle, v)undefined in dry-run and can change host-visible control flow without improving allocation tracking.Suggested fix
- if (!resource::get_dry_run_flag(handle)) { scalar(0) = v; } + scalar(0) = v;As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking
raft::resourcesshould remain safe to call.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/pinned_mdarray.hpp` around lines 115 - 122, The dry-run guard in make_pinned_scalar is incorrectly skipping the host-side assignment to the pinned_scalar, leaving the returned scalar uninitialized in dry-run mode. Remove the conditional around the scalar initialization in make_pinned_scalar so the value is always written after constructing pinned_scalar<ElementType>, while keeping any CUDA/allocation-related behavior safe through the existing resource/policy path.Source: Path instructions
cpp/include/raft/core/managed_mdarray.hpp (1)
115-122: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run should still initialize managed scalars.
managed_scalaris host-accessible, so guarding this write can leave the value undefined during dry-run and make any host-side logic that inspects it diverge from the real path. The allocation is already preserved, so this assignment should stay unconditional.Suggested fix
- if (!resource::get_dry_run_flag(handle)) { scalar(0) = v; } + scalar(0) = v;As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking
raft::resourcesshould remain safe to call.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/managed_mdarray.hpp` around lines 115 - 122, The make_managed_scalar helper is conditionally skipping initialization during dry-run, which can leave managed_scalar<ElementType> uninitialized even though it is host-accessible. Update make_managed_scalar so the scalar(0) = v assignment is unconditional, while keeping the allocation path safe for dry-run; use the existing make_managed_scalar and managed_scalar<ElementType> symbols to locate the write and remove the resource::get_dry_run_flag(handle) guard around it.Source: Path instructions
🟠 Major comments (26)
cpp/include/raft/linalg/detail/norm.cuh-65-67 (1)
65-67: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH:
colNormCaller<LinfNorm>bypasses dry-run.Line 67 passes a hardcoded
false, so column-wiseLinfNormstill runs the reduction path during dry-run instead of only tracking allocations.Suggested fix
} else if constexpr (norm_type == LinfNorm) { reduce<rowMajor, false, Type, OutType, IdxType>( - false, dots, data, D, N, (OutType)0, stream, false, raft::abs_op(), raft::max_op(), fin_op); + dry_run, dots, data, D, N, (OutType)0, stream, false, raft::abs_op(), raft::max_op(), fin_op); } else {As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via
resource::get_dry_run_flag(res).”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/linalg/detail/norm.cuh` around lines 65 - 67, `colNormCaller<LinfNorm>` is still invoking the reduction path during dry-run because it passes a hardcoded false into `reduce`; update this branch to follow the same dry-run guard used elsewhere by checking `resource::get_dry_run_flag(res)` before calling `reduce`. Keep the fix localized in `colNormCaller` within norm.cuh, and ensure the `LinfNorm` path only performs allocation tracking when dry-run is enabled rather than executing the CUDA reduction.Source: Path instructions
cpp/include/raft/linalg/strided_reduction.cuh-132-132 (1)
132-132: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Dry-run masks invalid
strided_reductioncalls.Line 132 returns before the output-size checks and before the unsupported-type guard inside
stridedReduction(...), so a dry-run can succeed for a call that the real path rejects immediately. Have you considered leaving the validation/type gate in place and skipping only the final launch?As per coding guidelines, Public Interface says to “validate their metadata wherever possible.” Based on path instructions, dry-run should guard expensive/CUDA work via
resource::get_dry_run_flag(res), not bypass cheap host-side validation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/linalg/strided_reduction.cuh` at line 132, The dry-run early return in stridedReduction is skipping host-side validation, so invalid calls can appear to succeed. Keep the metadata/output-size checks and unsupported-type guard in place in stridedReduction, and use resource::get_dry_run_flag(handle) only to skip the expensive CUDA launch path after validation has already run.Sources: Coding guidelines, Path instructions
cpp/include/raft/linalg/add.cuh-105-107 (1)
105-107: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Keep public mdspan validation outside the dry-run fast path.
Lines 107, 145, and 181 return before the contiguity/size
RAFT_EXPECTSchecks, so dry-run can report success for malformed inputs that the real execution path still rejects. Have you considered moving the dry-run guard to just before the actual kernel dispatch instead?As per coding guidelines, Public Interface says to “validate their metadata wherever possible.” Based on path instructions, dry-run should guard expensive/CUDA work via
resource::get_dry_run_flag(res), not bypass cheap host-side validation.Also applies to: 145-145, 181-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/linalg/add.cuh` around lines 105 - 107, The dry-run early return in add and the other public overloads is bypassing the mdspan contiguity and size RAFT_EXPECTS checks, so malformed inputs can incorrectly succeed. Move the resource::get_dry_run_flag(handle) guard in add, and the equivalent guards in the other overloads, to just before the actual kernel dispatch so host-side validation always runs while only the expensive CUDA work is skipped.Sources: Coding guidelines, Path instructions
cpp/include/raft/matrix/sqrt.cuh-37-40 (1)
37-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Preserve the shape checks in dry-run mode.
Both out-of-place overloads return before
RAFT_EXPECTS(in.size() == out.size()), so dry-run hides mismatched shapes that real execution would fail on.Validate sizes first, then short-circuit before launching the math kernels.
Also applies to: 76-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/matrix/sqrt.cuh` around lines 37 - 40, The out-of-place sqrt overloads in the sqrt.cuh API are returning early on resource::get_dry_run_flag(handle) before the RAFT_EXPECTS size check, which lets mismatched input/output shapes slip through in dry-run mode. Move the shape validation in the affected sqrt overloads (the out-of-place paths around the seqRoot call) so RAFT_EXPECTS(in.size() == out.size()) runs before any dry-run short-circuit, then keep the dry-run return only after validation and before launching the kernel.cpp/include/raft/matrix/detail/select_k-inl.cuh-131-132 (1)
131-132: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: This dry-run guard is too deep for the sorted radix path.
When
sorted == true, the caller still launchesraft::linalg::map_offset(...)to populateoffsetsbefore reaching this return, so dry-run still performs CUDA work on the radix+sort path. That breaks the dry-run contract even though the actual segmented sort is skipped.As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via
resource::get_dry_run_flag(res).”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/matrix/detail/select_k-inl.cuh` around lines 131 - 132, The dry-run guard in the select-k flow is placed too late, so the sorted radix path still executes CUDA work through raft::linalg::map_offset before returning. Move the resource::get_dry_run_flag(handle) check earlier in the select-k implementation, before any path that can launch CUDA work, and ensure the guard covers the sorted==true branch as well as the unsorted path. Use the select-k logic around raft::linalg::map_offset and the surrounding sorted radix handling to locate the fix.Source: Path instructions
cpp/include/raft/matrix/detail/select_warpsort.cuh-1193-1193 (1)
1193-1193: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: This early return under-reports dry-run memory for warpsort.
select_k_()now does the right thing by allocatingtmp_val/tmp_idxbefore checkingdry_run, but Line 1193 returns before that helper is reached. Any caller using this overload in dry-run will miss those allocations and get a smaller peak-memory estimate.As per path instructions, “Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics.”
Suggested fix
- if (resource::get_dry_run_flag(res)) { return; } ASSERT(k <= kMaxCapacity, "Current max k is %d (requested %d)", kMaxCapacity, k);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/matrix/detail/select_warpsort.cuh` at line 1193, The dry-run guard in the warpsort path is skipping allocation accounting, so peak-memory estimates are too low. Update the overload in select_warpsort.cuh around the select_k_() call site to avoid returning before the helper runs, and let the tmp_val/tmp_idx allocation attempts be reached even when resource::get_dry_run_flag(res) is true. Keep the dry-run behavior in the downstream logic, but do not guard the rmm/raft-backed allocations that must be tracked for memory statistics.Source: Path instructions
cpp/include/raft/sparse/solver/detail/lanczos.cuh-198-227 (1)
198-227: 🩺 Stability & Availability | 🟠 MajorHIGH: Use the RAFT Thrust policy here.
thrust::sequenceandthrust::sortonthrust::devicecan run outsidehandle’s stream, so the laterraft::matrix::gather(...)calls may read unsynchronized indices. Useraft::resource::get_thrust_policy(handle)here instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/sparse/solver/detail/lanczos.cuh` around lines 198 - 227, The index initialization and sorting in the lanczos selection path currently uses the generic thrust::device policy, which can bypass the handle stream and leave later gather operations racing unsynchronized data. Update the thrust::sequence and both thrust::sort calls in this selection block to use the RAFT Thrust policy from raft::resource::get_thrust_policy(handle), keeping the existing indices/selected_indices logic intact so all work stays ordered on the handle’s stream.Sources: Coding guidelines, Path instructions
cpp/include/raft/solver/detail/lap_functions.cuh-267-276 (1)
267-276: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Zero-cover dry-run undercounts its peak workspace.
executeZeroCover()now returns beforezeroCoverIteration()is reached, and the dry-run branch insidezeroCoverIteration()only sizes the CSR buffers. The real path also allocatespredicates_vandaddresses_vatSP * N, so the dry-run report will miss the dominant memory in this phase.As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics," and "early-return dry-run guards don’t accidentally skip required resource setup/teardown..."
Also applies to: 353-353
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/solver/detail/lap_functions.cuh` around lines 267 - 276, The dry-run path in zero-cover is undercounting workspace because `executeZeroCover()` now exits before `zeroCoverIteration()`, and `zeroCoverIteration()`’s dry-run branch only allocates the CSR buffers while skipping the `predicates_v` and `addresses_v` allocations that dominate memory use. Update `executeZeroCover`/`zeroCoverIteration` so dry-run still performs the same allocation tracking as the real path, including the `predicates_v` and `addresses_v` `rmm::device_uvector` allocations, while still returning early before any compute; keep the behavior aligned with the existing `csr_ptrs_v`/`csr_neighbors_v` setup so memory statistics remain accurate.Source: Path instructions
cpp/include/raft/sparse/linalg/detail/spmm.hpp-90-95 (1)
90-95: 🩺 Stability & Availability | 🟠 MajorHIGH:
bufferSizeis a byte count, buttmpis allocated asValueTypeelements.
cusparsespmm_bufferSizereturns bytes, so this over-allocates the workspace bysizeof(ValueType)and skews dry-run memory accounting. Use a byte buffer or convert the count to elements before allocating.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/sparse/linalg/detail/spmm.hpp` around lines 90 - 95, The workspace allocation in the sparse SpMM path is using `bufferSize` as if it were a count of `ValueType` elements, but `cusparsespmm_bufferSize` returns bytes. Update the allocation in the `spmm`/workspace setup around `tmp` to treat the buffer as raw bytes or to convert bytes to element count before constructing the `rmm::device_uvector`, and make sure the dry-run path still accounts for the same byte-sized workspace.cpp/include/raft/linalg/detail/lstsq.cuh-213-216 (1)
213-216: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winHIGH: Clean up
gesvdj_paramson the non-dry-run path too.Issue: the dry-run branch destroys
gesvdj_params, but the normal path exits after Line 244 withoutcusolverDnDestroyGesvdjInfo.
Why: repeatedlstsqSvdJacobicalls leak cuSOLVER resources.As per path instructions,
cpp/REVIEW_GUIDELINES.mdrequires checking resource cleanup on all paths.Suggested fix
raft::linalg::gemv(handle, U, n_rows, minmn, b, Ub, true, stream); raft::linalg::binaryOp(Ub, Ub, S, minmn, DivideByNonZero<math_t>(), stream); raft::linalg::gemv(handle, V, n_cols, minmn, Ub, w, false, stream); + RAFT_CUSOLVER_TRY(cusolverDnDestroyGesvdjInfo(gesvdj_params)); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/linalg/detail/lstsq.cuh` around lines 213 - 216, The `lstsqSvdJacobi` flow cleans up `gesvdj_params` only in the `resource::get_dry_run_flag(handle)` branch, so add matching `cusolverDnDestroyGesvdjInfo(gesvdj_params)` cleanup on the normal execution path before the function returns. Make sure every exit path in `lstsq.cuh` that creates `gesvdj_params` also destroys it, using the same `gesvdj_params` symbol so repeated calls do not leak cuSOLVER resources.Source: Path instructions
cpp/include/raft/stats/detail/scores.cuh-162-184 (1)
162-184: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winHIGH: Use element counts for
device_uvectorand avoid untracked host allocation in dry-run.Issue:
device_uvector<double>sizes are element counts, son * sizeof(double)allocates 8× too many doubles; dry-run also allocatesh_sorted_abs_diffs(n)viastd::vectorbefore returning.
Why: this over-reports dry-run memory and can cause avoidable OOM.As per path instructions, dry-run must keep rmm/raft allocation tracking accurate while skipping unnecessary work.
Suggested fix
- int array_size = n * sizeof(double); - rmm::device_uvector<double> abs_diffs_array(array_size, stream); - rmm::device_uvector<double> sorted_abs_diffs(array_size, stream); - rmm::device_uvector<double> tmp_sums(2 * sizeof(double), stream); + rmm::device_uvector<double> abs_diffs_array(n, stream); + rmm::device_uvector<double> sorted_abs_diffs(n, stream); + rmm::device_uvector<double> tmp_sums(2, stream); @@ - std::vector<double> mean_errors(2); - std::vector<double> h_sorted_abs_diffs(n); int thread_cnt = 256; int block_cnt = raft::ceildiv(n, thread_cnt); if (dry_run) { return; } + + std::vector<double> mean_errors(2); + std::vector<double> h_sorted_abs_diffs(n);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/stats/detail/scores.cuh` around lines 162 - 184, The allocation logic in the scores computation is using byte counts where `rmm::device_uvector<double>` expects element counts, so `abs_diffs_array`, `sorted_abs_diffs`, and `tmp_sums` should be sized in doubles rather than `sizeof(double)` multiples; fix this in the relevant scoring routine before the CUB `SortKeys` call. Also move the host-side `std::vector<double> h_sorted_abs_diffs` allocation behind the `dry_run` early return (or otherwise skip it during dry-run) so dry-run only accounts for tracked rmm/raft device allocations and does not create untracked host memory.Source: Path instructions
cpp/include/raft/random/detail/multi_variable_gaussian.cuh-188-190 (1)
188-190: 🩺 Stability & Availability | 🟠 MajorHIGH: Guard MVG setup before the constructor does CUDA work
cpp/include/raft/random/detail/multi_variable_gaussian.cuh:146-190Issue:
give_gaussian()exits in dry-run, butbuild_multi_variable_gaussian_token_impl()still constructsmulti_variable_gaussian_impl, which creates the CURAND generator and runs cuSOLVER buffer-size calls first.
Impact: dry-run paths still execute CUDA/library setup and can fail before reaching the guard.Have you considered making the token/constructor setup dry-run-aware and deferring only the workspace accounting?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/random/detail/multi_variable_gaussian.cuh` around lines 188 - 190, The dry-run guard in give_gaussian() is too late because multi_variable_gaussian_impl is still constructed earlier by build_multi_variable_gaussian_token_impl(), which triggers CURAND and cuSOLVER setup before the early return. Make the token/constructor path dry-run-aware by checking resource::get_dry_run_flag(handle) before creating multi_variable_gaussian_impl, and defer only the workspace accounting needed for the token so dry-run execution avoids all CUDA/library initialization.Source: Path instructions
cpp/include/raft/stats/detail/adjusted_rand_index.cuh-125-150 (1)
125-150: 🩺 Stability & Availability | 🟠 MajorUse a wider type for contingency sizing
Issue:
nUniqClasses * nUniqClassesis evaluated inMathT; with the defaultMathT=int, it overflows past ~46k classes/samples.
Why: dry-run setsnUniqClasses = size, so large inputs can turn the allocation size into a bogus value before any work runs. The same risk exists in the non-dry-run path when the label range is large.
Consider computing the extent insize_t(or another wider type) before allocating the contingency buffers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/stats/detail/adjusted_rand_index.cuh` around lines 125 - 150, The contingency buffer sizing in adjusted_rand_index.cuh is computed in MathT, which can overflow when nUniqClasses is large, especially in the dry_run path where it can equal size. Update the sizing logic around the dContingencyMatrix allocation and the nUniqClasses calculation in the adjusted rand index helper to use a wider extent type such as size_t before multiplying, then cast safely for the device allocation. Keep the existing dry_run and countUnique flow intact, but ensure the final matrix element count is computed in an overflow-safe type.Sources: Coding guidelines, Path instructions
cpp/tests/core/temporary_device_buffer.cu-67-68 (1)
67-68: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Raw-stream copy bypasses dry-run here.
This
raft::copy(..., resource::get_cuda_stream(h))call still performs real CUDA work during the dry-run pass, so the test can mutateresultwhile claiming the path is dry-run compliant. Use theraft::resourcesoverload (or guard the raw copy) inside the lambda.Suggested fix
- raft::copy( - result.data(), d_view.data_handle(), d_view.extent(0), resource::get_cuda_stream(h)); + auto result_view = raft::make_device_vector_view<int>(result.data(), d_view.extent(0)); + raft::copy(h, result_view, raft::make_const_mdspan(d_view));As per path instructions, use
raft::resourcesinstead of raw streams/handles, and guard CUDA work in dry-run mode viaresource::get_dry_run_flag(res).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/core/temporary_device_buffer.cu` around lines 67 - 68, The copy inside the temporary device buffer test is using a raw CUDA stream, which bypasses dry-run handling and can still perform real device work. Update the lambda around the raft::copy call to use the raft::resources-based overload tied to the existing resource handle, and gate the copy with resource::get_dry_run_flag(res) so the dry-run pass does not mutate result. Keep the fix localized to the temporary_device_buffer test and the lambda containing raft::copy.Source: Path instructions
cpp/tests/core/bitset.cu-331-332 (1)
331-332: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Guard raw CUDA work inside the dry-run lambda.
Issue: Line 331 and Line 387 still launch GPU work during the dry-run pass.
Why: This violates dry-run semantics and can mask non-compliant kernels.Suggested fix
- RAFT_CUDA_TRY(cudaMemsetAsync( - repeat_device.data_handle(), 0, eval_n_elements * sizeof(bitset_t), stream)); + if (!resource::get_dry_run_flag(h)) { + RAFT_CUDA_TRY(cudaMemsetAsync(repeat_device.data_handle(), + 0, + eval_n_elements * sizeof(bitset_t), + resource::get_cuda_stream(h))); + } ... - raft::linalg::range(query_device.data_handle(), query_device.size(), stream); + if (!resource::get_dry_run_flag(h)) { + raft::linalg::range( + query_device.data_handle(), query_device.size(), resource::get_cuda_stream(h)); + }As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via
resource::get_dry_run_flag(res).”Also applies to: 387-387
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/core/bitset.cu` around lines 331 - 332, Guard the raw CUDA work in the dry-run path so it only runs when resource::get_dry_run_flag(res) is false. Update the bitset test logic around the dry-run lambda in the relevant test helpers (including the cudaMemsetAsync call and the other CUDA launch at the referenced location) to skip these GPU operations during dry-run, keeping the lambda’s non-CUDA validation separate from device work.Source: Path instructions
cpp/tests/linalg/rsvd.cu-128-149 (1)
128-149: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: Percentage-based RSVD cases never hit the percentage API path.
Line 131 is dead here:
params.kis already rewritten from0to a concrete rank earlier inSetUp(), so the ratio-based fixtures now run the fixed-rank Jacobi variants instead ofrsvd_perc*. That drops coverage for the public percentage APIs and can hide dry-run regressions there.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/linalg/rsvd.cu` around lines 128 - 149, The percentage-based RSVD branch in the RSVD test is unreachable because `SetUp()` rewrites `params.k` before this lambda runs, so the test never exercises `rsvd_perc` or `rsvd_perc_symmetric`. Update the selection logic in the `rsvd` test around `raft::execute_with_dry_run_check` to branch on the original test mode or an explicit percentage/fixed-rank flag instead of `params.k`, so the ratio-based fixtures actually call the percentage APIs while fixed-rank cases continue to use `rsvd_fixed_rank_jacobi` and `rsvd_fixed_rank_symmetric_jacobi`.cpp/tests/sparse/csr_transpose.cu-97-113 (1)
97-113: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winHIGH: This dry-run wrapper breaks stream ordering with the test’s prepared inputs.
make_data()uploads into buffers onraft_handle/stream, but this lambda launchescsr_transposeon a differentraft::resourcesstream viaresource::get_cuda_stream(h). Without an explicit dependency, transpose can observe partially populated inputs. Have you considered runningexecute_with_dry_run_checkonraft_handleinstead so setup and compute stay ordered on the same resource? As per path instructions, verifycpp/**/*changes againstdocs/source/developer_guide.md, especially “Asynchronous operations and stream ordering”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/sparse/csr_transpose.cu` around lines 97 - 113, The dry-run wrapper in csr_transpose test is launching work on a different stream than the one used by make_data(), which can break ordering and let csr_transpose read inputs before uploads complete. Update the execute_with_dry_run_check call to use the same raft_handle/resource context as the test setup so the prepared buffers and the transpose kernel share stream ordering, and keep the relevant resource/stream selection aligned through resource::get_cuda_stream and the surrounding test harness.Source: Path instructions
docs/source/dry_run_protocol.md-43-49 (1)
43-49: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winThe example teaches the wrong stream ownership model.
This snippet introduces a public-style API that takes a raw
cudaStream_talongsideraft::resources, which contradicts RAFT’s dry-run/resource guidance and can mislead users away from resource-ordered streams. Showauto stream = resource::get_cuda_stream(handle);instead of accepting a separate stream parameter. As per coding guidelines, prefer “raft::resourcesover raw handles/streams.” As per path instructions, docs changes should prioritize accuracy and consistency with current APIs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/source/dry_run_protocol.md` around lines 43 - 49, The dry-run example in algo currently models an incorrect ownership pattern by accepting a raw cudaStream_t alongside raft::resources, which conflicts with the resource-first API guidance. Update the example to derive the stream from raft::resources using resource::get_cuda_stream(handle) inside algo, and remove the separate stream parameter so the signature and call sites reflect the preferred resource-ordered model. Keep the rest of the snippet aligned with this RAFT API style and ensure the doc example stays consistent with current dry-run guidance.Sources: Coding guidelines, Path instructions
cpp/tests/stats/meanvar.cu-84-85 (1)
84-85: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winHIGH: The allocation assertion now depends on a private
detailtype.Using
sizeof(raft::stats::detail::mean_var<T>)bakes private workspace layout into the test. That makes this suite fail on benign internal refactors instead of public dry-run regressions.As per coding guidelines, "Public API over
detail. When re-using functionality or writing tests, call the public API rather than functions in thedetailnamespace." As per path instructions, verify the code follows the idioms indocs/source/developer_guide.md, especially the "Preferred APIs and idioms" section.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/stats/meanvar.cu` around lines 84 - 85, The allocation assertion in the mean/var test is tied to the private mean_var detail type, so update the check to use the public stats API or a public dry-run/size query instead of sizeof(raft::stats::detail::mean_var<T>). Keep the existing rowMajor and params.cols logic in the test, but replace the private workspace-size dependency with the public symbol used by the mean/variance path so the test only validates supported API behavior.Sources: Coding guidelines, Path instructions
cpp/tests/stats/cov.cu-67-91 (1)
67-91: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winHIGH: This still leaves one
coventry point unverified under dry-run.Lines 67-91 wrap the mdspan overload, but Line 106 still calls
cov<false>(handle, ..., stream)directly. That means a regression in that publicraft::resourcespath can slip through while this file appears dry-run covered.As per path instructions, "Any function taking a
raft::resourceshandle is safe to call in dry-run mode."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/stats/cov.cu` around lines 67 - 91, The dry-run coverage in cov.cu only exercises the mdspan overload, while the public raft::resources entry point cov<false>(handle, ..., stream) remains unverified. Update the test to also invoke the raft::resources-based cov path inside the dry-run check, using the existing cov and raft::execute_with_dry_run_check symbols, so both overloads are covered under dry-run and regressions in the handle-taking API are caught.Source: Path instructions
cpp/tests/util/bitonic_sort.cu-146-152 (1)
146-152: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: this dry-run callback still launches the kernel.
The lambda ignores
hand calls the raw-streambitonic_launchdirectly, so the dry-run leg still executesbitonic_kernel<<<...>>>. That means this test never actually validates “skip GPU work in dry-run mode”; it only re-runs the kernel. Have you considered guarding the launch withif (!resource::get_dry_run_flag(h))or routing through araft::resources-aware helper? As per path instructions, dry-run CUDA work must be guarded viaresource::get_dry_run_flag(res)and RAFT code should preferraft::resourcesover raw streams.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/util/bitonic_sort.cu` around lines 146 - 152, The dry-run test callback in bitonic sort is still invoking the GPU launch unconditionally, so it never actually verifies that dry-run skips work. Update the lambda passed to execute_with_dry_run_check to use the provided raft::resources argument instead of ignoring it, and guard the bitonic_launch invocation with resource::get_dry_run_flag(h) (or equivalent raft::resources-aware helper) so the kernel only runs when not in dry-run mode. Prefer threading raft::resources through the launch path rather than calling the raw-stream bitonic_launch directly.Source: Path instructions
cpp/tests/sparse/symmetrize.cu-95-104 (1)
95-104: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftHIGH:
outescapes the dry-run resource scope.
outis an owningCOOcreated beforeexecute_with_dry_run_check, then populated inside the callback and dereferenced afterward. If the dry-run leg resizes/fills those buffers against the temporary dry-run resources, the laterassert_symmetrylaunch can end up reading storage whose allocator/resource context has already been torn down. Have you considered keeping the owningCOOentirely inside the callback (and asserting only on the non-dry-run leg), or splitting the dry-run accounting from the functional validation? As per path instructions, early-return dry-run guards must not break required resource setup/teardown, and the developer guide requires dry-run allocation tracking to stay within the dry-run resource scope.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/sparse/symmetrize.cu` around lines 95 - 104, `out` is being created outside `execute_with_dry_run_check`, so the owning COO can outlive the dry-run resource scope and be used after the temporary allocator is torn down. Move the `raft::sparse::COO<value_t, value_idx, nnz_t> out` setup and the `symmetrize`/`assert_symmetry` validation into the non-dry-run path, or otherwise keep the owning storage entirely within the callback used by `execute_with_dry_run_check`. Use the `execute_with_dry_run_check` and `raft::sparse::linalg::symmetrize` sites as the key locations to refactor so dry-run accounting stays separate from functional validation.Source: Path instructions
cpp/include/raft/core/detail/copy.hpp-403-405 (1)
403-405: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH: This guard drops copy-side workspace allocations from dry-run.
detail::copyallocates an intermediatedevice_mdarrayin theuse_intermediate_src/use_intermediate_dstpaths. Returning here skips those allocations entirely, so dry-run under-reports peak memory for layout/dtype conversion copies.Suggested fix
- if (resource::get_dry_run_flag(res)) { return; } + if constexpr (!(config::use_intermediate_src || config::use_intermediate_dst)) { + if (resource::get_dry_run_flag(res)) { return; } + }As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked" and dry-run guards must not skip required setup/teardown.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/detail/copy.hpp` around lines 403 - 405, The dry-run early return in detail::copy is skipping workspace allocations needed by the use_intermediate_src and use_intermediate_dst paths, so peak memory is under-reported. Remove the blanket guard at the start of copy and ensure the intermediate device_mdarray setup still runs under dry-run so allocation attempts through the memory resource are tracked; only skip the actual data movement if needed. Keep the fix localized to detail::copy and the intermediate-copy branches that allocate temporary buffers.Source: Path instructions
cpp/include/raft/core/sparse_types.hpp-181-181 (1)
181-181: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winHIGH:
initialize_sparsity()no longer updates the sparse structure.Line 181 only reallocates
c_elements_.get_elements()/view()still size fromstructure_.get_nnz(), and anyStructureType::initialize_sparsity()override is skipped, so the matrix can keep reporting the oldnnzwhile the backing storage changed.Suggested fix
- void initialize_sparsity(nnz_type nnz) { c_elements_.reallocate(nnz); }; + void initialize_sparsity(nnz_type nnz) + { + structure_.initialize_sparsity(nnz); + c_elements_.reallocate(nnz); + };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/sparse_types.hpp` at line 181, `initialize_sparsity()` in `sparse_types.hpp` only reallocates `c_elements_`, so the sparse metadata stays stale and any `StructureType::initialize_sparsity()` override is bypassed. Update `initialize_sparsity(nnz_type nnz)` to also initialize the underlying `structure_` and ensure `structure_.get_nnz()` reflects the new sparsity before `get_elements()` and `view()` are used. Preserve the structure-specific initialization path by invoking the structure’s own `initialize_sparsity` behavior instead of replacing it with a bare reallocate.Source: Path instructions
cpp/include/raft/core/dry_run_resources.hpp-54-74 (1)
54-74: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftHIGH: Serialize process-global resource swaps.
Issue: each active
dry_run_resourcesinstance snapshots and replaces process-global host/device resources without a lifetime lock.
Why: overlapping dry-run scopes can restore globals out of order, making unrelated allocations hit the wrong or destroyed adaptor.Suggested fix
`#include` <cstddef> `#include` <cstdint> `#include` <memory> +#include <mutex> `#include` <utility> @@ class dry_run_resources : public resources { public: explicit dry_run_resources(const resources& existing) : resources(existing), + global_resource_lock_(global_resource_mutex()), active_(!resource::get_dry_run_flag(existing)), old_host_(raft::mr::get_default_host_resource()), old_device_(rmm::mr::get_current_device_resource_ref()) @@ private: + static auto global_resource_mutex() -> std::recursive_mutex& + { + static std::recursive_mutex mutex; + return mutex; + } + + // Hold while process-global memory resources are swapped. + std::unique_lock<std::recursive_mutex> global_resource_lock_; + // Declaration order determines destruction order.As per coding guidelines, RAFT algorithms should remain thread-safe with different
raft::resourcesinstances.Also applies to: 159-192
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/dry_run_resources.hpp` around lines 54 - 74, Serialize the process-global host/device resource swap performed by dry_run_resources so overlapping active scopes cannot restore resources out of order. Update dry_run_resources’ constructor/destructor path to take and hold a shared lifetime lock around the existing snapshots and the calls to raft::mr::set_default_host_resource and rmm::mr::set_current_device_resource, and keep the lock until the base-class cleanup completes. Use the dry_run_resources symbols and the init()/destructor flow to ensure only one active instance can replace the global adaptors at a time while still allowing separate raft::resources instances to remain thread-safe.Source: Coding guidelines
cpp/include/raft/mr/dry_run_resource.hpp-80-98 (1)
80-98: 🩺 Stability & Availability | 🟠 MajorHIGH: Keep probe deallocation on the allocation stream.
probe_containerallocates the async probe with the caller’scuda::stream_ref, but its destructor always frees it oncudaStreamPerThread. For stream-ordered memory resources, that can break ordering and race with pending work on the original stream.Suggested fix
class probe_container { MR mr_; void* ptr_; std::size_t size_; std::size_t alignment_; + cuda::stream_ref stream_{cudaStreamPerThread}; @@ probe_container(MR mr, cuda::stream_ref stream, std::size_t size, std::size_t alignment = alignof(std::max_align_t)) - : mr_(std::move(mr)), ptr_(nullptr), size_(size), alignment_(alignment) + : mr_(std::move(mr)), ptr_(nullptr), size_(size), alignment_(alignment), stream_(stream) { ptr_ = mr_.allocate(stream, size_, alignment_); } @@ if (ptr_ == nullptr) return; if constexpr (cuda::mr::resource<MR>) { - mr_.deallocate(cuda::stream_ref{cudaStreamPerThread}, ptr_, size_, alignment_); + mr_.deallocate(stream_, ptr_, size_, alignment_); } else { mr_.deallocate_sync(ptr_, size_, alignment_); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/mr/dry_run_resource.hpp` around lines 80 - 98, probe_container in dry_run_resource.hpp allocates using the caller-provided cuda::stream_ref but currently deallocates on cudaStreamPerThread in the destructor, which breaks stream ordering for cuda::mr::resource types. Update probe_container to retain the allocation stream as state and use that same stream when calling mr_.deallocate, while preserving the existing synchronous deallocation path for non-resource MR implementations.Source: Path instructions
| explicit dry_run_resources(const resources& existing) | ||
| : resources(existing), | ||
| active_(!resource::get_dry_run_flag(existing)), | ||
| old_host_(raft::mr::get_default_host_resource()), | ||
| old_device_(rmm::mr::get_current_device_resource_ref()) | ||
| { | ||
| if (active_) init(); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
CRITICAL: Restore global resources if init() throws.
Issue: init() replaces global host/device resources, then continues through operations that can throw; if construction fails, ~dry_run_resources() is never called.
Why: the global resource can be left pointing at a dry-run adaptor that is being unwound, causing later allocations to use a dangling resource.
Suggested fix
{
- if (active_) init();
+ if (active_) {
+ try {
+ init();
+ } catch (...) {
+ resource::set_dry_run_flag(*this, false);
+ raft::mr::set_default_host_resource(old_host_);
+ rmm::mr::set_current_device_resource(old_device_);
+ resources_.clear();
+ factories_.clear();
+ throw;
+ }
+ }
}As per path instructions, dry-run guards must not skip required resource setup/teardown that can affect later non-dry-run calls.
Also applies to: 159-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/core/dry_run_resources.hpp` around lines 54 - 60, The
`dry_run_resources` construction path in `init()` can leave global host/device
resources pointing at the dry-run adaptor if an exception is thrown before
`~dry_run_resources()` runs. Update the `dry_run_resources(const resources&)`
constructor and/or `init()` so the old resources saved in `old_host_` and
`old_device_` are restored immediately on any failure during setup, using an
exception-safe guard/rollback path that also covers the later setup block
referenced by the same issue.
Source: Path instructions
| rmm::device_uvector<math_t> d_eig_vectors(memUsage == COPY_INPUT ? n_rows * n_cols : 0, stream); | ||
|
|
||
| if (resource::get_dry_run_flag(handle)) { return; } | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
CRITICAL: eigSelDC(COPY_INPUT) truncates from an uninitialized buffer.
When memUsage == COPY_INPUT, Lines 225-245 still copy/solve into eig_vectors, but Lines 260-265 truncate from d_eig_vectors, which this branch never populates. That turns the copy-preserving path into silent garbage output.
Suggested fix
- } else if (memUsage == COPY_INPUT) {
- raft::matrix::copy(handle,
- make_device_matrix_view<const math_t>(in, n_rows, n_cols),
- make_device_matrix_view(eig_vectors, n_rows, n_cols));
+ } else if (memUsage == COPY_INPUT) {
+ raft::matrix::copy(handle,
+ make_device_matrix_view<const math_t>(in, n_rows, n_cols),
+ make_device_matrix_view(d_eig_vectors.data(), n_rows, n_cols));
RAFT_CUSOLVER_TRY(cusolverDnsyevdx(cusolverH,
CUSOLVER_EIG_MODE_VECTOR,
CUSOLVER_EIG_RANGE_I,
CUBLAS_FILL_MODE_UPPER,
static_cast<int64_t>(n_rows),
- eig_vectors,
+ d_eig_vectors.data(),
static_cast<int64_t>(n_cols),
math_t(0.0),
math_t(0.0),
static_cast<int64_t>(n_cols - n_eig_vals + 1),
static_cast<int64_t>(n_cols),
&h_meig,
eig_vals,
d_work.data(),
lwork,
d_dev_info.data(),
stream));
}Also applies to: 224-245, 260-265
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/linalg/detail/eig.cuh` around lines 202 - 205, The
COPY_INPUT path in eigSelDC is truncating from d_eig_vectors even though that
buffer is never populated in that branch, so the preserved-input workflow
returns garbage. Update eigSelDC so the branch that copies/solves into
eig_vectors also truncates from the populated buffer (or otherwise routes
COPY_INPUT output from eig_vectors), and keep the existing dry-run handling in
resource::get_dry_run_flag and the eigSelDC copy/solve logic consistent with the
selected memUsage path.
| if (resource::get_dry_run_flag(handle)) { return; } | ||
| RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys, | ||
| "Output is not of size ncols * n_unique_keys"); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
CRITICAL: Output-shape check is a tautology.
d_A.extent(0) == d_A.extent(0) never validates d_sums.extent(0), so an undersized output matrix can pass this guard and the downstream kernel can write out of bounds.
Suggested fix
- RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
+ RAFT_EXPECTS(d_sums.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
"Output is not of size ncols * n_unique_keys");As per path instructions, cpp/REVIEW_GUIDELINES.md says to prioritize “GPU/CUDA correctness” and “invalid memory access”. As per coding guidelines, public APIs should “validate their metadata wherever possible.”
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (resource::get_dry_run_flag(handle)) { return; } | |
| RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys, | |
| "Output is not of size ncols * n_unique_keys"); | |
| if (resource::get_dry_run_flag(handle)) { return; } | |
| RAFT_EXPECTS(d_sums.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys, | |
| "Output is not of size ncols * n_unique_keys"); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/linalg/reduce_rows_by_key.cuh` around lines 152 - 154, The
output-size guard in the reduce_rows_by_key path is incorrect because the first
extent check is a tautology and never validates the actual output buffer. Update
the RAFT_EXPECTS in reduce_rows_by_key.cuh to validate the row dimension of
d_sums against n_unique_keys (and keep the existing column check), using the
reduce_rows_by_key / d_sums / n_unique_keys symbols to locate the guard, so
undersized outputs are rejected before the kernel runs.
Sources: Coding guidelines, Path instructions
| // Query workspace size for sortPairs before dry-run check to track allocation | ||
| size_t workspace_size = 0; | ||
| cub::DeviceRadixSort::SortPairs(nullptr, | ||
| workspace_size, | ||
| expWts.data(), | ||
| sortedWts.data(), | ||
| inIdxPtr, | ||
| outIdxBuff.data(), | ||
| (int)len, | ||
| 0, | ||
| sizeof(WeightsT) * 8, | ||
| stream); | ||
| rmm::device_uvector<char> workspace(workspace_size, stream); | ||
|
|
||
| if (resource::get_dry_run_flag(handle)) { return; } | ||
| kernel_dispatch<WeightsT, IdxT, 1, SamplingParams<WeightsT, IdxT>>( | ||
| expWts.data(), len, stream, params); | ||
| ///@todo: use a more efficient partitioning scheme instead of full sort | ||
| // sort the array and pick the top sampledLen items | ||
| IdxT* outIdxPtr = outIdxBuff.data(); | ||
| rmm::device_uvector<char> workspace(0, stream); | ||
| sortPairs(workspace, expWts.data(), sortedWts.data(), inIdxPtr, outIdxPtr, (int)len, stream); | ||
| cub::DeviceRadixSort::SortPairs(workspace.data(), | ||
| workspace_size, | ||
| expWts.data(), | ||
| sortedWts.data(), | ||
| inIdxPtr, | ||
| outIdxPtr, | ||
| (int)len, | ||
| 0, | ||
| sizeof(WeightsT) * 8, | ||
| stream); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant code region and surrounding function.
sed -n '240,340p' cpp/include/raft/random/detail/rng_impl_deprecated.cuh
# Find how CUB SortPairs is handled elsewhere in the repo.
rg -n "DeviceRadixSort::SortPairs|RAFT_CUDA_TRY\\(cub::DeviceRadixSort::SortPairs|cub::DeviceRadixSort::SortPairs" cpp -g'*.{cuh,cu,h,hpp,cpp}'Repository: rapidsai/raft
Length of output: 4827
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,120p' cpp/include/raft/util/detail/cub_wrappers.cuh
printf '\n--- rng_impl.cuh ---\n'
sed -n '280,350p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n--- columnWiseSort excerpt ---\n'
sed -n '290,350p' cpp/include/raft/matrix/detail/columnWiseSort.cuhRepository: rapidsai/raft
Length of output: 7088
CRITICAL: Check the CUB radix-sort calls
cub::DeviceRadixSort::SortPairs returns an error code here, but both the workspace-size query and the actual sort drop it. A failed query can produce a bogus workspace size, and a failed sort will silently corrupt the sampled output. Use RAFT_CUDA_TRY(...) on both calls.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/random/detail/rng_impl_deprecated.cuh` around lines 278 -
307, The two cub::DeviceRadixSort::SortPairs calls in the RNG sampling path are
ignoring the returned error status, so failures in the workspace-size query or
the actual radix sort can go unnoticed; wrap both calls in RAFT_CUDA_TRY so
errors are surfaced immediately. Update the code in the sampling logic around
the workspace_size query and the later sort that uses workspace.data() and
outIdxBuff.data(), keeping the existing kernel_dispatch flow unchanged.
Sources: Coding guidelines, Path instructions
| // Query workspace size for sortPairs before dry-run check to track allocation | ||
| size_t workspace_size = 0; | ||
| cub::DeviceRadixSort::SortPairs(nullptr, | ||
| workspace_size, | ||
| expWts.data(), | ||
| sortedWts.data(), | ||
| inIdxPtr, | ||
| outIdxBuff.data(), | ||
| (int)len, | ||
| 0, | ||
| sizeof(WeightsT) * 8, | ||
| stream); | ||
| rmm::device_uvector<char> workspace(workspace_size, stream); | ||
|
|
||
| if (dry_run) { return; } | ||
|
|
||
| RAFT_CALL_RNG_FUNC(rng_state, call_rng_kernel<1>, rng_state, stream, expWts.data(), len, params); | ||
|
|
||
| ///@todo: use a more efficient partitioning scheme instead of full sort | ||
| // sort the array and pick the top sampledLen items | ||
| IdxT* outIdxPtr = outIdxBuff.data(); | ||
| rmm::device_uvector<char> workspace(0, stream); | ||
| sortPairs(workspace, expWts.data(), sortedWts.data(), inIdxPtr, outIdxPtr, (int)len, stream); | ||
| cub::DeviceRadixSort::SortPairs(workspace.data(), | ||
| workspace_size, | ||
| expWts.data(), | ||
| sortedWts.data(), | ||
| inIdxPtr, | ||
| outIdxPtr, | ||
| (int)len, | ||
| 0, | ||
| sizeof(WeightsT) * 8, | ||
| stream); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file around the cited regions and locate CUB calls in it.
git ls-files cpp/include/raft/random/detail/rng_impl.cuh
wc -l cpp/include/raft/random/detail/rng_impl.cuh
sed -n '280,360p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
sed -n '390,470p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
rg -n "Device(RadixSort|MergeSort|Select)::|RAFT_CUDA_TRY|cudaError_t" cpp/include/raft/random/detail/rng_impl.cuhRepository: rapidsai/raft
Length of output: 7813
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the exact CUB APIs used and their return types in installed headers if present,
# and inspect local usage patterns around wrapper macros.
rg -n "namespace cub|DeviceRadixSort|DeviceMergeSort|DeviceSelect" /usr/include /usr/local/include 2>/dev/null | head -n 50
printf '\n--- local references ---\n'
rg -n "RAFT_CUDA_TRY\\(cub::Device|cub::Device.*SortPairs|cub::DeviceSelect::UniqueByKey" cpp -g'*.cuh' -g'*.cu' -g'*.hpp' -g'*.cpp'Repository: rapidsai/raft
Length of output: 151
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file around the cited regions and locate CUB calls in it.
git ls-files cpp/include/raft/random/detail/rng_impl.cuh
wc -l cpp/include/raft/random/detail/rng_impl.cuh
sed -n '280,360p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
sed -n '390,470p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
rg -n "Device(RadixSort|MergeSort|Select)::|RAFT_CUDA_TRY|cudaError_t" cpp/include/raft/random/detail/rng_impl.cuhRepository: rapidsai/raft
Length of output: 7813
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the exact CUB APIs used and their return types in installed headers if present,
# and inspect local usage patterns around wrapper macros.
rg -n "namespace cub|DeviceRadixSort|DeviceMergeSort|DeviceSelect" /usr/include /usr/local/include 2>/dev/null | head -n 50
printf '\n--- local references ---\n'
rg -n "RAFT_CUDA_TRY\\(cub::Device|cub::Device.*SortPairs|cub::DeviceSelect::UniqueByKey" cpp -g'*.cuh' -g'*.cu' -g'*.hpp' -g'*.cpp'Repository: rapidsai/raft
Length of output: 151
CRITICAL: Wrap the new CUB calls with RAFT_CUDA_TRY
cpp/include/raft/random/detail/rng_impl.cuh:311-330, 408-457
The new cub::DeviceRadixSort::SortPairs, cub::DeviceMergeSort::SortPairs, and cub::DeviceSelect::UniqueByKey calls ignore their return codes. A failed workspace query or execution can leave invalid workspace sizing or stale outputs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/random/detail/rng_impl.cuh` around lines 309 - 339, The new
CUB invocations in rng_impl.cuh are not checking failure status, so wrap each
`cub::DeviceRadixSort::SortPairs`, `cub::DeviceMergeSort::SortPairs`, and
`cub::DeviceSelect::UniqueByKey` call in `RAFT_CUDA_TRY` to surface both
workspace-query and execution errors. Update the affected logic in the RNG
helpers that perform sorting/uniquing, keeping the existing `dry_run` flow and
workspace allocation intact, but ensure any returned CUB error aborts before
stale buffers or invalid workspace sizes are used.
Sources: Coding guidelines, Path instructions
| size_t scan_ws_bytes = 0; | ||
| cub::DeviceScan::ExclusiveSum( | ||
| nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream); | ||
| rmm::device_uvector<char> scan_ws(scan_ws_bytes, stream); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the target file and inspect the relevant region.
ast-grep outline cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh --view expanded || true
wc -l cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh
sed -n '280,360p' cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuhRepository: rapidsai/raft
Length of output: 3628
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look for the exact CUB usage and surrounding error handling patterns.
rg -n "DeviceScan::ExclusiveSum|RAFT_CUDA_TRY|cub::" cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh cpp -g '!**/build/**' -g '!**/dist/**'Repository: rapidsai/raft
Length of output: 50369
CRITICAL: Wrap both cub::DeviceScan::ExclusiveSum calls
- Have you considered guarding both calls with
RAFT_CUDA_TRY? A failed workspace query or scan would otherwise be ignored and can leavesub_nnz/indptrinconsistent. - The first call is also before the dry-run check, so dry-run mode still performs CUDA work here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh` around lines 312 -
315, Wrap both cub::DeviceScan::ExclusiveSum invocations in bitmap_to_csr.cuh
with RAFT_CUDA_TRY so both the workspace-size query and the actual scan are
checked for CUDA errors, and move or guard the initial workspace query so it is
skipped in dry-run mode; use the existing bitmap_to_csr conversion logic around
sub_nnz, indptr, and scan_ws to locate the fix.
Sources: Coding guidelines, Path instructions
| // Pre-allocate maximum distance buffer size (chunk * chunk) before dry-run guard | ||
| // to ensure allocation is tracked in dry-run mode and avoid reallocations in the loop | ||
| rmm::device_uvector<value_t> distances_buffer(chunk * chunk, stream); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | 🏗️ Heavy lift
CRITICAL: Reused distance buffer races across stream-pool work.
Issue: pairwise_distance writes the single distances_buffer via handle, while compute_chunked_a_b reads it on chunk_stream, and later iterations reuse the same buffer before sync_stream_pool().
Why: This can nondeterministically corrupt a/b and the final silhouette score.
Have you considered using one distance buffer per active stream/chunk, or enqueueing the distance computation and compute_chunked_a_b on the same ordered stream before reusing the buffer?
As per path instructions, dry-run guards must not accidentally affect stream lifecycle, and the developer guide requires internal-stream work to be ordered against resource::get_cuda_stream().
Also applies to: 222-243
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/include/raft/stats/detail/batched/silhouette_score.cuh` around lines 192
- 194, The shared distances buffer in the silhouette score chunking flow is
being reused across concurrent stream-pool work, which can cause races between
`pairwise_distance`, `compute_chunked_a_b`, and later iterations before
`sync_stream_pool()`. Fix this by ensuring each active stream/chunk has its own
distance buffer, or by making the distance computation and `compute_chunked_a_b`
run on the same ordered stream before the buffer is reused; keep the dry-run
guard and stream lifecycle ordering intact in `compute_chunked_a_b` and the
surrounding chunk loop.
Source: Path instructions
The dry run protocol defines a mechanism to simulate the execution of algorithms to get a precise estimate of the memory requirements for a real execution with the same parameters.
This PR:
raft::util::dry_run_execute,raft::dry_run_resources, andresource::get_dry_run_flagto let callers estimate peak memory usage of any RAFT algorithm without executing GPU work.Note for reviewers
The PR contains a lot of small tedious changes to cover all of raft library and the tests components.
Please start reading at
docs/sourceupdates to learn more about the topic and the principles guiding these changes.