Skip to content

Dry Run Protocol#2961

Open
achirkin wants to merge 94 commits into
NVIDIA:mainfrom
achirkin:fea-dry-run-protocol
Open

Dry Run Protocol#2961
achirkin wants to merge 94 commits into
NVIDIA:mainfrom
achirkin:fea-dry-run-protocol

Conversation

@achirkin

@achirkin achirkin commented Feb 20, 2026

Copy link
Copy Markdown
Contributor

The dry run protocol defines a mechanism to simulate the execution of algorithms to get a precise estimate of the memory requirements for a real execution with the same parameters.

#include <raft/util/dry_run_memory_resource.hpp>

raft::resources res;
// auto my_function(const raft::resources& res, my_args...);
auto stats = raft::util::dry_run_execute(res, my_function, my_args...);
// stats.device_global  – peak device memory (bytes)

This PR:

  • Introduces new infrastructure: raft::util::dry_run_execute, raft::dry_run_resources, and resource::get_dry_run_flag to let callers estimate peak memory usage of any RAFT algorithm without executing GPU work.
  • Makes all public functions across all raft namespaces dry-run compliant: allocations are always visible to the tracker; CUDA work is skipped.
  • Adds a small user guide (docs/source/dry_run_protocol.md)

Note for reviewers
The PR contains a lot of small tedious changes to cover all of raft library and the tests components.
Please start reading at docs/source updates to learn more about the topic and the principles guiding these changes.

…mory

Introduce a dry-run execution framework that replaces device and host
memory resources with lightweight fake allocators to measure peak memory
usage without holding real memory.

New files:
- dry_run_memory_resource.hpp: dry_run_allocator (lock-free bump
  allocator), dry_run_device_memory_resource, dry_run_host_memory_resource,
  dry_run_resource_manager (RAII), and dry_run_execute() helper.
- dry_run_flag.hpp: boolean dry-run flag as a raft resource, allowing
  algorithms to skip kernel execution during profiling.
- tests/util/dry_run_memory_resource.cpp: unit tests.

The dry_run_allocator probes the upstream once to obtain a base address,
then atomically bumps a pointer for each allocation — no mutex, no map,
no real memory held after the initial probe.
…pinned_memory_resource

Add pinned and managed resources to the raft::resources handle to make it possible to customize / temporarily replace these resources
@achirkin achirkin self-assigned this Feb 20, 2026
@achirkin achirkin requested review from a team as code owners February 20, 2026 12:30
@achirkin achirkin added feature request New feature or request breaking Breaking change labels Feb 20, 2026
@achirkin achirkin moved this to In Progress in Unstructured Data Processing Feb 20, 2026
Merges Remove deprecated headers (NVIDIA#2939). Conflict resolutions:
- rsvd.cuh: Use new mdspan-based raft::matrix::sqrt and reciprocal APIs
  (they have internal dry-run guards); kept cudaMemsetAsync guard
- svd.cuh: Use raft::matrix::weighted_sqrt (has internal dry-run guard)
- matrix.cuh: Accept deletion (deprecated, removed in main)

Co-authored-by: Cursor <cursoragent@cursor.com>

// Dry-run guard: raft::copy is a pure data-movement utility with no
// allocations that callers would need tracked.
if (resource::get_dry_run_flag(res)) { return; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't realize we were adding this as a new resource. This would make it hard to use the dry-run for pre-initializing resources.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that's fine! We can still push the initialized resources back to the original resources handle on destruction of the dry run resources

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: with #3052 , resources initialized in dry run mode will be automatically shared back with the input resources.

@achirkin

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added dry-run support across many compute, matrix, sparse, and statistics APIs so operations can estimate memory usage without performing full GPU work.
    • Introduced new dry-run resource helpers and execution utilities for tracking peak/current memory usage.
  • Bug Fixes

    • Skipped unnecessary copies, synchronizations, and output updates during dry runs to improve accuracy and avoid side effects.
  • Documentation

    • Added guidance for dry-run usage and developer rules, plus API docs for the new resource utilities.
  • Tests

    • Added broad coverage for dry-run behavior and allocation tracking across core, linalg, sparse, stats, and utility workflows.

Walkthrough

Introduces a dry-run memory profiling protocol to RAFT. A new DRY_RUN_FLAG resource type, dry_run_flag_resource, dry_run_resource memory adaptor, and dry_run_resources RAII wrapper swap real allocators for counter-tracking adaptors. reallocate() is added to container policies. A dry_run flag is propagated through ~150 headers across linalg, matrix, stats, random, sparse, and spectral subsystems, guarding CUDA kernel launches, cuBLAS/cuSOLVER calls, and stream synchronization. Tests are updated to verify allocation prediction using a new execute_with_dry_run_check harness.

Changes

Dry-Run Memory Profiling Protocol

Layer / File(s) Summary
Core dry-run resource infrastructure
cpp/include/raft/core/resource/resource_types.hpp, cpp/include/raft/core/resource/dry_run_flag.hpp, cpp/include/raft/core/resource/cuda_stream.hpp, cpp/include/raft/mr/dry_run_resource.hpp, cpp/include/raft/core/device_container_policy.hpp, cpp/include/raft/core/host_container_policy.hpp, cpp/include/raft/core/dry_run_resources.hpp
Adds DRY_RUN_FLAG enum, dry_run_flag_resource/factory, get/set_dry_run_flag, sync_stream no-op under dry-run, dry_run_memory_counter, probe_container, dry_run_resource memory adaptor, reallocate() on both container policies, dry_run_resources RAII wrapper, and dry_run_execute template helper.
Scalar factory and copy guards
cpp/include/raft/core/device_mdarray.hpp, cpp/include/raft/core/host_mdarray.hpp, cpp/include/raft/core/managed_mdarray.hpp, cpp/include/raft/core/pinned_mdarray.hpp, cpp/include/raft/core/detail/copy.hpp
Guards value-write side-effects in make_*_scalar factories and detail::copy behind the dry-run flag.
Sparse matrix reallocate() adoption
cpp/include/raft/core/coo_matrix.hpp, cpp/include/raft/core/csr_matrix.hpp, cpp/include/raft/core/sparse_types.hpp
COO, CSR, and sparse matrix sparsity initialization switches from resize() to reallocate() so old memory is released before re-sizing.
Reduction and core linalg detail
cpp/include/raft/linalg/detail/reduce.cuh, cpp/include/raft/linalg/detail/coalesced_reduction-inl.cuh, cpp/include/raft/linalg/detail/matrix_vector_op.cuh, cpp/include/raft/linalg/detail/norm.cuh, cpp/include/raft/linalg/detail/axpy.cuh, cpp/include/raft/linalg/detail/map.cuh, cpp/include/raft/linalg/unary_op.cuh
Adds dry_run parameter to detail::reduce, coalesced reduction (thin/medium early-exit, thick allocation-pass-through), matrixVectorOp, norm callers; guards axpy, gemv, map, unary ops.
Public linalg API early-returns
cpp/include/raft/linalg/add.cuh, cpp/include/raft/linalg/coalesced_reduction.cuh, cpp/include/raft/linalg/reduce.cuh, cpp/include/raft/linalg/norm.cuh, cpp/include/raft/linalg/dot.cuh, cpp/include/raft/linalg/subtract.cuh, cpp/include/raft/linalg/...
Adds get_dry_run_flag early-returns to ~20 public linalg APIs; routes dry-run into detail implementations via the handle.
cuBLAS/cuSOLVER detail-level guards
cpp/include/raft/linalg/detail/cublaslt_wrappers.hpp, cpp/include/raft/linalg/detail/eig.cuh, cpp/include/raft/linalg/detail/svd.cuh, cpp/include/raft/linalg/detail/lstsq.cuh, cpp/include/raft/linalg/detail/qr.cuh, cpp/include/raft/linalg/detail/rsvd.cuh, cpp/include/raft/linalg/detail/transpose.cuh
Implements workspace-size-then-early-return pattern for all major solver entries: matmul, eigensolvers, SVD, QR, least-squares, RSVD, transpose.
Matrix subsystem guards
cpp/include/raft/matrix/argmax.cuh, cpp/include/raft/matrix/argmin.cuh, cpp/include/raft/matrix/col_wise_sort.cuh, cpp/include/raft/matrix/copy.cuh, cpp/include/raft/matrix/gather.cuh, cpp/include/raft/matrix/detail/*, cpp/include/raft/matrix/...
Adds early-returns to ~25 matrix APIs and their detail implementations, including gather/scatter, select-k (radix/warpsort), diagonal, fill, linewise-op, norm, shift, sort.
Random subsystem guards
cpp/include/raft/random/detail/make_blobs.cuh, cpp/include/raft/random/detail/make_regression.cuh, cpp/include/raft/random/detail/rng_impl.cuh, cpp/include/raft/random/rng.cuh, cpp/include/raft/random/...
Adds dry_run to RNG sampling primitives (discrete, sampleWithoutReplacement, excess_subsample) with CUB workspace pre-sizing; guards all univariate RNG APIs and data generation helpers.
Sparse subsystem guards
cpp/include/raft/sparse/convert/detail/*, cpp/include/raft/sparse/linalg/detail/*, cpp/include/raft/sparse/op/detail/*, cpp/include/raft/sparse/solver/detail/*, cpp/include/raft/sparse/matrix/detail/preprocessing.cuh
Adds workspace-then-early-return to sparse conversion (bitmap/bitset/COO→CSR, CUB scan), sort, dedupe, spmm, sddmm, csr_transpose, symmetrize, LAP Hungarian, Lanczos, Cholesky QR, randomized SVDs.
Stats subsystem guards
cpp/include/raft/stats/detail/*, cpp/include/raft/stats/*.cuh
Adds dry_run parameter and early-returns across all stats functions: accuracy, ARI, entropy (CUB histogram workspace pattern), silhouette, mean/variance/stddev, regression metrics (CUB radix-sort workspace), contingency matrix, and ~15 other stat primitives.
Spectral and bitset guards
cpp/include/raft/spectral/detail/*, cpp/include/raft/core/bitset.cuh, cpp/include/raft/core/bitset.hpp
Guards spectral matrix-vector multiplications (sparse_matrix_t, laplacian_matrix_t, modularity_matrix_t), partition/modularity analysis, spectral utilities, and bitset count/repeat host-side operations.
Test utility: execute_with_dry_run_check + new test suites
cpp/tests/test_utils.cuh, cpp/tests/util/dry_run_resources.cpp, cpp/tests/util/dry_run_guards.cu, cpp/tests/CMakeLists.txt
Adds alloc_behavior enum and execute_with_dry_run_check to test utilities; introduces comprehensive unit tests for dry-run infrastructure and end-to-end guard validation.
Existing test updates
cpp/tests/linalg/*, cpp/tests/matrix/*, cpp/tests/sparse/*, cpp/tests/stats/*, cpp/tests/core/*, cpp/tests/util/*, docs/source/...
Wraps ~80 existing test compute calls in execute_with_dry_run_check; adds new DryRun test cases for RSVD and SVD via dry_run_execute; adds documentation for the dry-run protocol.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.20% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the PR’s main theme of adding dry-run support across RAFT.
Description check ✅ Passed The description is clearly about the dry-run protocol and the related RAFT changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
cpp/include/raft/matrix/slice.cuh (1)

52-60: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Keep the public input validation before the dry-run exit.

Line 52 returns before the layout and bounds checks, so dry-run silently accepts slices that the real call would reject. That makes dry-run behavior diverge at the public API boundary.

Move the dry-run return after the RAFT_EXPECTS(...) block and before detail::sliceMatrix(...).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/matrix/slice.cuh` around lines 52 - 60, The public
validation in sliceMatrix is being skipped during dry-run because the early
resource::get_dry_run_flag(handle) return comes before the RAFT_EXPECTS checks.
Move the dry-run exit to after the existing layout and bounds validation in
sliceMatrix, so is_row_or_column_major and the row/col extent checks always run
before any return, and only detail::sliceMatrix is bypassed in dry-run mode.
cpp/include/raft/sparse/convert/detail/bitset_to_csr.cuh (1)

123-147: 🩺 Stability & Availability | 🔴 Critical

CRITICAL: Wrap both CUB scan calls in RAFT_CUDA_TRY.

cub::DeviceScan::ExclusiveSum returns cudaError_t; leaving either call unchecked can hide a failed workspace query or scan until later stream work.

Suggested fix
-  cub::DeviceScan::ExclusiveSum(
-    nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream);
+  RAFT_CUDA_TRY(cub::DeviceScan::ExclusiveSum(
+    nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream));
@@
-  cub::DeviceScan::ExclusiveSum(
-    scan_ws.data(), scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream);
+  RAFT_CUDA_TRY(cub::DeviceScan::ExclusiveSum(
+    scan_ws.data(), scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/sparse/convert/detail/bitset_to_csr.cuh` around lines 123 -
147, Both cub::DeviceScan::ExclusiveSum calls in bitset_to_csr need to be
checked with RAFT_CUDA_TRY so CUDA failures are surfaced immediately. Wrap the
workspace-size query call and the actual scan call in RAFT_CUDA_TRY, keeping the
existing scan_ws_bytes and scan_ws usage intact, and make sure the fix is
applied in the bitset-to-CSR conversion path near calc_nnz_by_rows and csr
initialization.

Sources: Coding guidelines, Path instructions

cpp/include/raft/spectral/detail/matrix_wrappers.hpp (1)

106-118: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

HIGH: nrm1() now undercounts dry-run memory.

Returning 0 before thrust::reduce skips any temporary storage that the allocator-backed Thrust policy would request, so dry-run can underestimate memory during modularity_matrix_t construction. Have you considered keeping a temp-storage sizing/probe path here instead of short-circuiting before the reduce setup? As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/spectral/detail/matrix_wrappers.hpp` around lines 106 - 118,
`nrm1()` is short-circuiting on dry-run before Thrust sets up its reduction,
which skips allocator-backed temporary storage accounting and underestimates
memory during `modularity_matrix_t` construction. Update
`raft::spectral::detail::matrix_wrappers::nrm1()` to preserve the Thrust
reduce/setup path for dry-run memory probing, and only avoid the actual
computation if needed after temp-storage sizing has been recorded; keep the
`thrust::reduce`/`thrust_policy` path reachable so rmm/raft allocations are
still tracked.

Source: Path instructions

cpp/include/raft/random/detail/rmat_rectangular_generator.cuh (1)

209-224: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run bypasses theta shape validation

This return now happens before the host-side theta.extent(0) check, so dry-run accepts inputs that the real call would reject. That makes dry-run mask integration bugs instead of preserving the API contract.

As per coding guidelines, public APIs should “validate their metadata wherever possible,” and the dry-run guard should only skip the CUDA work.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/random/detail/rmat_rectangular_generator.cuh` around lines
209 - 224, The dry-run early return in rmat_rectangular_gen is bypassing the
host-side theta shape validation, so invalid metadata can slip through. Move the
theta.extent(0) check so it runs before the resource::get_dry_run_flag(handle)
return, while still keeping the CUDA work skipped in dry-run. Keep the existing
validation logic in the rmat_rectangular_gen path and ensure dry-run preserves
the same API contract as a real call.

Sources: Coding guidelines, Path instructions

cpp/include/raft/stats/contingency_matrix.cuh (1)

115-128: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

HIGH: get_input_class_cardinality() leaves host outputs indeterminate in dry-run

This early return skips both pointer validation and the writes to minLabel/maxLabel. Callers use this helper to size out_mat before contingency_matrix(), so the documented workflow is no longer safe under dry-run.

As per path instructions, dry-run support should keep raft::resources entry points safe to call, and the developer guide still expects metadata/host-side contract checks where possible.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/contingency_matrix.cuh` around lines 115 - 128,
`get_input_class_cardinality()` currently returns immediately on dry-run,
leaving `minLabel` and `maxLabel` unset and skipping their validation. Update
this helper so the host outputs are always initialized to safe values (or
otherwise deterministically populated) even when
`resource::get_dry_run_flag(handle)` is true, while keeping the existing
`RAFT_EXPECTS` checks and the `detail::getInputClassCardinality` path intact for
normal execution. Reference the `get_input_class_cardinality`,
`resource::get_dry_run_flag`, and `detail::getInputClassCardinality` symbols
when adjusting the control flow.

Sources: Coding guidelines, Path instructions

cpp/include/raft/stats/detail/homogeneity_score.cuh (1)

46-56: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run now reports perfect homogeneity for non-empty inputs.

When dry_run is true, mutual_info_score(...) and entropy(...) both short-circuit to 0, so the existing computedEntropy == 0 ? 1.0 : ... fallback returns a perfect score without doing any work. That makes dry-run observably wrong for any caller that reads the result during profiling. Consider short-circuiting dry_run in this wrapper before the entropy-zero special case and returning a neutral sentinel such as 0.0.

Suggested fix
 double homogeneity_score(bool dry_run,
                          const T* truthClusterArray,
                          const T* predClusterArray,
                          int size,
                          T lowerLabelRange,
                          T upperLabelRange,
                          cudaStream_t stream)
 {
   if (size == 0) return 1.0;
+  if (dry_run) return 0.0;
 
   double computedMI, computedEntropy;
 
   computedMI = mutual_info_score(
     dry_run, truthClusterArray, predClusterArray, size, lowerLabelRange, upperLabelRange, stream);
   computedEntropy =
     entropy(dry_run, truthClusterArray, size, lowerLabelRange, upperLabelRange, stream);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/detail/homogeneity_score.cuh` around lines 46 - 56,
The homogeneity wrapper is returning a perfect score during dry-run because both
`mutual_info_score(...)` and `entropy(...)` can short-circuit to zero, which
then triggers the `computedEntropy == 0` fallback in `homogeneity_score(...)`.
Update this logic so `dry_run` is handled explicitly before the entropy-zero
branch, and return a neutral sentinel such as `0.0` instead of `1.0` when no
real computation is performed. Keep the change localized to the
`homogeneity_score` path in this header and preserve the existing non-dry-run
behavior for normal inputs.
cpp/include/raft/stats/detail/meanvar.cuh (1)

203-207: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

HIGH: Dry-run still queries CUDA occupancy in the row-major path.

Line 204 and Line 207 execute CUDA/device queries before the if (!dry_run) guard, so meanvar(..., dry_run=true, rowMajor=true) still touches the CUDA runtime. That breaks the dry-run contract and can fail on setups using dry-run specifically to avoid GPU work.

Suggested fix
-    int occupancy;
-    RAFT_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
-      &occupancy, meanvar_kernel_rowmajor<T, I, BlockSize>, BlockSize, 0));
-    gs.y =
-      std::min(gs.y, raft::ceildiv<decltype(gs.y)>(occupancy * getMultiProcessorCount(), gs.x));
+    if (!dry_run) {
+      int occupancy;
+      RAFT_CUDA_TRY(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
+        &occupancy, meanvar_kernel_rowmajor<T, I, BlockSize>, BlockSize, 0));
+      gs.y = std::min(
+        gs.y, raft::ceildiv<decltype(gs.y)>(occupancy * getMultiProcessorCount(), gs.x));
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/detail/meanvar.cuh` around lines 203 - 207, The
row-major dry-run path in meanvar still performs CUDA runtime queries before the
dry_run check, so move the occupancy and multiprocessor-count logic behind the
existing dry_run guard. Update the meanvar row-major branch in meanvar.cuh so
meanvar(..., dry_run=true, rowMajor=true) skips
cudaOccupancyMaxActiveBlocksPerMultiprocessor and getMultiProcessorCount
entirely, only computing gs.y when not in dry-run mode.

Source: Path instructions

cpp/include/raft/core/host_mdarray.hpp (1)

219-228: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run should not suppress host-scalar initialization.

This write is host-side only, so skipping it leaves make_host_scalar(res, v) uninitialized under dry-run and can make host-side branching or workspace sizing diverge from real execution. Consider keeping the assignment unconditional here.

Suggested fix
-  if (!resource::get_dry_run_flag(res)) { scalar(0) = v; }
+  scalar(0) = v;

As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking raft::resources should remain safe to call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/host_mdarray.hpp` around lines 219 - 228, The host-side
initialization in make_host_scalar is being skipped when
resource::get_dry_run_flag(res) is set, which leaves the returned host_scalar
uninitialized during dry-run. Update make_host_scalar in host_mdarray.hpp so the
assignment to scalar(0) = v happens unconditionally after construction, while
keeping any CUDA or expensive work guarded elsewhere; this preserves correct
host-side behavior without breaking dry-run safety.

Source: Path instructions

cpp/include/raft/core/pinned_mdarray.hpp (1)

115-122: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run should not skip pinned-scalar initialization.

This is still a host-side write. Guarding it leaves make_pinned_scalar(handle, v) undefined in dry-run and can change host-visible control flow without improving allocation tracking.

Suggested fix
-  if (!resource::get_dry_run_flag(handle)) { scalar(0) = v; }
+  scalar(0) = v;

As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking raft::resources should remain safe to call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/pinned_mdarray.hpp` around lines 115 - 122, The dry-run
guard in make_pinned_scalar is incorrectly skipping the host-side assignment to
the pinned_scalar, leaving the returned scalar uninitialized in dry-run mode.
Remove the conditional around the scalar initialization in make_pinned_scalar so
the value is always written after constructing pinned_scalar<ElementType>, while
keeping any CUDA/allocation-related behavior safe through the existing
resource/policy path.

Source: Path instructions

cpp/include/raft/core/managed_mdarray.hpp (1)

115-122: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run should still initialize managed scalars.

managed_scalar is host-accessible, so guarding this write can leave the value undefined during dry-run and make any host-side logic that inspects it diverge from the real path. The allocation is already preserved, so this assignment should stay unconditional.

Suggested fix
-  if (!resource::get_dry_run_flag(handle)) { scalar(0) = v; }
+  scalar(0) = v;

As per path instructions, "Any expensive function or any function involving CUDA-calls must be guarded" and dry-run APIs taking raft::resources should remain safe to call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/managed_mdarray.hpp` around lines 115 - 122, The
make_managed_scalar helper is conditionally skipping initialization during
dry-run, which can leave managed_scalar<ElementType> uninitialized even though
it is host-accessible. Update make_managed_scalar so the scalar(0) = v
assignment is unconditional, while keeping the allocation path safe for dry-run;
use the existing make_managed_scalar and managed_scalar<ElementType> symbols to
locate the write and remove the resource::get_dry_run_flag(handle) guard around
it.

Source: Path instructions

🟠 Major comments (26)
cpp/include/raft/linalg/detail/norm.cuh-65-67 (1)

65-67: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: colNormCaller<LinfNorm> bypasses dry-run.

Line 67 passes a hardcoded false, so column-wise LinfNorm still runs the reduction path during dry-run instead of only tracking allocations.

Suggested fix
   } else if constexpr (norm_type == LinfNorm) {
     reduce<rowMajor, false, Type, OutType, IdxType>(
-      false, dots, data, D, N, (OutType)0, stream, false, raft::abs_op(), raft::max_op(), fin_op);
+      dry_run, dots, data, D, N, (OutType)0, stream, false, raft::abs_op(), raft::max_op(), fin_op);
   } else {

As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via resource::get_dry_run_flag(res).”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/detail/norm.cuh` around lines 65 - 67,
`colNormCaller<LinfNorm>` is still invoking the reduction path during dry-run
because it passes a hardcoded false into `reduce`; update this branch to follow
the same dry-run guard used elsewhere by checking
`resource::get_dry_run_flag(res)` before calling `reduce`. Keep the fix
localized in `colNormCaller` within norm.cuh, and ensure the `LinfNorm` path
only performs allocation tracking when dry-run is enabled rather than executing
the CUDA reduction.

Source: Path instructions

cpp/include/raft/linalg/strided_reduction.cuh-132-132 (1)

132-132: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Dry-run masks invalid strided_reduction calls.

Line 132 returns before the output-size checks and before the unsupported-type guard inside stridedReduction(...), so a dry-run can succeed for a call that the real path rejects immediately. Have you considered leaving the validation/type gate in place and skipping only the final launch?

As per coding guidelines, Public Interface says to “validate their metadata wherever possible.” Based on path instructions, dry-run should guard expensive/CUDA work via resource::get_dry_run_flag(res), not bypass cheap host-side validation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/strided_reduction.cuh` at line 132, The dry-run early
return in stridedReduction is skipping host-side validation, so invalid calls
can appear to succeed. Keep the metadata/output-size checks and unsupported-type
guard in place in stridedReduction, and use resource::get_dry_run_flag(handle)
only to skip the expensive CUDA launch path after validation has already run.

Sources: Coding guidelines, Path instructions

cpp/include/raft/linalg/add.cuh-105-107 (1)

105-107: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Keep public mdspan validation outside the dry-run fast path.

Lines 107, 145, and 181 return before the contiguity/size RAFT_EXPECTS checks, so dry-run can report success for malformed inputs that the real execution path still rejects. Have you considered moving the dry-run guard to just before the actual kernel dispatch instead?

As per coding guidelines, Public Interface says to “validate their metadata wherever possible.” Based on path instructions, dry-run should guard expensive/CUDA work via resource::get_dry_run_flag(res), not bypass cheap host-side validation.

Also applies to: 145-145, 181-181

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/add.cuh` around lines 105 - 107, The dry-run early
return in add and the other public overloads is bypassing the mdspan contiguity
and size RAFT_EXPECTS checks, so malformed inputs can incorrectly succeed. Move
the resource::get_dry_run_flag(handle) guard in add, and the equivalent guards
in the other overloads, to just before the actual kernel dispatch so host-side
validation always runs while only the expensive CUDA work is skipped.

Sources: Coding guidelines, Path instructions

cpp/include/raft/matrix/sqrt.cuh-37-40 (1)

37-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Preserve the shape checks in dry-run mode.

Both out-of-place overloads return before RAFT_EXPECTS(in.size() == out.size()), so dry-run hides mismatched shapes that real execution would fail on.

Validate sizes first, then short-circuit before launching the math kernels.

Also applies to: 76-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/matrix/sqrt.cuh` around lines 37 - 40, The out-of-place sqrt
overloads in the sqrt.cuh API are returning early on
resource::get_dry_run_flag(handle) before the RAFT_EXPECTS size check, which
lets mismatched input/output shapes slip through in dry-run mode. Move the shape
validation in the affected sqrt overloads (the out-of-place paths around the
seqRoot call) so RAFT_EXPECTS(in.size() == out.size()) runs before any dry-run
short-circuit, then keep the dry-run return only after validation and before
launching the kernel.
cpp/include/raft/matrix/detail/select_k-inl.cuh-131-132 (1)

131-132: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: This dry-run guard is too deep for the sorted radix path.

When sorted == true, the caller still launches raft::linalg::map_offset(...) to populate offsets before reaching this return, so dry-run still performs CUDA work on the radix+sort path. That breaks the dry-run contract even though the actual segmented sort is skipped.

As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via resource::get_dry_run_flag(res).”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/matrix/detail/select_k-inl.cuh` around lines 131 - 132, The
dry-run guard in the select-k flow is placed too late, so the sorted radix path
still executes CUDA work through raft::linalg::map_offset before returning. Move
the resource::get_dry_run_flag(handle) check earlier in the select-k
implementation, before any path that can launch CUDA work, and ensure the guard
covers the sorted==true branch as well as the unsorted path. Use the select-k
logic around raft::linalg::map_offset and the surrounding sorted radix handling
to locate the fix.

Source: Path instructions

cpp/include/raft/matrix/detail/select_warpsort.cuh-1193-1193 (1)

1193-1193: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: This early return under-reports dry-run memory for warpsort.

select_k_() now does the right thing by allocating tmp_val/tmp_idx before checking dry_run, but Line 1193 returns before that helper is reached. Any caller using this overload in dry-run will miss those allocations and get a smaller peak-memory estimate.

As per path instructions, “Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics.”

Suggested fix
-  if (resource::get_dry_run_flag(res)) { return; }
   ASSERT(k <= kMaxCapacity, "Current max k is %d (requested %d)", kMaxCapacity, k);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/matrix/detail/select_warpsort.cuh` at line 1193, The dry-run
guard in the warpsort path is skipping allocation accounting, so peak-memory
estimates are too low. Update the overload in select_warpsort.cuh around the
select_k_() call site to avoid returning before the helper runs, and let the
tmp_val/tmp_idx allocation attempts be reached even when
resource::get_dry_run_flag(res) is true. Keep the dry-run behavior in the
downstream logic, but do not guard the rmm/raft-backed allocations that must be
tracked for memory statistics.

Source: Path instructions

cpp/include/raft/sparse/solver/detail/lanczos.cuh-198-227 (1)

198-227: 🩺 Stability & Availability | 🟠 Major

HIGH: Use the RAFT Thrust policy here.
thrust::sequence and thrust::sort on thrust::device can run outside handle’s stream, so the later raft::matrix::gather(...) calls may read unsynchronized indices. Use raft::resource::get_thrust_policy(handle) here instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/sparse/solver/detail/lanczos.cuh` around lines 198 - 227,
The index initialization and sorting in the lanczos selection path currently
uses the generic thrust::device policy, which can bypass the handle stream and
leave later gather operations racing unsynchronized data. Update the
thrust::sequence and both thrust::sort calls in this selection block to use the
RAFT Thrust policy from raft::resource::get_thrust_policy(handle), keeping the
existing indices/selected_indices logic intact so all work stays ordered on the
handle’s stream.

Sources: Coding guidelines, Path instructions

cpp/include/raft/solver/detail/lap_functions.cuh-267-276 (1)

267-276: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Zero-cover dry-run undercounts its peak workspace.

executeZeroCover() now returns before zeroCoverIteration() is reached, and the dry-run branch inside zeroCoverIteration() only sizes the CSR buffers. The real path also allocates predicates_v and addresses_v at SP * N, so the dry-run report will miss the dominant memory in this phase.

As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked to produce accurate memory statistics," and "early-return dry-run guards don’t accidentally skip required resource setup/teardown..."

Also applies to: 353-353

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/solver/detail/lap_functions.cuh` around lines 267 - 276, The
dry-run path in zero-cover is undercounting workspace because
`executeZeroCover()` now exits before `zeroCoverIteration()`, and
`zeroCoverIteration()`’s dry-run branch only allocates the CSR buffers while
skipping the `predicates_v` and `addresses_v` allocations that dominate memory
use. Update `executeZeroCover`/`zeroCoverIteration` so dry-run still performs
the same allocation tracking as the real path, including the `predicates_v` and
`addresses_v` `rmm::device_uvector` allocations, while still returning early
before any compute; keep the behavior aligned with the existing
`csr_ptrs_v`/`csr_neighbors_v` setup so memory statistics remain accurate.

Source: Path instructions

cpp/include/raft/sparse/linalg/detail/spmm.hpp-90-95 (1)

90-95: 🩺 Stability & Availability | 🟠 Major

HIGH: bufferSize is a byte count, but tmp is allocated as ValueType elements.

cusparsespmm_bufferSize returns bytes, so this over-allocates the workspace by sizeof(ValueType) and skews dry-run memory accounting. Use a byte buffer or convert the count to elements before allocating.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/sparse/linalg/detail/spmm.hpp` around lines 90 - 95, The
workspace allocation in the sparse SpMM path is using `bufferSize` as if it were
a count of `ValueType` elements, but `cusparsespmm_bufferSize` returns bytes.
Update the allocation in the `spmm`/workspace setup around `tmp` to treat the
buffer as raw bytes or to convert bytes to element count before constructing the
`rmm::device_uvector`, and make sure the dry-run path still accounts for the
same byte-sized workspace.
cpp/include/raft/linalg/detail/lstsq.cuh-213-216 (1)

213-216: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

HIGH: Clean up gesvdj_params on the non-dry-run path too.

Issue: the dry-run branch destroys gesvdj_params, but the normal path exits after Line 244 without cusolverDnDestroyGesvdjInfo.
Why: repeated lstsqSvdJacobi calls leak cuSOLVER resources.

As per path instructions, cpp/REVIEW_GUIDELINES.md requires checking resource cleanup on all paths.

Suggested fix
   raft::linalg::gemv(handle, U, n_rows, minmn, b, Ub, true, stream);
   raft::linalg::binaryOp(Ub, Ub, S, minmn, DivideByNonZero<math_t>(), stream);
   raft::linalg::gemv(handle, V, n_cols, minmn, Ub, w, false, stream);
+  RAFT_CUSOLVER_TRY(cusolverDnDestroyGesvdjInfo(gesvdj_params));
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/detail/lstsq.cuh` around lines 213 - 216, The
`lstsqSvdJacobi` flow cleans up `gesvdj_params` only in the
`resource::get_dry_run_flag(handle)` branch, so add matching
`cusolverDnDestroyGesvdjInfo(gesvdj_params)` cleanup on the normal execution
path before the function returns. Make sure every exit path in `lstsq.cuh` that
creates `gesvdj_params` also destroys it, using the same `gesvdj_params` symbol
so repeated calls do not leak cuSOLVER resources.

Source: Path instructions

cpp/include/raft/stats/detail/scores.cuh-162-184 (1)

162-184: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

HIGH: Use element counts for device_uvector and avoid untracked host allocation in dry-run.

Issue: device_uvector<double> sizes are element counts, so n * sizeof(double) allocates 8× too many doubles; dry-run also allocates h_sorted_abs_diffs(n) via std::vector before returning.
Why: this over-reports dry-run memory and can cause avoidable OOM.

As per path instructions, dry-run must keep rmm/raft allocation tracking accurate while skipping unnecessary work.

Suggested fix
-  int array_size = n * sizeof(double);
-  rmm::device_uvector<double> abs_diffs_array(array_size, stream);
-  rmm::device_uvector<double> sorted_abs_diffs(array_size, stream);
-  rmm::device_uvector<double> tmp_sums(2 * sizeof(double), stream);
+  rmm::device_uvector<double> abs_diffs_array(n, stream);
+  rmm::device_uvector<double> sorted_abs_diffs(n, stream);
+  rmm::device_uvector<double> tmp_sums(2, stream);
@@
-  std::vector<double> mean_errors(2);
-  std::vector<double> h_sorted_abs_diffs(n);
   int thread_cnt = 256;
   int block_cnt  = raft::ceildiv(n, thread_cnt);
 
   if (dry_run) { return; }
+
+  std::vector<double> mean_errors(2);
+  std::vector<double> h_sorted_abs_diffs(n);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/detail/scores.cuh` around lines 162 - 184, The
allocation logic in the scores computation is using byte counts where
`rmm::device_uvector<double>` expects element counts, so `abs_diffs_array`,
`sorted_abs_diffs`, and `tmp_sums` should be sized in doubles rather than
`sizeof(double)` multiples; fix this in the relevant scoring routine before the
CUB `SortKeys` call. Also move the host-side `std::vector<double>
h_sorted_abs_diffs` allocation behind the `dry_run` early return (or otherwise
skip it during dry-run) so dry-run only accounts for tracked rmm/raft device
allocations and does not create untracked host memory.

Source: Path instructions

cpp/include/raft/random/detail/multi_variable_gaussian.cuh-188-190 (1)

188-190: 🩺 Stability & Availability | 🟠 Major

HIGH: Guard MVG setup before the constructor does CUDA work cpp/include/raft/random/detail/multi_variable_gaussian.cuh:146-190

Issue: give_gaussian() exits in dry-run, but build_multi_variable_gaussian_token_impl() still constructs multi_variable_gaussian_impl, which creates the CURAND generator and runs cuSOLVER buffer-size calls first.
Impact: dry-run paths still execute CUDA/library setup and can fail before reaching the guard.

Have you considered making the token/constructor setup dry-run-aware and deferring only the workspace accounting?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/random/detail/multi_variable_gaussian.cuh` around lines 188
- 190, The dry-run guard in give_gaussian() is too late because
multi_variable_gaussian_impl is still constructed earlier by
build_multi_variable_gaussian_token_impl(), which triggers CURAND and cuSOLVER
setup before the early return. Make the token/constructor path dry-run-aware by
checking resource::get_dry_run_flag(handle) before creating
multi_variable_gaussian_impl, and defer only the workspace accounting needed for
the token so dry-run execution avoids all CUDA/library initialization.

Source: Path instructions

cpp/include/raft/stats/detail/adjusted_rand_index.cuh-125-150 (1)

125-150: 🩺 Stability & Availability | 🟠 Major

Use a wider type for contingency sizing

Issue: nUniqClasses * nUniqClasses is evaluated in MathT; with the default MathT=int, it overflows past ~46k classes/samples.
Why: dry-run sets nUniqClasses = size, so large inputs can turn the allocation size into a bogus value before any work runs. The same risk exists in the non-dry-run path when the label range is large.
Consider computing the extent in size_t (or another wider type) before allocating the contingency buffers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/detail/adjusted_rand_index.cuh` around lines 125 -
150, The contingency buffer sizing in adjusted_rand_index.cuh is computed in
MathT, which can overflow when nUniqClasses is large, especially in the dry_run
path where it can equal size. Update the sizing logic around the
dContingencyMatrix allocation and the nUniqClasses calculation in the adjusted
rand index helper to use a wider extent type such as size_t before multiplying,
then cast safely for the device allocation. Keep the existing dry_run and
countUnique flow intact, but ensure the final matrix element count is computed
in an overflow-safe type.

Sources: Coding guidelines, Path instructions

cpp/tests/core/temporary_device_buffer.cu-67-68 (1)

67-68: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Raw-stream copy bypasses dry-run here.

This raft::copy(..., resource::get_cuda_stream(h)) call still performs real CUDA work during the dry-run pass, so the test can mutate result while claiming the path is dry-run compliant. Use the raft::resources overload (or guard the raw copy) inside the lambda.

Suggested fix
-        raft::copy(
-          result.data(), d_view.data_handle(), d_view.extent(0), resource::get_cuda_stream(h));
+        auto result_view = raft::make_device_vector_view<int>(result.data(), d_view.extent(0));
+        raft::copy(h, result_view, raft::make_const_mdspan(d_view));

As per path instructions, use raft::resources instead of raw streams/handles, and guard CUDA work in dry-run mode via resource::get_dry_run_flag(res).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/core/temporary_device_buffer.cu` around lines 67 - 68, The copy
inside the temporary device buffer test is using a raw CUDA stream, which
bypasses dry-run handling and can still perform real device work. Update the
lambda around the raft::copy call to use the raft::resources-based overload tied
to the existing resource handle, and gate the copy with
resource::get_dry_run_flag(res) so the dry-run pass does not mutate result. Keep
the fix localized to the temporary_device_buffer test and the lambda containing
raft::copy.

Source: Path instructions

cpp/tests/core/bitset.cu-331-332 (1)

331-332: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Guard raw CUDA work inside the dry-run lambda.

Issue: Line 331 and Line 387 still launch GPU work during the dry-run pass.
Why: This violates dry-run semantics and can mask non-compliant kernels.

Suggested fix
-          RAFT_CUDA_TRY(cudaMemsetAsync(
-            repeat_device.data_handle(), 0, eval_n_elements * sizeof(bitset_t), stream));
+          if (!resource::get_dry_run_flag(h)) {
+            RAFT_CUDA_TRY(cudaMemsetAsync(repeat_device.data_handle(),
+                                          0,
+                                          eval_n_elements * sizeof(bitset_t),
+                                          resource::get_cuda_stream(h)));
+          }
...
-        raft::linalg::range(query_device.data_handle(), query_device.size(), stream);
+        if (!resource::get_dry_run_flag(h)) {
+          raft::linalg::range(
+            query_device.data_handle(), query_device.size(), resource::get_cuda_stream(h));
+        }

As per path instructions, “Any expensive function or any function involving CUDA-calls must be guarded via resource::get_dry_run_flag(res).”

Also applies to: 387-387

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/core/bitset.cu` around lines 331 - 332, Guard the raw CUDA work in
the dry-run path so it only runs when resource::get_dry_run_flag(res) is false.
Update the bitset test logic around the dry-run lambda in the relevant test
helpers (including the cudaMemsetAsync call and the other CUDA launch at the
referenced location) to skip these GPU operations during dry-run, keeping the
lambda’s non-CUDA validation separate from device work.

Source: Path instructions

cpp/tests/linalg/rsvd.cu-128-149 (1)

128-149: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: Percentage-based RSVD cases never hit the percentage API path.

Line 131 is dead here: params.k is already rewritten from 0 to a concrete rank earlier in SetUp(), so the ratio-based fixtures now run the fixed-rank Jacobi variants instead of rsvd_perc*. That drops coverage for the public percentage APIs and can hide dry-run regressions there.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/linalg/rsvd.cu` around lines 128 - 149, The percentage-based RSVD
branch in the RSVD test is unreachable because `SetUp()` rewrites `params.k`
before this lambda runs, so the test never exercises `rsvd_perc` or
`rsvd_perc_symmetric`. Update the selection logic in the `rsvd` test around
`raft::execute_with_dry_run_check` to branch on the original test mode or an
explicit percentage/fixed-rank flag instead of `params.k`, so the ratio-based
fixtures actually call the percentage APIs while fixed-rank cases continue to
use `rsvd_fixed_rank_jacobi` and `rsvd_fixed_rank_symmetric_jacobi`.
cpp/tests/sparse/csr_transpose.cu-97-113 (1)

97-113: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

HIGH: This dry-run wrapper breaks stream ordering with the test’s prepared inputs.

make_data() uploads into buffers on raft_handle/stream, but this lambda launches csr_transpose on a different raft::resources stream via resource::get_cuda_stream(h). Without an explicit dependency, transpose can observe partially populated inputs. Have you considered running execute_with_dry_run_check on raft_handle instead so setup and compute stay ordered on the same resource? As per path instructions, verify cpp/**/* changes against docs/source/developer_guide.md, especially “Asynchronous operations and stream ordering”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/sparse/csr_transpose.cu` around lines 97 - 113, The dry-run wrapper
in csr_transpose test is launching work on a different stream than the one used
by make_data(), which can break ordering and let csr_transpose read inputs
before uploads complete. Update the execute_with_dry_run_check call to use the
same raft_handle/resource context as the test setup so the prepared buffers and
the transpose kernel share stream ordering, and keep the relevant
resource/stream selection aligned through resource::get_cuda_stream and the
surrounding test harness.

Source: Path instructions

docs/source/dry_run_protocol.md-43-49 (1)

43-49: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

The example teaches the wrong stream ownership model.

This snippet introduces a public-style API that takes a raw cudaStream_t alongside raft::resources, which contradicts RAFT’s dry-run/resource guidance and can mislead users away from resource-ordered streams. Show auto stream = resource::get_cuda_stream(handle); instead of accepting a separate stream parameter. As per coding guidelines, prefer “raft::resources over raw handles/streams.” As per path instructions, docs changes should prioritize accuracy and consistency with current APIs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/source/dry_run_protocol.md` around lines 43 - 49, The dry-run example in
algo currently models an incorrect ownership pattern by accepting a raw
cudaStream_t alongside raft::resources, which conflicts with the resource-first
API guidance. Update the example to derive the stream from raft::resources using
resource::get_cuda_stream(handle) inside algo, and remove the separate stream
parameter so the signature and call sites reflect the preferred resource-ordered
model. Keep the rest of the snippet aligned with this RAFT API style and ensure
the doc example stays consistent with current dry-run guidance.

Sources: Coding guidelines, Path instructions

cpp/tests/stats/meanvar.cu-84-85 (1)

84-85: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

HIGH: The allocation assertion now depends on a private detail type.

Using sizeof(raft::stats::detail::mean_var<T>) bakes private workspace layout into the test. That makes this suite fail on benign internal refactors instead of public dry-run regressions.

As per coding guidelines, "Public API over detail. When re-using functionality or writing tests, call the public API rather than functions in the detail namespace." As per path instructions, verify the code follows the idioms in docs/source/developer_guide.md, especially the "Preferred APIs and idioms" section.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/stats/meanvar.cu` around lines 84 - 85, The allocation assertion in
the mean/var test is tied to the private mean_var detail type, so update the
check to use the public stats API or a public dry-run/size query instead of
sizeof(raft::stats::detail::mean_var<T>). Keep the existing rowMajor and
params.cols logic in the test, but replace the private workspace-size dependency
with the public symbol used by the mean/variance path so the test only validates
supported API behavior.

Sources: Coding guidelines, Path instructions

cpp/tests/stats/cov.cu-67-91 (1)

67-91: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

HIGH: This still leaves one cov entry point unverified under dry-run.

Lines 67-91 wrap the mdspan overload, but Line 106 still calls cov<false>(handle, ..., stream) directly. That means a regression in that public raft::resources path can slip through while this file appears dry-run covered.

As per path instructions, "Any function taking a raft::resources handle is safe to call in dry-run mode."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/stats/cov.cu` around lines 67 - 91, The dry-run coverage in cov.cu
only exercises the mdspan overload, while the public raft::resources entry point
cov<false>(handle, ..., stream) remains unverified. Update the test to also
invoke the raft::resources-based cov path inside the dry-run check, using the
existing cov and raft::execute_with_dry_run_check symbols, so both overloads are
covered under dry-run and regressions in the handle-taking API are caught.

Source: Path instructions

cpp/tests/util/bitonic_sort.cu-146-152 (1)

146-152: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: this dry-run callback still launches the kernel.

The lambda ignores h and calls the raw-stream bitonic_launch directly, so the dry-run leg still executes bitonic_kernel<<<...>>>. That means this test never actually validates “skip GPU work in dry-run mode”; it only re-runs the kernel. Have you considered guarding the launch with if (!resource::get_dry_run_flag(h)) or routing through a raft::resources-aware helper? As per path instructions, dry-run CUDA work must be guarded via resource::get_dry_run_flag(res) and RAFT code should prefer raft::resources over raw streams.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/util/bitonic_sort.cu` around lines 146 - 152, The dry-run test
callback in bitonic sort is still invoking the GPU launch unconditionally, so it
never actually verifies that dry-run skips work. Update the lambda passed to
execute_with_dry_run_check to use the provided raft::resources argument instead
of ignoring it, and guard the bitonic_launch invocation with
resource::get_dry_run_flag(h) (or equivalent raft::resources-aware helper) so
the kernel only runs when not in dry-run mode. Prefer threading raft::resources
through the launch path rather than calling the raw-stream bitonic_launch
directly.

Source: Path instructions

cpp/tests/sparse/symmetrize.cu-95-104 (1)

95-104: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

HIGH: out escapes the dry-run resource scope.

out is an owning COO created before execute_with_dry_run_check, then populated inside the callback and dereferenced afterward. If the dry-run leg resizes/fills those buffers against the temporary dry-run resources, the later assert_symmetry launch can end up reading storage whose allocator/resource context has already been torn down. Have you considered keeping the owning COO entirely inside the callback (and asserting only on the non-dry-run leg), or splitting the dry-run accounting from the functional validation? As per path instructions, early-return dry-run guards must not break required resource setup/teardown, and the developer guide requires dry-run allocation tracking to stay within the dry-run resource scope.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/sparse/symmetrize.cu` around lines 95 - 104, `out` is being created
outside `execute_with_dry_run_check`, so the owning COO can outlive the dry-run
resource scope and be used after the temporary allocator is torn down. Move the
`raft::sparse::COO<value_t, value_idx, nnz_t> out` setup and the
`symmetrize`/`assert_symmetry` validation into the non-dry-run path, or
otherwise keep the owning storage entirely within the callback used by
`execute_with_dry_run_check`. Use the `execute_with_dry_run_check` and
`raft::sparse::linalg::symmetrize` sites as the key locations to refactor so
dry-run accounting stays separate from functional validation.

Source: Path instructions

cpp/include/raft/core/detail/copy.hpp-403-405 (1)

403-405: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: This guard drops copy-side workspace allocations from dry-run.

detail::copy allocates an intermediate device_mdarray in the use_intermediate_src / use_intermediate_dst paths. Returning here skips those allocations entirely, so dry-run under-reports peak memory for layout/dtype conversion copies.

Suggested fix
-  if (resource::get_dry_run_flag(res)) { return; }
+  if constexpr (!(config::use_intermediate_src || config::use_intermediate_dst)) {
+    if (resource::get_dry_run_flag(res)) { return; }
+  }

As per path instructions, "Allocations through rmm/raft memory resources must NOT be guarded—allocation attempts must still be tracked" and dry-run guards must not skip required setup/teardown.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/detail/copy.hpp` around lines 403 - 405, The dry-run
early return in detail::copy is skipping workspace allocations needed by the
use_intermediate_src and use_intermediate_dst paths, so peak memory is
under-reported. Remove the blanket guard at the start of copy and ensure the
intermediate device_mdarray setup still runs under dry-run so allocation
attempts through the memory resource are tracked; only skip the actual data
movement if needed. Keep the fix localized to detail::copy and the
intermediate-copy branches that allocate temporary buffers.

Source: Path instructions

cpp/include/raft/core/sparse_types.hpp-181-181 (1)

181-181: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

HIGH: initialize_sparsity() no longer updates the sparse structure.

Line 181 only reallocates c_elements_. get_elements()/view() still size from structure_.get_nnz(), and any StructureType::initialize_sparsity() override is skipped, so the matrix can keep reporting the old nnz while the backing storage changed.

Suggested fix
-  void initialize_sparsity(nnz_type nnz) { c_elements_.reallocate(nnz); };
+  void initialize_sparsity(nnz_type nnz)
+  {
+    structure_.initialize_sparsity(nnz);
+    c_elements_.reallocate(nnz);
+  };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/sparse_types.hpp` at line 181, `initialize_sparsity()`
in `sparse_types.hpp` only reallocates `c_elements_`, so the sparse metadata
stays stale and any `StructureType::initialize_sparsity()` override is bypassed.
Update `initialize_sparsity(nnz_type nnz)` to also initialize the underlying
`structure_` and ensure `structure_.get_nnz()` reflects the new sparsity before
`get_elements()` and `view()` are used. Preserve the structure-specific
initialization path by invoking the structure’s own `initialize_sparsity`
behavior instead of replacing it with a bare reallocate.

Source: Path instructions

cpp/include/raft/core/dry_run_resources.hpp-54-74 (1)

54-74: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

HIGH: Serialize process-global resource swaps.

Issue: each active dry_run_resources instance snapshots and replaces process-global host/device resources without a lifetime lock.
Why: overlapping dry-run scopes can restore globals out of order, making unrelated allocations hit the wrong or destroyed adaptor.

Suggested fix
 `#include` <cstddef>
 `#include` <cstdint>
 `#include` <memory>
+#include <mutex>
 `#include` <utility>
@@
 class dry_run_resources : public resources {
  public:
   explicit dry_run_resources(const resources& existing)
     : resources(existing),
+      global_resource_lock_(global_resource_mutex()),
       active_(!resource::get_dry_run_flag(existing)),
       old_host_(raft::mr::get_default_host_resource()),
       old_device_(rmm::mr::get_current_device_resource_ref())
@@
  private:
+  static auto global_resource_mutex() -> std::recursive_mutex&
+  {
+    static std::recursive_mutex mutex;
+    return mutex;
+  }
+
+  // Hold while process-global memory resources are swapped.
+  std::unique_lock<std::recursive_mutex> global_resource_lock_;
+
   // Declaration order determines destruction order.

As per coding guidelines, RAFT algorithms should remain thread-safe with different raft::resources instances.

Also applies to: 159-192

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/dry_run_resources.hpp` around lines 54 - 74, Serialize
the process-global host/device resource swap performed by dry_run_resources so
overlapping active scopes cannot restore resources out of order. Update
dry_run_resources’ constructor/destructor path to take and hold a shared
lifetime lock around the existing snapshots and the calls to
raft::mr::set_default_host_resource and rmm::mr::set_current_device_resource,
and keep the lock until the base-class cleanup completes. Use the
dry_run_resources symbols and the init()/destructor flow to ensure only one
active instance can replace the global adaptors at a time while still allowing
separate raft::resources instances to remain thread-safe.

Source: Coding guidelines

cpp/include/raft/mr/dry_run_resource.hpp-80-98 (1)

80-98: 🩺 Stability & Availability | 🟠 Major

HIGH: Keep probe deallocation on the allocation stream.

probe_container allocates the async probe with the caller’s cuda::stream_ref, but its destructor always frees it on cudaStreamPerThread. For stream-ordered memory resources, that can break ordering and race with pending work on the original stream.

Suggested fix
 class probe_container {
   MR mr_;
   void* ptr_;
   std::size_t size_;
   std::size_t alignment_;
+  cuda::stream_ref stream_{cudaStreamPerThread};
@@
   probe_container(MR mr,
                   cuda::stream_ref stream,
                   std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t))
-    : mr_(std::move(mr)), ptr_(nullptr), size_(size), alignment_(alignment)
+    : mr_(std::move(mr)), ptr_(nullptr), size_(size), alignment_(alignment), stream_(stream)
   {
     ptr_ = mr_.allocate(stream, size_, alignment_);
   }
@@
     if (ptr_ == nullptr) return;
     if constexpr (cuda::mr::resource<MR>) {
-      mr_.deallocate(cuda::stream_ref{cudaStreamPerThread}, ptr_, size_, alignment_);
+      mr_.deallocate(stream_, ptr_, size_, alignment_);
     } else {
       mr_.deallocate_sync(ptr_, size_, alignment_);
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/mr/dry_run_resource.hpp` around lines 80 - 98,
probe_container in dry_run_resource.hpp allocates using the caller-provided
cuda::stream_ref but currently deallocates on cudaStreamPerThread in the
destructor, which breaks stream ordering for cuda::mr::resource types. Update
probe_container to retain the allocation stream as state and use that same
stream when calling mr_.deallocate, while preserving the existing synchronous
deallocation path for non-resource MR implementations.

Source: Path instructions

Comment on lines +54 to +60
explicit dry_run_resources(const resources& existing)
: resources(existing),
active_(!resource::get_dry_run_flag(existing)),
old_host_(raft::mr::get_default_host_resource()),
old_device_(rmm::mr::get_current_device_resource_ref())
{
if (active_) init();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

CRITICAL: Restore global resources if init() throws.

Issue: init() replaces global host/device resources, then continues through operations that can throw; if construction fails, ~dry_run_resources() is never called.
Why: the global resource can be left pointing at a dry-run adaptor that is being unwound, causing later allocations to use a dangling resource.

Suggested fix
   {
-    if (active_) init();
+    if (active_) {
+      try {
+        init();
+      } catch (...) {
+        resource::set_dry_run_flag(*this, false);
+        raft::mr::set_default_host_resource(old_host_);
+        rmm::mr::set_current_device_resource(old_device_);
+        resources_.clear();
+        factories_.clear();
+        throw;
+      }
+    }
   }

As per path instructions, dry-run guards must not skip required resource setup/teardown that can affect later non-dry-run calls.

Also applies to: 159-208

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/dry_run_resources.hpp` around lines 54 - 60, The
`dry_run_resources` construction path in `init()` can leave global host/device
resources pointing at the dry-run adaptor if an exception is thrown before
`~dry_run_resources()` runs. Update the `dry_run_resources(const resources&)`
constructor and/or `init()` so the old resources saved in `old_host_` and
`old_device_` are restored immediately on any failure during setup, using an
exception-safe guard/rollback path that also covers the later setup block
referenced by the same issue.

Source: Path instructions

Comment on lines +202 to 205
rmm::device_uvector<math_t> d_eig_vectors(memUsage == COPY_INPUT ? n_rows * n_cols : 0, stream);

if (resource::get_dry_run_flag(handle)) { return; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

CRITICAL: eigSelDC(COPY_INPUT) truncates from an uninitialized buffer.

When memUsage == COPY_INPUT, Lines 225-245 still copy/solve into eig_vectors, but Lines 260-265 truncate from d_eig_vectors, which this branch never populates. That turns the copy-preserving path into silent garbage output.

Suggested fix
-  } else if (memUsage == COPY_INPUT) {
-    raft::matrix::copy(handle,
-                       make_device_matrix_view<const math_t>(in, n_rows, n_cols),
-                       make_device_matrix_view(eig_vectors, n_rows, n_cols));
+  } else if (memUsage == COPY_INPUT) {
+    raft::matrix::copy(handle,
+                       make_device_matrix_view<const math_t>(in, n_rows, n_cols),
+                       make_device_matrix_view(d_eig_vectors.data(), n_rows, n_cols));

     RAFT_CUSOLVER_TRY(cusolverDnsyevdx(cusolverH,
                                        CUSOLVER_EIG_MODE_VECTOR,
                                        CUSOLVER_EIG_RANGE_I,
                                        CUBLAS_FILL_MODE_UPPER,
                                        static_cast<int64_t>(n_rows),
-                                       eig_vectors,
+                                       d_eig_vectors.data(),
                                        static_cast<int64_t>(n_cols),
                                        math_t(0.0),
                                        math_t(0.0),
                                        static_cast<int64_t>(n_cols - n_eig_vals + 1),
                                        static_cast<int64_t>(n_cols),
                                        &h_meig,
                                        eig_vals,
                                        d_work.data(),
                                        lwork,
                                        d_dev_info.data(),
                                        stream));
   }

Also applies to: 224-245, 260-265

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/detail/eig.cuh` around lines 202 - 205, The
COPY_INPUT path in eigSelDC is truncating from d_eig_vectors even though that
buffer is never populated in that branch, so the preserved-input workflow
returns garbage. Update eigSelDC so the branch that copies/solves into
eig_vectors also truncates from the populated buffer (or otherwise routes
COPY_INPUT output from eig_vectors), and keep the existing dry-run handling in
resource::get_dry_run_flag and the eigSelDC copy/solve logic consistent with the
selected memUsage path.

Comment on lines +152 to 154
if (resource::get_dry_run_flag(handle)) { return; }
RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
"Output is not of size ncols * n_unique_keys");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

CRITICAL: Output-shape check is a tautology.

d_A.extent(0) == d_A.extent(0) never validates d_sums.extent(0), so an undersized output matrix can pass this guard and the downstream kernel can write out of bounds.

Suggested fix
-  RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
+  RAFT_EXPECTS(d_sums.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
                "Output is not of size ncols * n_unique_keys");

As per path instructions, cpp/REVIEW_GUIDELINES.md says to prioritize “GPU/CUDA correctness” and “invalid memory access”. As per coding guidelines, public APIs should “validate their metadata wherever possible.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (resource::get_dry_run_flag(handle)) { return; }
RAFT_EXPECTS(d_A.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
"Output is not of size ncols * n_unique_keys");
if (resource::get_dry_run_flag(handle)) { return; }
RAFT_EXPECTS(d_sums.extent(0) == d_A.extent(0) && d_sums.extent(1) == n_unique_keys,
"Output is not of size ncols * n_unique_keys");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/linalg/reduce_rows_by_key.cuh` around lines 152 - 154, The
output-size guard in the reduce_rows_by_key path is incorrect because the first
extent check is a tautology and never validates the actual output buffer. Update
the RAFT_EXPECTS in reduce_rows_by_key.cuh to validate the row dimension of
d_sums against n_unique_keys (and keep the existing column check), using the
reduce_rows_by_key / d_sums / n_unique_keys symbols to locate the guard, so
undersized outputs are rejected before the kernel runs.

Sources: Coding guidelines, Path instructions

Comment on lines +278 to +307
// Query workspace size for sortPairs before dry-run check to track allocation
size_t workspace_size = 0;
cub::DeviceRadixSort::SortPairs(nullptr,
workspace_size,
expWts.data(),
sortedWts.data(),
inIdxPtr,
outIdxBuff.data(),
(int)len,
0,
sizeof(WeightsT) * 8,
stream);
rmm::device_uvector<char> workspace(workspace_size, stream);

if (resource::get_dry_run_flag(handle)) { return; }
kernel_dispatch<WeightsT, IdxT, 1, SamplingParams<WeightsT, IdxT>>(
expWts.data(), len, stream, params);
///@todo: use a more efficient partitioning scheme instead of full sort
// sort the array and pick the top sampledLen items
IdxT* outIdxPtr = outIdxBuff.data();
rmm::device_uvector<char> workspace(0, stream);
sortPairs(workspace, expWts.data(), sortedWts.data(), inIdxPtr, outIdxPtr, (int)len, stream);
cub::DeviceRadixSort::SortPairs(workspace.data(),
workspace_size,
expWts.data(),
sortedWts.data(),
inIdxPtr,
outIdxPtr,
(int)len,
0,
sizeof(WeightsT) * 8,
stream);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant code region and surrounding function.
sed -n '240,340p' cpp/include/raft/random/detail/rng_impl_deprecated.cuh

# Find how CUB SortPairs is handled elsewhere in the repo.
rg -n "DeviceRadixSort::SortPairs|RAFT_CUDA_TRY\\(cub::DeviceRadixSort::SortPairs|cub::DeviceRadixSort::SortPairs" cpp -g'*.{cuh,cu,h,hpp,cpp}'

Repository: rapidsai/raft

Length of output: 4827


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' cpp/include/raft/util/detail/cub_wrappers.cuh
printf '\n--- rng_impl.cuh ---\n'
sed -n '280,350p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n--- columnWiseSort excerpt ---\n'
sed -n '290,350p' cpp/include/raft/matrix/detail/columnWiseSort.cuh

Repository: rapidsai/raft

Length of output: 7088


CRITICAL: Check the CUB radix-sort calls

cub::DeviceRadixSort::SortPairs returns an error code here, but both the workspace-size query and the actual sort drop it. A failed query can produce a bogus workspace size, and a failed sort will silently corrupt the sampled output. Use RAFT_CUDA_TRY(...) on both calls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/random/detail/rng_impl_deprecated.cuh` around lines 278 -
307, The two cub::DeviceRadixSort::SortPairs calls in the RNG sampling path are
ignoring the returned error status, so failures in the workspace-size query or
the actual radix sort can go unnoticed; wrap both calls in RAFT_CUDA_TRY so
errors are surfaced immediately. Update the code in the sampling logic around
the workspace_size query and the later sort that uses workspace.data() and
outIdxBuff.data(), keeping the existing kernel_dispatch flow unchanged.

Sources: Coding guidelines, Path instructions

Comment on lines +309 to +339
// Query workspace size for sortPairs before dry-run check to track allocation
size_t workspace_size = 0;
cub::DeviceRadixSort::SortPairs(nullptr,
workspace_size,
expWts.data(),
sortedWts.data(),
inIdxPtr,
outIdxBuff.data(),
(int)len,
0,
sizeof(WeightsT) * 8,
stream);
rmm::device_uvector<char> workspace(workspace_size, stream);

if (dry_run) { return; }

RAFT_CALL_RNG_FUNC(rng_state, call_rng_kernel<1>, rng_state, stream, expWts.data(), len, params);

///@todo: use a more efficient partitioning scheme instead of full sort
// sort the array and pick the top sampledLen items
IdxT* outIdxPtr = outIdxBuff.data();
rmm::device_uvector<char> workspace(0, stream);
sortPairs(workspace, expWts.data(), sortedWts.data(), inIdxPtr, outIdxPtr, (int)len, stream);
cub::DeviceRadixSort::SortPairs(workspace.data(),
workspace_size,
expWts.data(),
sortedWts.data(),
inIdxPtr,
outIdxPtr,
(int)len,
0,
sizeof(WeightsT) * 8,
stream);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file around the cited regions and locate CUB calls in it.
git ls-files cpp/include/raft/random/detail/rng_impl.cuh
wc -l cpp/include/raft/random/detail/rng_impl.cuh
sed -n '280,360p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
sed -n '390,470p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
rg -n "Device(RadixSort|MergeSort|Select)::|RAFT_CUDA_TRY|cudaError_t" cpp/include/raft/random/detail/rng_impl.cuh

Repository: rapidsai/raft

Length of output: 7813


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the exact CUB APIs used and their return types in installed headers if present,
# and inspect local usage patterns around wrapper macros.
rg -n "namespace cub|DeviceRadixSort|DeviceMergeSort|DeviceSelect" /usr/include /usr/local/include 2>/dev/null | head -n 50
printf '\n--- local references ---\n'
rg -n "RAFT_CUDA_TRY\\(cub::Device|cub::Device.*SortPairs|cub::DeviceSelect::UniqueByKey" cpp -g'*.cuh' -g'*.cu' -g'*.hpp' -g'*.cpp'

Repository: rapidsai/raft

Length of output: 151


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file around the cited regions and locate CUB calls in it.
git ls-files cpp/include/raft/random/detail/rng_impl.cuh
wc -l cpp/include/raft/random/detail/rng_impl.cuh
sed -n '280,360p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
sed -n '390,470p' cpp/include/raft/random/detail/rng_impl.cuh
printf '\n---\n'
rg -n "Device(RadixSort|MergeSort|Select)::|RAFT_CUDA_TRY|cudaError_t" cpp/include/raft/random/detail/rng_impl.cuh

Repository: rapidsai/raft

Length of output: 7813


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the exact CUB APIs used and their return types in installed headers if present,
# and inspect local usage patterns around wrapper macros.
rg -n "namespace cub|DeviceRadixSort|DeviceMergeSort|DeviceSelect" /usr/include /usr/local/include 2>/dev/null | head -n 50
printf '\n--- local references ---\n'
rg -n "RAFT_CUDA_TRY\\(cub::Device|cub::Device.*SortPairs|cub::DeviceSelect::UniqueByKey" cpp -g'*.cuh' -g'*.cu' -g'*.hpp' -g'*.cpp'

Repository: rapidsai/raft

Length of output: 151


CRITICAL: Wrap the new CUB calls with RAFT_CUDA_TRY
cpp/include/raft/random/detail/rng_impl.cuh:311-330, 408-457
The new cub::DeviceRadixSort::SortPairs, cub::DeviceMergeSort::SortPairs, and cub::DeviceSelect::UniqueByKey calls ignore their return codes. A failed workspace query or execution can leave invalid workspace sizing or stale outputs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/random/detail/rng_impl.cuh` around lines 309 - 339, The new
CUB invocations in rng_impl.cuh are not checking failure status, so wrap each
`cub::DeviceRadixSort::SortPairs`, `cub::DeviceMergeSort::SortPairs`, and
`cub::DeviceSelect::UniqueByKey` call in `RAFT_CUDA_TRY` to surface both
workspace-query and execution errors. Update the affected logic in the RNG
helpers that perform sorting/uniquing, keeping the existing `dry_run` flow and
workspace allocation intact, but ensure any returned CUB error aborts before
stale buffers or invalid workspace sizes are used.

Sources: Coding guidelines, Path instructions

Comment on lines +312 to +315
size_t scan_ws_bytes = 0;
cub::DeviceScan::ExclusiveSum(
nullptr, scan_ws_bytes, sub_nnz.data(), sub_nnz.data(), sub_nnz_size + 1, stream);
rmm::device_uvector<char> scan_ws(scan_ws_bytes, stream);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the target file and inspect the relevant region.
ast-grep outline cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh --view expanded || true
wc -l cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh
sed -n '280,360p' cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh

Repository: rapidsai/raft

Length of output: 3628


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for the exact CUB usage and surrounding error handling patterns.
rg -n "DeviceScan::ExclusiveSum|RAFT_CUDA_TRY|cub::" cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh cpp -g '!**/build/**' -g '!**/dist/**'

Repository: rapidsai/raft

Length of output: 50369


CRITICAL: Wrap both cub::DeviceScan::ExclusiveSum calls

  • Have you considered guarding both calls with RAFT_CUDA_TRY? A failed workspace query or scan would otherwise be ignored and can leave sub_nnz/indptr inconsistent.
  • The first call is also before the dry-run check, so dry-run mode still performs CUDA work here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/sparse/convert/detail/bitmap_to_csr.cuh` around lines 312 -
315, Wrap both cub::DeviceScan::ExclusiveSum invocations in bitmap_to_csr.cuh
with RAFT_CUDA_TRY so both the workspace-size query and the actual scan are
checked for CUDA errors, and move or guard the initial workspace query so it is
skipped in dry-run mode; use the existing bitmap_to_csr conversion logic around
sub_nnz, indptr, and scan_ws to locate the fix.

Sources: Coding guidelines, Path instructions

Comment on lines +192 to +194
// Pre-allocate maximum distance buffer size (chunk * chunk) before dry-run guard
// to ensure allocation is tracked in dry-run mode and avoid reallocations in the loop
rmm::device_uvector<value_t> distances_buffer(chunk * chunk, stream);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | 🏗️ Heavy lift

CRITICAL: Reused distance buffer races across stream-pool work.

Issue: pairwise_distance writes the single distances_buffer via handle, while compute_chunked_a_b reads it on chunk_stream, and later iterations reuse the same buffer before sync_stream_pool().
Why: This can nondeterministically corrupt a/b and the final silhouette score.

Have you considered using one distance buffer per active stream/chunk, or enqueueing the distance computation and compute_chunked_a_b on the same ordered stream before reusing the buffer?

As per path instructions, dry-run guards must not accidentally affect stream lifecycle, and the developer guide requires internal-stream work to be ordered against resource::get_cuda_stream().

Also applies to: 222-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/stats/detail/batched/silhouette_score.cuh` around lines 192
- 194, The shared distances buffer in the silhouette score chunking flow is
being reused across concurrent stream-pool work, which can cause races between
`pairwise_distance`, `compute_chunked_a_b`, and later iterations before
`sync_stream_pool()`. Fix this by ensuring each active stream/chunk has its own
distance buffer, or by making the distance computation and `compute_chunked_a_b`
run on the same ordered stream before the buffer is reused; keep the dry-run
guard and stream lifecycle ordering intact in `compute_chunked_a_b` and the
surrounding chunk loop.

Source: Path instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants