Skip to content
Draft
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
c6e880d
Add multi-segment GPU search: cuvsSelectK C API and Java bindings
jamxia155 Mar 31, 2026
390adec
Add parallel multi-segment CAGRA search with async memory resource
jamxia155 Apr 2, 2026
92dbcf9
Fix CudaStreamPool races and reduce multi-segment search overhead
jamxia155 Apr 2, 2026
82dcf71
Default initialize CAGRA search parameters
jamxia155 Apr 13, 2026
fad76cb
Enable persistent CAGRA kernel to be shared across index segments
jamxia155 Apr 13, 2026
22a2c8d
java: parallelize persistent-mode segment searches in MultiSegmentCag…
jamxia155 Apr 14, 2026
69c4771
Add native multi-segment CAGRA search and per-resources workspace pool
jamxia155 Apr 16, 2026
49e5a14
java: add DataType.HALF (float16) support to CuVSMatrix
jamxia155 Apr 20, 2026
29751c7
Fix errors after merge
jamxia155 Apr 23, 2026
2fffcb8
Remove dead code
jamxia155 Apr 23, 2026
d284bd0
Remove remnants of experimental code
jamxia155 May 1, 2026
1e012c0
Load libcudart.so automatically when libcuvs_c.so uses static CUDA
jamxia155 May 11, 2026
235fb32
Add multi-segment CAGRA search with bitset prefilter support
jamxia155 May 16, 2026
86a4f70
Add prefiltered ground truth generation to cuvs_bench
jamxia155 May 18, 2026
21bd700
Check for sufficient itopk_size
jamxia155 May 20, 2026
e1954b3
Update nomenclature to multi-partition
jamxia155 May 20, 2026
fdb025c
Refactor multi-partition CAGRA search to return merged global top-k
jamxia155 May 21, 2026
12d7f55
Allow multi-partition CAGRA search topk to exceed per-partition itopk…
jamxia155 May 30, 2026
7f479b8
Revert "Allow multi-partition CAGRA search topk to exceed per-partiti…
jamxia155 Jun 1, 2026
6b4e34f
Route multi-partition CAGRA search by params.algo
jamxia155 Jun 2, 2026
cf679b9
Enable MULTI_KERNEL in multi-partition CAGRA search
jamxia155 Jun 2, 2026
37a65d7
Tighten SINGLE_CTA multi-partition feasibility check
jamxia155 Jun 3, 2026
56e99ee
Add partition-aware MULTI_KERNEL expansion kernels
jamxia155 Jun 3, 2026
aa93362
Add unit tests for partition-aware MULTI_KERNEL kernels
jamxia155 Jun 3, 2026
4db0941
Fuse MULTI_KERNEL multi-partition into a single search call
jamxia155 Jun 5, 2026
876fb4d
Revert MULTI_KERNEL multi-partition work
jamxia155 Jun 5, 2026
f071508
Tighten SINGLE_CTA multi-partition feasibility check
jamxia155 Jun 5, 2026
efc7283
Generalize multi-partition post-processing for variable per-partition…
jamxia155 Jun 8, 2026
837767d
Add multi-partition MULTI_CTA search path
jamxia155 Jun 8, 2026
4b423ea
Gate multi-partition AUTO on (query, partition) CTA count
jamxia155 Jun 9, 2026
54a55f5
Merge remote-tracking branch 'origin/main' into multi-segment-cagra-s…
jamxia155 Jun 10, 2026
134f899
Updates to snapshotRepository and plugins
jolorunyomi Jun 25, 2026
0aa4b03
Resolve cudart symbols via FFI lookup for static CUDA builds
jamxia155 Jun 25, 2026
b1c3e12
Rename public CAGRA multi-partition search overloads to `search`
jamxia155 Jun 25, 2026
b35575f
Sync build.sh VERSION to 26.08.0-SNAPSHOT to match the pom
jamxia155 Jun 25, 2026
0c41987
Extract shared search_core for MULTI_CTA CAGRA search
jamxia155 Jun 26, 2026
22e337c
Formatting
jamxia155 Jun 29, 2026
301f5ef
Remove code remnants
jamxia155 Jun 29, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions c/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,7 @@ add_library(
src/preprocessing/quantize/pq.cpp
src/preprocessing/quantize/scalar.cpp
src/distance/pairwise_distance.cpp
src/selection/select_k.cpp
)
add_library(cuvs::c_api ALIAS cuvs_c)
set_target_properties(
Expand Down
29 changes: 29 additions & 0 deletions c/include/cuvs/core/c_api.h
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,22 @@ cuvsError_t cuvsStreamSync(cuvsResources_t res);
*/
cuvsError_t cuvsDeviceIdGet(cuvsResources_t res, int* device_id);

/**
* @brief Configure the temporary workspace on this resources object as an uncapped pool, backed
* by the current device memory resource. After the initial reservation is allocated on
* first use, subsequent calls to cuvsRMMAlloc / cuvsRMMFree on the same resources handle
* hit the pool cache rather than calling cudaMallocAsync / cudaFreeAsync, reducing CUDA
* context lock contention under concurrent query threads. The pool grows without shrinking:
* freed allocations are returned to the pool rather than to the device, so the pool's
* high-water mark only increases until the resources object is destroyed.
*
* @param[in] res cuvsResources_t opaque C handle
* @param[in] initial_size_bytes initial pool reservation in bytes; size to cover the
* steady-state working set to avoid growth after warmup
* @return cuvsError_t
*/
cuvsError_t cuvsResourcesSetWorkspacePool(cuvsResources_t res, size_t initial_size_bytes);

/**
* @brief Create an Initialized opaque C handle for C++ type `raft::device_resources_snmg`
* for multi-GPU operations
Expand Down Expand Up @@ -210,6 +226,19 @@ cuvsError_t cuvsRMMFree(cuvsResources_t res, void* ptr, size_t bytes);
cuvsError_t cuvsRMMPoolMemoryResourceEnable(int initial_pool_size_percent,
int max_pool_size_percent,
bool managed);
/**
* @brief Switches the working memory resource to use stream-ordered asynchronous allocation
* (cudaMallocAsync / cudaFreeAsync). Unlike the pool resource, this resource returns memory to
* the stream immediately without blocking the CPU, eliminating device-wide synchronization on
* deallocation. This is especially beneficial when multiple CAGRA searches run concurrently on
* separate CUDA streams, because the internal workspace allocations no longer serialize kernel
* launches. Be aware that this function will change the memory resource for the whole process
* and the new memory resource will be used until explicitly changed.
*
* @return cuvsError_t
*/
cuvsError_t cuvsRMMAsyncMemoryResourceEnable();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "c_api.cpp" -path "*/c/src/core/*" | head -5

Repository: rapidsai/cuvs

Length of output: 79


🏁 Script executed:

cat -n c/src/core/c_api.cpp | sed -n '160,195p'

Repository: rapidsai/cuvs

Length of output: 1665


🏁 Script executed:

cd c/src/core && grep -n "thread_local" c_api.cpp

Repository: rapidsai/cuvs

Length of output: 269


🏁 Script executed:

cd c/include/cuvs/core && grep -n "cuvsRMMAsyncMemoryResourceEnable" c_api.h -A 15

Repository: rapidsai/cuvs

Length of output: 580


🏁 Script executed:

# Search for RMM set_current_device_resource usage patterns in the codebase
rg "set_current_device_resource" --type cpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 2101


🏁 Script executed:

# Check if there are any docs changes or migration guides in the PR
find . -name "*.md" -o -name "*.rst" | xargs grep -l "RMM\|AsyncMemory\|cuvsRMM" 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 200


🏁 Script executed:

# Check for any deprecation warnings or migration documentation
find . -type f \( -name "*.h" -o -name "*.md" \) -path "*/docs/*" | xargs grep -l "cuvsRMM\|memory.*resource" 2>/dev/null | head -5

Repository: rapidsai/cuvs

Length of output: 89


🏁 Script executed:

# Check if there are any API change notes or migration guides in the docs
cat docs/source/api_basics.rst | head -100

Repository: rapidsai/cuvs

Length of output: 2518


🏁 Script executed:

# Check CHANGELOG for any RMM API changes mentioned
rg "Async|AsyncMemory|cuvsRMMAsync" CHANGELOG.md -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if reset_current_device_resource clears the async_mr
cat -n c/src/core/c_api.cpp | sed -n '192,210p'

Repository: rapidsai/cuvs

Length of output: 784


🏁 Script executed:

# Check how pool memory resource is handled - it passes rvalues
# Let's see if there's any documentation on RMM's set_current_device_resource behavior
rg "set_current_device_resource" --type h | head -20

Repository: rapidsai/cuvs

Length of output: 331


🏁 Script executed:

# Look for RMM in vendored code or requirements to understand its semantics
find . -name "*.txt" -o -name "*.cmake" | xargs grep -i "rmm" | head -10

Repository: rapidsai/cuvs

Length of output: 779


🏁 Script executed:

# Check if there's any documentation update about async memory resource being process-wide
rg "cuvsRMMAsyncMemoryResourceEnable" docs/ 2>/dev/null || echo "No docs match found"

Repository: rapidsai/cuvs

Length of output: 76


🏁 Script executed:

# Let's check how the pool resource implementation differs from async
# The pool passes a rvalue (temporary), while async passes a pointer to thread_local
# Search for any notes on RMM memory resource ownership in comments
rg "ownership|lifetime|thread" c/src/core/c_api.cpp | head -20

Repository: rapidsai/cuvs

Length of output: 275


🏁 Script executed:

# Check if cuvsRMMPoolMemoryResourceEnable has similar issues by checking if it's thread_local or static
rg "pool_mr" c/src/core/c_api.cpp -B 5 -A 5

Repository: rapidsai/cuvs

Length of output: 665


🏁 Script executed:

# Let's check the RMM version used and look for any behavior documentation
rg "rmm" vcpkg.json conanfile.txt requirements.txt 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if there are any thread safety comments or warnings in the C API header
cat -n c/include/cuvs/core/c_api.h | sed -n '220,250p'

Repository: rapidsai/cuvs

Length of output: 1728


The async-memory resource owner cannot be thread_local when this API changes the current resource globally.

The implementation at c/src/core/c_api.cpp:188 stores cuda_async_memory_resource in thread_local async_mr and passes it to rmm::mr::set_current_device_resource(), but the documentation explicitly states this function "will change the memory resource for the whole process" (line 235). This creates a critical lifetime mismatch:

  • If set_current_device_resource() is device-scoped (affecting all threads), then when the enabling thread exits, its thread_local async_mr is destroyed while still registered as the current resource, leaving RMM with a dangling pointer.
  • If thread-local semantics were intended, the documentation must be updated to reflect that this only affects the calling thread.

The pool resource avoids this issue by passing temporary rvalues to set_current_device_resource(), allowing RMM to manage the lifetime. Either make async_mr process/device-scoped (not thread_local), or narrow the documentation and implementation to clarify thread-local semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/include/cuvs/core/c_api.h` around lines 229 - 240, The current
implementation of cuvsRMMAsyncMemoryResourceEnable stores
cuda_async_memory_resource in a thread_local async_mr and passes it to
rmm::mr::set_current_device_resource(), creating a lifetime mismatch vs the API
doc that says the change is global; change the implementation so the async
memory resource is process-scoped (not thread_local) by replacing thread_local
async_mr with a static/process-global instance (e.g., a static unique_ptr or
static object) so it outlives the thread and remains valid for
rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.


/**
* @brief Resets the memory resource to use the default memory resource (cuda_memory_resource)
* @return cuvsError_t
Expand Down
28 changes: 28 additions & 0 deletions c/include/cuvs/neighbors/cagra.h
Original file line number Diff line number Diff line change
Expand Up @@ -712,6 +712,34 @@ cuvsError_t cuvsCagraSearch(cuvsResources_t res,
DLManagedTensor* distances,
cuvsFilter filter);

/**
* @brief Search multiple CAGRA index segments concurrently using a single GPU kernel launch.
*
* Launches a single kernel with grid (1, num_queries, num_segments) so each CTA handles one
* (query, segment) pair concurrently. All results land in the caller-supplied device buffers
* on the same CUDA stream, so downstream operations (e.g. selectK) see them via stream ordering
* with no explicit synchronization needed.
*
* Only float32 datasets are currently supported. Distance values are comparable across segments
* (same scale) but are not postprocessed (no kScale correction) — they are suitable for
* relative comparison (selectK / recall).
*
* @param[in] res cuvsResources_t opaque C handle
* @param[in] params search parameters
* @param[in] num_segments number of index segments
* @param[in] indices array of num_segments cuvsCagraIndex_t pointers
* @param[in] queries array of num_segments DLManagedTensor* (device, float32, [nq, dim])
* @param[out] neighbors array of num_segments DLManagedTensor* (device, uint32, [nq, topk])
* @param[out] distances array of num_segments DLManagedTensor* (device, float32, [nq, topk])
*/
cuvsError_t cuvsCagraSearchMultiSegment(cuvsResources_t res,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to align on nomenclature a bit, I wonder if we can think of a more general name. Maybe "Partition"? Segment is pretty closely coupled to databases, and more specifically to LSM-based databases, but cuVS the library is more general that that. cuVS is at the level of "hash partitioning" or "blind sharding" (those are the terms we tend to use in this context). I think "MultiPartition" would be a more fitting name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligning to "partition" for now. FYI, also considered: "MultiShard", "MultiIndex", "Federated", but these might come with unintended connotations.

cuvsCagraSearchParams_t params,
uint32_t num_segments,
cuvsCagraIndex_t* indices,
DLManagedTensor** queries,
DLManagedTensor** neighbors,
DLManagedTensor** distances);

/**
* @}
*/
Expand Down
37 changes: 37 additions & 0 deletions c/include/cuvs/selection/select_k.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION.
* SPDX-License-Identifier: Apache-2.0
*/
#pragma once

#include <cuvs/core/c_api.h>
#include <dlpack/dlpack.h>

#ifdef __cplusplus
extern "C" {
#endif

/**
* @brief Select the k smallest values from a flat device array of n candidates.
*
* Treats `in_val` as a matrix of shape [1, n] and selects the `k` smallest
* float values. `out_idx` receives the int64 column positions of the selected
* values in [0, n), so the caller can recover per-segment identity as:
*
* segment_index = out_idx[j] / segment_k
* position_in_segment = out_idx[j] % segment_k
*
* @param[in] res cuvsResources_t handle
* @param[in] in_val DLManagedTensor* shape [1, n], float32, device memory
* @param[out] out_val DLManagedTensor* shape [1, k], float32, device memory
* @param[out] out_idx DLManagedTensor* shape [1, k], int64, device memory
* @return cuvsError_t
*/
cuvsError_t cuvsSelectK(cuvsResources_t res,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is great. I was just working on code examples for the new docs and realized we only have a C++ API for select_k. It'll be great to get the C APis, and later on the Python and other language wrappers for select-k.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the refactoring prompted by your other comment, select-k is no longer needed for this work. Leaving the C API intact in case it might be useful to others.

DLManagedTensor* in_val,
DLManagedTensor* out_val,
DLManagedTensor* out_idx);

#ifdef __cplusplus
}
#endif
38 changes: 33 additions & 5 deletions c/src/core/c_api.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,13 @@
#include <raft/core/device_resources_snmg.hpp>
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resource/device_id.hpp>
#include <raft/core/resource/device_memory_resource.hpp>
#include <raft/core/resource/resource_types.hpp>
#include <raft/core/resources.hpp>
#include <raft/util/cudart_utils.hpp>
#include <rapids_logger/logger.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/cuda_memory_resource.hpp>
#include <rmm/mr/managed_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>
Expand All @@ -35,6 +37,19 @@ extern "C" cuvsError_t cuvsResourcesCreate(cuvsResources_t* res)
});
}

extern "C" cuvsError_t cuvsResourcesSetWorkspacePool(cuvsResources_t res, size_t initial_size_bytes)
{
return cuvs::core::translate_exceptions([=] {
auto res_ptr = reinterpret_cast<raft::resources*>(res);
// Create an uncapped pool: pre-warms with initial_size_bytes to avoid cudaMalloc on every
// query, but can grow beyond that if an allocation exceeds the initial reservation.
raft::resource::set_workspace_resource(
*res_ptr,
rmm::mr::pool_memory_resource{rmm::mr::get_current_device_resource_ref(),
initial_size_bytes});
});
}

extern "C" cuvsError_t cuvsResourcesDestroy(cuvsResources_t res)
{
return cuvs::core::translate_exceptions([=] {
Expand Down Expand Up @@ -132,20 +147,22 @@ extern "C" cuvsError_t cuvsRMMAlloc(cuvsResources_t res, void** ptr, size_t byte
{
return cuvs::core::translate_exceptions([=] {
auto res_ptr = reinterpret_cast<raft::resources*>(res);
auto mr = rmm::mr::get_current_device_resource_ref();
*ptr = mr.allocate(raft::resource::get_cuda_stream(*res_ptr), bytes);
auto stream = raft::resource::get_cuda_stream(*res_ptr);
*ptr = raft::resource::get_workspace_resource_ref(*res_ptr).allocate(stream, bytes);
});
}

extern "C" cuvsError_t cuvsRMMFree(cuvsResources_t res, void* ptr, size_t bytes)
{
return cuvs::core::translate_exceptions([=] {
auto res_ptr = reinterpret_cast<raft::resources*>(res);
auto mr = rmm::mr::get_current_device_resource_ref();
mr.deallocate(raft::resource::get_cuda_stream(*res_ptr), ptr, bytes);
auto stream = raft::resource::get_cuda_stream(*res_ptr);
raft::resource::get_workspace_resource_ref(*res_ptr).deallocate(stream, ptr, bytes);
});
}

thread_local std::shared_ptr<rmm::mr::cuda_async_memory_resource> async_mr;

extern "C" cuvsError_t cuvsRMMPoolMemoryResourceEnable(int initial_pool_size_percent,
int max_pool_size_percent,
bool managed)
Expand All @@ -164,9 +181,20 @@ extern "C" cuvsError_t cuvsRMMPoolMemoryResourceEnable(int initial_pool_size_per
});
}

extern "C" cuvsError_t cuvsRMMAsyncMemoryResourceEnable()
{
return cuvs::core::translate_exceptions([=] {
async_mr = std::make_shared<rmm::mr::cuda_async_memory_resource>();
rmm::mr::set_current_device_resource(*async_mr);
});
}

extern "C" cuvsError_t cuvsRMMMemoryResourceReset()
{
return cuvs::core::translate_exceptions([=] { rmm::mr::reset_current_device_resource(); });
return cuvs::core::translate_exceptions([=] {
rmm::mr::reset_current_device_resource();
async_mr.reset();
});
}

thread_local std::unique_ptr<rmm::mr::pinned_host_memory_resource> pinned_mr;
Expand Down
48 changes: 48 additions & 0 deletions c/src/neighbors/cagra.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -689,6 +689,54 @@ extern "C" cuvsError_t cuvsCagraSearch(cuvsResources_t res,
});
}

extern "C" cuvsError_t cuvsCagraSearchMultiSegment(cuvsResources_t res,
cuvsCagraSearchParams_t params,
uint32_t num_segments,
cuvsCagraIndex_t* indices,
DLManagedTensor** queries,
DLManagedTensor** neighbors,
DLManagedTensor** distances)
{
return cuvs::core::translate_exceptions([=] {
RAFT_EXPECTS(num_segments > 0, "num_segments must be > 0");
RAFT_EXPECTS(indices != nullptr && queries != nullptr && neighbors != nullptr &&
distances != nullptr,
"All pointer arrays must be non-null");

auto res_ptr = reinterpret_cast<raft::resources*>(res);
auto search_params = cuvs::neighbors::cagra::search_params();
convert_c_search_params(*params, &search_params);

// Only float32 is supported for multi-segment search.
RAFT_EXPECTS(
indices[0]->dtype.code == kDLFloat && indices[0]->dtype.bits == 32,
"Multi-segment search only supports float32 indices");

using T = float;
using IdxT = uint32_t;
using OutIdxT = uint32_t;
using DistanceT = float;
using IndexT = cuvs::neighbors::cagra::index<T, IdxT>;

std::vector<const IndexT*> idx_vec(num_segments);
std::vector<raft::device_matrix_view<const T, int64_t, raft::row_major>> q_vec(num_segments);
std::vector<raft::device_matrix_view<OutIdxT, int64_t, raft::row_major>> n_vec(num_segments);
std::vector<raft::device_matrix_view<DistanceT, int64_t, raft::row_major>> d_vec(num_segments);

for (uint32_t i = 0; i < num_segments; i++) {
RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
"Index at position %u is null or not built", i);
idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate every segment before the reinterpret_cast.

Only indices[0] is checked for float32. The loop then blindly casts every indices[i]->addr to index<float, uint32_t>* and builds float32 device views from queries[i], neighbors[i], and distances[i]. If any later segment is half/int8/uint8 or any tensor is not device-backed, this path will feed the kernel mismatched types and can produce garbage results or GPU faults. Please mirror the per-input dtype/device/null checks from cuvsCagraSearch for each segment before from_dlpack.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 710 - 733, The loop currently casts
every indices[i]->addr and builds device views without validating each segment's
types/devices; update the loop that fills idx_vec, q_vec, n_vec, d_vec to
perform the same per-segment checks used in cuvsCagraSearch: assert indices[i]
!= nullptr and addr != 0 (already present), then verify indices[i]->dtype.code
== kDLFloat && indices[i]->dtype.bits == 32 for every i, and validate that
queries[i], neighbors[i], and distances[i] are device-backed DLPack tensors with
the expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.


cuvs::neighbors::cagra::search_multi_segment(
*res_ptr, search_params, idx_vec, q_vec, n_vec, d_vec);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject mixed-distance metrics across segments.

This API combines raw distances from all segments into one global ranking, so all indices must use the same metric. Right now nothing checks that indices[i] matches indices[0] on metric, which means mixing L2/IP/Cosine segments will return nonsensical top-k ordering even though the header says the distances are comparable across segments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 726 - 736, The loop that builds
idx_vec/q_vec/n_vec/d_vec must also validate that all segment indices use the
same distance metric as the first segment to prevent mixing incompatible
metrics; inside the for-loop in cagra.cpp (after the existing RAFT_EXPECTS that
checks indices[i] non-null) compare indices[i]->metric (or the actual metric
field name on your IndexT struct/class) to indices[0]->metric and fail fast with
RAFT_EXPECTS (or equivalent) and a clear message like "Mixed distance metrics
across segments: expected %s but got %s at segment %u"; keep this check before
pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.

});
}

extern "C" cuvsError_t cuvsCagraMerge(cuvsResources_t res,
cuvsCagraIndexParams_t params,
cuvsCagraIndex_t* indices,
Expand Down
42 changes: 42 additions & 0 deletions c/src/selection/select_k.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION.
* SPDX-License-Identifier: Apache-2.0
*/

#include <cuvs/core/c_api.h>
#include "../core/exceptions.hpp"
#include <cuvs/selection/select_k.hpp>
#include <dlpack/dlpack.h>

#include <raft/core/device_mdspan.hpp>
#include <raft/core/resources.hpp>

extern "C" cuvsError_t cuvsSelectK(cuvsResources_t res,
DLManagedTensor* in_val,
DLManagedTensor* out_val,
DLManagedTensor* out_idx)
{
return cuvs::core::translate_exceptions([=] {
auto* res_ptr = reinterpret_cast<raft::resources*>(res);

int64_t n = in_val->dl_tensor.shape[1];
int64_t k = out_val->dl_tensor.shape[1];

auto in_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
static_cast<const float*>(in_val->dl_tensor.data), 1, n);

auto out_val_view = raft::make_device_matrix_view<float, int64_t, raft::row_major>(
static_cast<float*>(out_val->dl_tensor.data), 1, k);

auto out_idx_view = raft::make_device_matrix_view<int64_t, int64_t, raft::row_major>(
static_cast<int64_t*>(out_idx->dl_tensor.data), 1, k);

cuvs::selection::select_k(
*res_ptr,
in_view,
std::nullopt, // implicit positions [0, n) as in_idx
out_val_view,
out_idx_view,
true); // select_min = true (smallest distance = nearest neighbor)
Comment on lines +14 to +40

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate the DLPack contract before dereferencing shape[1] and casting buffers.

cuvsSelectK currently assumes non-null 2D CUDA tensors with float32 / float32 / int64 dtypes, zero offset, and contiguous row-major layout. Without checking ndim, shape, device, dtype, byte_offset, and stride compatibility, malformed callers can crash here or feed corrupted views into select_k.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/selection/select_k.cpp` around lines 14 - 40, cuvsSelectK dereferences
shape[1] and casts buffers without validating the DLPack tensors; add explicit
validation at the top of cuvsSelectK for in_val, out_val, out_idx (non-null),
then check each DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows
match expected 1 or compatible), dtype (in_val/out_val float32, out_idx int64),
device type (CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.

});
}
Loading
Loading