[cccl.c]: Add `serialize()` and `deserialize()` functions to enable ahead-of-time compilation workflows by shwina · Pull Request #9568 · NVIDIA/cccl

shwina · 2026-06-23T16:08:18Z

Description

This PR adds cccl_<algo>_serialize and cccl_<algo>_deserialize APIs to all algorithms, which serialize/deserialize the <algo>_build_result_t structs into bytes. Callers can, e.g., write/read these bytes to disk to implement an ahead-of-time compilation workflow; or send bytes across the wire in a multi-node workflow.

Some notes:

These changes apply only to v1, v2 build structs have a totally different composition - we'll deal with that later.
The variable CCCL_C_PARALLEL_VERSION defined in aot.h represents the ABI version. Any changes in the ABI of the build structs should result in this version being bumped (currently set to 1).
There are probably clever ways to cut down on code duplication between all the implementations. I decided to keep things where they are given that v1 is on its way out.
I only added a test for round-tripping reduce_into. There will be a follow-up Python PR with more extensive testing.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

coderabbitai · 2026-06-23T16:23:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds experimental C AoT persistence for device build results: shared blob-format helpers and a free function, serialize/deserialize declarations, implementations for all algorithms, and benchmark/test coverage for round-trips and failures.

Changes

AOT build-result persistence

Layer / File(s)	Summary
Blob format and buffer API `c/parallel/include/cccl/c/aot.h`, `c/parallel/src/util/aot_serialize.h`, `c/parallel/src/aot.cpp`	Defines the experimental C AOT algorithm tags and version marker, the shared blob header and buffer reader/writer helpers, type-info serialization helpers, and the `cccl_aot_buffer_free` export.
Public C declarations `c/parallel/include/cccl/c/{binary_search.h,for.h,histogram.h,merge_sort.h,radix_sort.h,reduce.h,scan.h,segmented_reduce.h,segmented_sort.h,three_way_partition.h,transform.h,unique_by_key.h}`	Adds serialize/deserialize declarations across the public C headers, with output-buffer ownership, load sequencing, and failure-zeroing contracts.
Single-level round-trips `c/parallel/src/{for.cu,reduce.cu,scan.cu,transform.cu,three_way_partition.cu,unique_by_key.cu,segmented_reduce.cu}`	Implements serialize/deserialize pairs for the single-level algorithms, writing versioned blobs and restoring payload, runtime-policy, and lowered-name fields.
Nested round-trips `c/parallel/src/{binary_search.cu,histogram.cu,merge_sort.cu,radix_sort.cu,segmented_sort.cu}`	Implements nested blob handling for binary_search, histogram, merge_sort, radix_sort, and segmented_sort, including selector-op state and multiple embedded blobs.
Benchmarks and tests `c/parallel/test/test_bench_aot.cpp`, `c/parallel/test/test_reduce.cpp`	Adds AOT benchmark coverage for reduce, scan, and merge_sort, plus reduce cubin/LTOIR round-trip and negative deserialization tests.

Suggested reviewers

fbusato
NaderAlAwar

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 19

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (9)

c/parallel/src/histogram.cu-26-26 (1)

26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/merge_sort.cu-28-28 (1)

28-28: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/radix_sort.cu-26-26 (1)

26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/segmented_sort.cu-35-35 (1)

35-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/test/test_bench_aot.cpp-33-33 (1)

33-33: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

important: #include "test_util.h" should use angle-bracket include form in this codebase to satisfy repository header-inclusion policy.

As per coding guidelines, "All header inclusions must use the syntax <header>".

Source: Coding guidelines

python/cuda_cccl/tests/compute/test_merge_sort_aot.py-78-78 (1)

78-78: 🎯 Functional Correctness | 🟡 Minor

important: Line 78 uses np.argsort(..., stable=True), which requires NumPy 2.0.0+. Since cuda-cccl does not specify a minimum NumPy version, use kind="stable" instead, available since NumPy 1.15.0, to preserve stable ordering with broader compatibility.

c/parallel/src/segmented_reduce.cu-34-35 (1)

34-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: Use angle-bracket syntax for these local utility includes.

The changed include block should use <util/aot_serialize.h> and <util/nvjitlink.h>. As per coding guidelines, "All header inclusions must use the syntax <header>."

Source: Coding guidelines

python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)

1175-1177: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

important: Guard empty blobs before taking &view[0].

An empty .aot file currently raises IndexError in Cython before the native deserializer can return CUDA_ERROR_INVALID_VALUE. Check view.shape[0] == 0 before indexing, or pass NULL with size 0 and let the backend produce the consistent deserialize error.

Also applies to: 1541-1543, 1725-1727, 1905-1907, 2093-2095, 2287-2289, 2456-2458, 2566-2568, 2765-2767, 2932-2934, 3124-3126, 3318-3320

python/cuda_cccl/cuda/compute/algorithms/_scan.py-407-414 (1)

407-414: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

suggestion: Give load_inclusive_scan the same init_value=None default as make_inclusive_scan and inclusive_scan. The reconstruction path already handles None, but the public loader currently raises TypeError unless callers pass init_value=None explicitly for no-init inclusive scans.

🧹 Nitpick comments (3)

c/parallel/test/test_reduce.cpp (1)

972-974: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: The test says failure leaves the build zeroed, but it only checks build.payload. Add checks for other pointer fields (for example runtime_policy and single_tile_kernel_lowered_name) so partial-deserialize regressions are caught.

python/cuda_cccl/tests/compute/test_reduce_aot.py (1)

117-119: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: This test verifies the determinism field value but does not execute loaded. Add one execution call plus output assertion so _from_serialized path selection is validated behaviorally, not only via metadata.

python/cuda_cccl/tests/compute/bench_aot.py (1)

83-127: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Add lightweight correctness assertions in each timed path (reduce sum, scan boundary values, merge_sort first/last key). Without output checks, the benchmark can report latency for a semantically broken execution path.

Also applies to: 141-178, 197-241

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49794fee-65b5-440b-809a-a43c062878f3

📥 Commits

Reviewing files that changed from the base of the PR and between f5c350b and 56a7df5.

📒 Files selected for processing (57)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp
python/cuda_cccl/cuda/compute/__init__.py
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/algorithms/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/__init__.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

copy-pr-bot · 2026-06-23T17:22:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shwina · 2026-06-23T17:25:43Z

@coderabbitai full review

coderabbitai · 2026-06-23T17:25:50Z

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (3)

python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)
1175-1177: 🎯 Functional Correctness | 🟡 Minor

important: Guard &view[0] indexing in deserialize methods to preserve backend error handling.

Unconditional indexing of view[0] causes an IndexError when blob is empty (b""), preventing the backend deserializer from returning its intended error. Check buf_size first and pass NULL when zero:
Diff for lines 1175–1177 (and apply to all 12 deserialize methods)
         cdef const unsigned char[::1] view = blob
-        cdef const void* buf_ptr = <const void*>&view[0]
         cdef size_t buf_size = view.shape[0]
+        cdef const void* buf_ptr = NULL
+        if buf_size != 0:
+            buf_ptr = <const void*>&view[0]
Applies to all twelve deserialize() methods at lines 1176, 1542, 1726, 1906, 2094, 2288, 2457, 2567, 2766, 2933, 3125, 3319.

Source: Path instructions
c/parallel/src/util/aot_serialize.h-13-19 (1)

13-19: 📐 Maintainability & Code Quality | 🟡 Minor

important: add direct includes for <type_traits> and <cstddef>. Line 94 uses std::is_trivially_copyable_v and lines 81, 110, 119 use size_t without including the headers that define these symbols. Per coding guidelines, direct includes are required; relying on transitive includes is not allowed.

Source: Coding guidelines

c/parallel/src/transform.cu-39-39 (1)

39-39: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

important: use angle-bracket syntax for this project header include.

#include "util/aot_serialize.h" violates the repository rule that all header inclusions use <header> syntax; use <util/aot_serialize.h>. As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

🧹 Nitpick comments (6)

c/parallel/test/test_reduce.cpp (1)

972-973: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Strengthen failure-state assertions in the bad-blob test. Only checking build.payload can miss regressions where other owned pointers remain non-null after deserialize failure; also assert runtime_policy and lowered-name pointers are null to fully validate the zeroed-build contract.

As per path instructions, focus on C API error/status handling and lifetime ownership for c/**/*.

Source: Path instructions

python/cuda_cccl/tests/compute/bench_aot.py (1)

47-62: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Make the benchmark order-neutral to reduce cold-start bias. Running JIT first and AoT second in every trial systematically gives AoT the second-run advantage; run both orders (or alternate order across repetitions) and report median/mean per mode.

As per path instructions, focus on Python test/example correctness for python/cuda_cccl/**/*.

Source: Path instructions

python/cuda_cccl/cuda/compute/algorithms/_transform.py (1)

52-58: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Deserialize the blob first, then rebuild iterators/op.
Line 52 and Line 135 currently do iterator conversion and make_op_adapter(op).compile(...) before blob validation. Reordering to call _bindings.DeviceUnaryTransform.deserialize(blob) / _bindings.DeviceBinaryTransform.deserialize(blob) first avoids unnecessary work on invalid blobs and keeps the AOT load path leaner for callable ops. As per path instructions, python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.

Also applies to: 135-143

Source: Path instructions

python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py (1)

74-83: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Fail fast on blob parse before doing op compilation.
Line 74 initializes iterators/op state before DeviceUniqueByKeyBuildResult.deserialize(blob) at Line 82. Swapping this order makes corrupted-blob handling cheaper and avoids unnecessary compile/setup work in the deserialize path. As per path instructions, python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.

Source: Path instructions

c/parallel/src/util/aot_serialize.h (1)

30-32: 🗄️ Data Integrity & Integration | 🔵 Trivial | 🏗️ Heavy lift

important: make the blob format version algorithm-specific.

kFormatVersion is shared by every algorithm even though the comment and blob header describe a per-algorithm layout version. A future layout bump for one algorithm would reject saved blobs for every other algorithm; pass the expected/written version from each algorithm or explicitly document this as a global format version.

As per path instructions, focus on C API/ABI stability, error/status handling, lifetime ownership, runtime policy handling, build-result caching, C/C++ boundary behavior, and coverage for public C parallel APIs.

Also applies to: 240-267

Source: Path instructions

c/parallel/include/cccl/c/for.h (1)

81-85: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: document out-buffer ownership and deserialize failure semantics for this public C API. This declaration block currently omits the allocator-boundary contract (cccl_aot_buffer_free) and output-state guarantees on deserialize failure, which increases misuse risk for external C callers; mirror the reduce.h contract text here (and in the other newly added algorithm headers for consistency). As per path instructions, “Focus on C API/ABI stability, error/status handling, lifetime ownership, [and] C/C++ boundary behavior.”

Source: Path instructions

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ccb178f0-abaa-425c-a224-a7ebbc1cf703

📥 Commits

Reviewing files that changed from the base of the PR and between 3726b6c and ad6c8b1.

📒 Files selected for processing (54)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

python/cuda_cccl/tests/compute/bench_aot.py (1)

47-63: 🎯 Functional Correctness | 🔵 Trivial

Consider alternating benchmark order to eliminate JIT warm-up bias. The clear_all_caches() calls only clear Python-level function caches, not the CUDA runtime's module cache. Since JIT runs first (line 52), any compiled kernels and driver state remain in memory for the AoT path (line 59), potentially lowering AoT timing. Run both order sequences (JIT→AoT and AoT→JIT) and report median, or deserialize in isolated subprocesses for true cold-start measurement.

Source: Path instructions

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a594ceb0-c175-4514-b3f6-b767133bf1dc

📥 Commits

Reviewing files that changed from the base of the PR and between ad6c8b1 and 08c3d03.

📒 Files selected for processing (54)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

✅ Files skipped from review due to trivial changes (1)

c/parallel/include/cccl/c/scan.h

🚧 Files skipped from review as they are similar to previous changes (49)

python/cuda_cccl/tests/compute/test_merge_sort_aot.py
c/parallel/src/aot.cpp
c/parallel/include/cccl/c/for.h
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/three_way_partition.h
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
python/cuda_cccl/tests/compute/test_transform_aot.py
c/parallel/include/cccl/c/radix_sort.h
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
c/parallel/include/cccl/c/reduce.h
c/parallel/src/for.cu
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
c/parallel/include/cccl/c/transform.h
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/radix_sort.cu
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/src/segmented_sort.cu
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
c/parallel/src/reduce.cu
c/parallel/src/unique_by_key.cu
python/cuda_cccl/tests/compute/test_scan_aot.py
c/parallel/src/merge_sort.cu
c/parallel/src/binary_search.cu
c/parallel/src/util/aot_serialize.h
c/parallel/src/scan.cu
c/parallel/src/transform.cu
c/parallel/src/segmented_reduce.cu
c/parallel/test/test_reduce.cpp
c/parallel/src/three_way_partition.cu
c/parallel/src/histogram.cu
python/cuda_cccl/cuda/compute/_bindings_impl.pyx

coderabbitai

Actionable comments posted: 4

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d7560161-8bed-43e8-8952-5e8ca57c4952

📥 Commits

Reviewing files that changed from the base of the PR and between 08c3d03 and ca828a4.

📒 Files selected for processing (59)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp
python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/_jit.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/op.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

✅ Files skipped from review due to trivial changes (3)

python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi

🚧 Files skipped from review as they are similar to previous changes (49)

c/parallel/include/cccl/c/three_way_partition.h
c/parallel/src/aot.cpp
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
c/parallel/include/cccl/c/reduce.h
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
c/parallel/include/cccl/c/segmented_reduce.h
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/binary_search.h
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/unique_by_key.h
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
c/parallel/include/cccl/c/transform.h
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
c/parallel/src/binary_search.cu
c/parallel/include/cccl/c/merge_sort.h
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
c/parallel/include/cccl/c/radix_sort.h
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
c/parallel/src/segmented_reduce.cu
c/parallel/src/scan.cu
c/parallel/src/reduce.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/radix_sort.cu
python/cuda_cccl/cuda/compute/_bindings.pyi
c/parallel/include/cccl/c/scan.h
c/parallel/src/segmented_sort.cu
c/parallel/src/histogram.cu
c/parallel/test/test_reduce.cpp
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
c/parallel/src/merge_sort.cu
c/parallel/src/transform.cu

shwina · 2026-06-24T18:56:09Z

@coderabbitai review

coderabbitai · 2026-06-24T18:56:16Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

♻️ Duplicate comments (2)

c/parallel/test/test_bench_aot.cpp (2)

73-89: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

important: WITH_TEMP frees _d_tmp only on the success path; a failing REQUIRE after cudaMalloc skips cudaFree, leaking device memory into subsequent tests. Use an RAII guard for the temp allocation.

122-210: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

important: build-result handles (link_build/full_build/jit_build) are cleaned up only on the success path; any failing REQUIRE between compile/load and _cleanup leaks payload/library/device resources. Wrap each handle in a scope guard so cleanup always runs.

Also applies to: 237-331, 361-455.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3294930-3ff1-4cd6-9e0d-e4afabd394c4

📥 Commits

Reviewing files that changed from the base of the PR and between ca828a4 and c7719f8.

📒 Files selected for processing (59)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp
python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/cuda/compute/_bindings.pyi
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
python/cuda_cccl/cuda/compute/_bindings_impl.pyx
python/cuda_cccl/cuda/compute/_jit.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/op.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/tests/compute/test_binary_search_aot.py
python/cuda_cccl/tests/compute/test_histogram_aot.py
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/tests/compute/test_transform_aot.py
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

✅ Files skipped from review due to trivial changes (4)

c/parallel/src/aot.cpp
python/cuda_cccl/tests/compute/test_scan_aot.py
python/cuda_cccl/tests/compute/bench_aot.py
python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi

🚧 Files skipped from review as they are similar to previous changes (53)

python/cuda_cccl/cuda/compute/op.py
python/cuda_cccl/tests/compute/test_radix_sort_aot.py
c/parallel/include/cccl/c/scan.h
python/cuda_cccl/CMakeLists.txt
python/cuda_cccl/tests/compute/test_binary_search_aot.py
c/parallel/include/cccl/c/three_way_partition.h
python/cuda_cccl/tests/compute/test_histogram_aot.py
c/parallel/include/cccl/c/segmented_sort.h
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/transform.h
python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
python/cuda_cccl/cuda/compute/algorithms/_histogram.py
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/radix_sort.h
python/cuda_cccl/tests/compute/test_merge_sort_aot.py
c/parallel/include/cccl/c/merge_sort.h
python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
python/cuda_cccl/cuda/compute/algorithms/_transform.py
python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
c/parallel/src/for.cu
python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
c/parallel/src/merge_sort.cu
python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
python/cuda_cccl/tests/compute/test_reduce_aot.py
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/histogram.h
python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
python/cuda_cccl/cuda/compute/algorithms/_scan.py
python/cuda_cccl/cuda/compute/algorithms/_reduce.py
c/parallel/src/scan.cu
python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
python/cuda_cccl/cuda/compute/_jit.py
c/parallel/src/util/aot_serialize.h
c/parallel/src/transform.cu
c/parallel/src/segmented_sort.cu
python/cuda_cccl/tests/compute/test_transform_aot.py
c/parallel/src/binary_search.cu
c/parallel/src/reduce.cu
c/parallel/src/radix_sort.cu
c/parallel/src/histogram.cu
c/parallel/src/three_way_partition.cu
c/parallel/test/test_reduce.cpp
c/parallel/src/unique_by_key.cu
python/cuda_cccl/cuda/compute/_bindings.pyi
c/parallel/src/segmented_reduce.cu
python/cuda_cccl/cuda/compute/_bindings_impl.pyx

shwina · 2026-06-24T19:52:58Z

Addressing the minor/nitpick comments from the review summaries:

aot_serialize.h — missing <type_traits> / <cstddef>
Fixed in commit 2c68ef6 — added both headers directly.

Angle-bracket vs. quote include syntax (histogram.cu, merge_sort.cu, radix_sort.cu, segmented_reduce.cu, segmented_sort.cu, transform.cu, test_bench_aot.cpp)
False positive for this PR. The project convention is "..." for project-local headers and <...> for installed/external headers — established by pre-existing includes like "util/context.h", "util/errors.h", "util/nvjitlink.h" in the same files. The new "util/aot_serialize.h" includes follow that convention exactly. A broader style-cleanup (unifying on <> everywhere) is a separate concern.

_bindings_impl.pyx:1175-1177 — &view[0] guard on empty blob
False positive for the current code. The AoT deserialization code was moved from _bindings_impl.pyx into _bindings_aot_v1.pxi during this PR. That file already guards before indexing: if buf_size == 0: raise ... precedes cdef const void* buf_ptr = <const void*>&view[0] in every deserialize function.

test_merge_sort_aot.py:78 — np.argsort(..., stable=True) requires NumPy 2.0
False positive for this PR. stable=True is the existing project convention — it appears in 20+ pre-existing test files (test_merge_sort.py, test_radix_sort.py, test_segmented_sort.py, etc.). The new test is consistent with those.

_scan.py:407-414 — load_inclusive_scan missing init_value=None default
Stale comment — this function no longer exists in the current file (the file is 389 lines; the function was present in an earlier commit reviewed by CodeRabbit).

for.h:81-85 — document out-buffer ownership / deserialize semantics
Fixed in commit 262c509, which updated the cccl_device_for_deserialize doc comment to document the LTOIR→link_ltoir prerequisite and the failure semantics.

test_reduce.cpp:972-974 / test_reduce_aot.py:117-119 — strengthen failure-state assertions / add execution call
Valid test-quality improvements; deferred to a follow-up.

bench_aot.py — benchmark order neutrality, correctness assertions
Valid observations; the benchmark is an informal timing script. Deferred to a follow-up.

_transform.py:52-58 / _unique_by_key.py:74-83 — deserialize-first ordering
Valid micro-optimization; deferred to a follow-up.

aot_serialize.h:30-32 — per-algorithm kFormatVersion
Valid forward-compat observation. In practice, CCCL_VERSION (which bumps with every release) is mixed into the ABI hash alongside the algorithm name, so real-world layout breaks are caught at version boundaries. True per-algorithm versioning would be finer-grained; deferred to a follow-up.

shwina · 2026-06-26T11:03:12Z

+// Version of the CCCL C parallel AoT blob format. Increment whenever a
+// breaking change is made to the C parallel layer (build_result layouts,
+// serialization wire format, etc.) independently of CCCL_VERSION.
+#define CCCL_C_PARALLEL_VERSION 1


If we change the ABI of a build result, this needs to change.

We should start thinking about semantic versioning for the C library.

Are you introducing this now to account for loading a build struct from a previous version?

shwina · 2026-06-26T12:48:03Z

@coderabbitai review

coderabbitai · 2026-06-26T12:48:09Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 13

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbc0ef38-9a40-4993-a4bd-b157d2e01fc9

📥 Commits

Reviewing files that changed from the base of the PR and between c7719f8 and 99ae64e.

📒 Files selected for processing (29)

c/parallel/include/cccl/c/aot.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/transform.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/src/aot.cpp
c/parallel/src/binary_search.cu
c/parallel/src/for.cu
c/parallel/src/histogram.cu
c/parallel/src/merge_sort.cu
c/parallel/src/radix_sort.cu
c/parallel/src/reduce.cu
c/parallel/src/scan.cu
c/parallel/src/segmented_reduce.cu
c/parallel/src/segmented_sort.cu
c/parallel/src/three_way_partition.cu
c/parallel/src/transform.cu
c/parallel/src/unique_by_key.cu
c/parallel/src/util/aot_serialize.h
c/parallel/test/test_bench_aot.cpp
c/parallel/test/test_reduce.cpp

✅ Files skipped from review due to trivial changes (1)

c/parallel/src/aot.cpp

🚧 Files skipped from review as they are similar to previous changes (13)

c/parallel/include/cccl/c/merge_sort.h
c/parallel/include/cccl/c/histogram.h
c/parallel/include/cccl/c/for.h
c/parallel/include/cccl/c/reduce.h
c/parallel/include/cccl/c/segmented_reduce.h
c/parallel/include/cccl/c/radix_sort.h
c/parallel/include/cccl/c/binary_search.h
c/parallel/include/cccl/c/three_way_partition.h
c/parallel/include/cccl/c/scan.h
c/parallel/include/cccl/c/unique_by_key.h
c/parallel/include/cccl/c/segmented_sort.h
c/parallel/include/cccl/c/transform.h
c/parallel/test/test_reduce.cpp

NaderAlAwar · 2026-06-26T19:52:56Z

+// Version of the CCCL C parallel AoT blob format. Increment whenever a
+// breaking change is made to the C parallel layer (build_result layouts,
+// serialization wire format, etc.) independently of CCCL_VERSION.
+#define CCCL_C_PARALLEL_VERSION 1


Are you introducing this now to account for loading a build struct from a previous version?

NaderAlAwar · 2026-06-26T20:03:22Z

+// allocated with new[] (matching cccl_aot_buffer_free, which does delete[]).
+class buffer_writer
+{
+  std::vector<char> data_;


Nit: we typically don't use the trailing underscore pattern

NaderAlAwar · 2026-06-26T20:03:27Z

+  const char* p_;
+  size_t remaining_;


Nit: we typically don't use the trailing underscore pattern

NaderAlAwar · 2026-06-26T20:09:38Z

+  std::unique_ptr<char[]> n_reduction{r.read_cstring_dup()};
+  std::unique_ptr<char[]> n_nondeterministic{r.read_cstring_dup()};
+
+  build_ptr->cc                                     = static_cast<int>(h.cc);


Important: this starts populating the caller-provided build result without first zeroing it or assigning through a local zero-initialized result. The public header says deserialized CUlibrary/CUkernel handles remain null until load and that failure leaves the build zeroed, but this path never clears nonserialized fields like library/single_tile_kernel/reduction_kernel, and the catch path returns without zeroing build_ptr. We should either update the comment or implement the behavior it currently describes. Same applies to all algorithms

NaderAlAwar · 2026-06-26T20:11:31Z

+    throw std::runtime_error("aot blob: empty payload");
+  }
+
+  std::unique_ptr<cub::detail::reduce::policy_selector, decltype(&std::free)> policy(


Critical: this ignores the serialized determinism value and always reconstructs runtime_policy as cub::detail::reduce::policy_selector. For CCCL_NOT_GUARANTEED, serialization stores a cub::detail::reduce_nondeterministic::policy_selector, and cccl_device_reduce_nondeterministic later casts runtime_policy back to that nondeterministic type.

NaderAlAwar · 2026-06-26T20:12:43Z

+  uint64_t code_n = r.read_pod<uint64_t>();
+  if (code_n > 0)
+  {
+    void* p = std::malloc(static_cast<size_t>(code_n));


Important: this raw allocation can leak if a later read or validation throws. For example, r.read_bytes at line 1110 can throw before p is assigned into op.code, the state-size check can throw after op.code is assigned, and the state read at line 1134 can throw after both allocations exist. Please hold code/state in local RAII owners with std::free deleters and release them into cccl_op_t only after the entire selector op has been successfully decoded

NaderAlAwar · 2026-06-26T20:14:32Z

+
+  const auto key_t    = read_type_info(r);
+  const auto offset_t = read_type_info(r);
+  const auto order    = static_cast<cccl_sort_order_t>(r.read_pod<uint32_t>());


Suggestion: in radix_sort we validate that the sort order is valid (either ASCENDING or DESCENDING) after reading it. For consistency we should do the same here

…ithms Builds on the compile/load split (NVIDIA#8484) by adding cccl_device_<algo>_serialize / _deserialize for every C parallel algorithm, plus a versioned on-disk blob format: * cccl/c/aot.h - algorithm tags + cccl_aot_buffer_free * util/aot_serialize.h - blob_header (magic + ABI hash + payload kind), FNV-1a ABI hashing, bounds-checked reader/writer A blob stores the compiled payload, runtime policy and lowered kernel names; CUlibrary/CUkernel handles are reconstructed on load. The header's magic / algo tag / format version / ABI hash are validated on deserialize so stale or mismatched artifacts are rejected. Lifetime fixes: deserialize runtime_policy uses malloc (matching cleanup); serialize early-return paths zero *out_buf/*out_size. Review feedback: - Document cccl_aot_buffer_free ownership in all serialize/deserialize declarations (was already in reduce.h/scan.h, now in all 12 headers). - Reject empty payloads in deserialize (crafted zero-length payload blob would deserialize to CUDA_SUCCESS with payload==nullptr). - Empty byte input to Cython deserialize wrappers now raises RuntimeError before the unsafe &view[0] index rather than IndexError. - assert -> ValueError for d_in_values/d_out_values mismatch in _MergeSort. Reduce gains serialize/deserialize round-trip + bad-blob tests, and the AoT first-execution latency benchmark (test_bench_aot.cpp) is added, guarded to the v1 (NVRTC) backend. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…JIT skip Deserialize catch handlers were calling memset(build_ptr, 0) without first releasing any pre-existing resources in build_ptr. If the caller passed a previously-populated result, its payload/library/kernel-name allocations would be leaked. Add the matching cleanup call before the memset in every *_deserialize catch handler (10 algorithms). Also fix _MergeSort.deserialize in Python: was calling compile() instead of compile_for_load(), which triggered a full JIT recompile on load and defeated the purpose of AoT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add cccl_device_transform_cleanup(build_ptr) before std::memset in the cccl_device_transform_deserialize catch handler (same fix applied to the other 10 algorithms in the previous commit) - Document in cccl_device_for_deserialize that LTOIR payloads require cccl_device_for_link_ltoir before cccl_device_for_load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

These headers are used directly (std::is_trivially_copyable_v, size_t) and should not rely on transitive includes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Groups of fixes, all in the C parallel layer: Group 1 — Commit-on-success for deserialize (all 12 algos): - Remove success-path memset(build_ptr) — contract is caller passes zeroed output; the memset was redundant on success and harmful if build_ptr carried existing owned data. - Strip cleanup+memset from catch blocks — on failure build_ptr is left unchanged, avoiding destruction of the caller's valid result. - binary_search: use a local temp result{} and commit *build_ptr = result after cccl_device_transform_deserialize succeeds, so the nested call cannot leave *build_ptr half-written on error. Group 2 — Serialize output deterministic on failure (6 algos): - Zero *out_buf/*out_size before the first throwing operation in serialize so any exception leaves the outputs null/zero rather than whatever the caller passed in. Group 3 — Input validation: - for.cu serialize: reject unsupported payload_kind and null/empty static_kernel_lowered_name before writing the blob header. - for.cu deserialize: validate the read kernel name is non-empty. - binary_search.cu deserialize: validate state_align is a non-zero power-of-two when state_size > 0. - radix_sort.cu deserialize: reject unrecognized sort-order enum values. - segmented_sort.cu deserialize: validate selector-op state blob length against op.size before allocating. - transform.cu serialize+deserialize: validate runtime_policy_size against the two supported policy_selector layouts. Group 4 — Test RAII (test_bench_aot.cpp): - WITH_TEMP macro: wrap _d_tmp in a _CudaFreeGuard struct so device memory is released even when an intermediate REQUIRE throws. - Build results: add BuildGuard<T,Cleanup> RAII template; wrap link_build / full_build / jit_build in guards for both reduce and scan sections so cleanup runs on REQUIRE failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…egmented_sort, sort order validation - reduce.cu: branch on determinism when allocating/reading runtime_policy so CCCL_NOT_GUARANTEED blobs use reduce_nondeterministic::policy_selector (the load path casts back to that type); also use local result{} and commit-on-success so unset fields (library, kernel handles) are zero - segmented_sort.cu: hold code/state malloc in unique_ptr<void,free> inside deserialize_selector_op so a throw during read_bytes doesn't leak the allocation; add sort-order enum validation for consistency with radix_sort - aot_serialize.h: remove trailing underscores from private member names (data_, p_, remaining_) per project style Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lize functions Use a local zero-initialized result{} and assign *build_ptr = result at the end of each deserialize function so that load-only fields (CUlibrary, CUkernel handles) are always zero in the returned struct, regardless of what the caller passed in. Applies to: scan, transform, three_way_partition, unique_by_key, histogram, merge_sort, segmented_reduce, radix_sort, segmented_sort. (binary_search, for, and reduce already used this pattern.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-29T03:10:47Z

🥳 CI Workflow Results

🟩 Finished in 1h 17m: Pass: 100%/20 | Total: 5h 32m | Max: 51m 55s | Hits: 82%/2034

See results here.

shwina requested review from a team as code owners June 23, 2026 16:08

shwina requested review from NaderAlAwar and kkraus14 June 23, 2026 16:08

github-project-automation Bot added this to CCCL Jun 23, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 23, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

shwina force-pushed the pr/aot-python-layer branch from 56a7df5 to dd3980b Compare June 23, 2026 17:22

shwina force-pushed the pr/aot-python-layer branch from dd3980b to ad6c8b1 Compare June 23, 2026 17:26

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

shwina force-pushed the pr/aot-python-layer branch from ad6c8b1 to 08c3d03 Compare June 23, 2026 20:53

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread c/parallel/test/test_bench_aot.cpp Outdated

This comment has been minimized.

Sign in to view

shwina force-pushed the pr/aot-python-layer branch from 08c3d03 to ca828a4 Compare June 24, 2026 18:13

shwina requested a review from a team as a code owner June 24, 2026 18:13

shwina requested a review from wmaxey June 24, 2026 18:13

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread c/parallel/include/cccl/c/for.h Outdated

Comment thread c/parallel/src/for.cu Outdated

Comment thread c/parallel/src/for.cu

Comment thread python/cuda_cccl/cuda/compute/op.py Outdated

shwina force-pushed the pr/aot-python-layer branch from ca828a4 to c7719f8 Compare June 24, 2026 18:50

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

shwina force-pushed the pr/aot-python-layer branch from 262c509 to 29a5dfb Compare June 24, 2026 19:47

This comment has been minimized.

Sign in to view

shwina force-pushed the pr/aot-python-layer branch from bc079ac to dc94025 Compare June 25, 2026 17:22

shwina changed the title ~~[cuda.compute]: Add .save() and .load() methods to enable ahead-of-time compilation workflows~~ [cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows Jun 25, 2026

This comment has been minimized.

Sign in to view

shwina force-pushed the pr/aot-python-layer branch from dc94025 to 99ae64e Compare June 25, 2026 20:09

This comment has been minimized.

Sign in to view

shwina commented Jun 26, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

shwina force-pushed the pr/aot-python-layer branch from 99ae64e to 8636ee0 Compare June 26, 2026 14:23

This comment has been minimized.

Sign in to view

NaderAlAwar reviewed Jun 26, 2026

View reviewed changes

shwina and others added 9 commits June 28, 2026 21:51

[cccl.c] Add missing <type_traits> and <cstddef> to aot_serialize.h

5eda56b

These headers are used directly (std::is_trivially_copyable_v, size_t) and should not rely on transitive includes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove abi hash stuff - rely on a central ABI version

db8d570

Remove test_bench_aot.cpp

244366b

shwina force-pushed the pr/aot-python-layer branch from 9b5d575 to 172604e Compare June 29, 2026 01:51

Uh oh!

Conversation

shwina commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 23, 2026

Uh oh!

shwina commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment has been minimized.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shwina commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

shwina commented Jun 24, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shwina commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

shwina commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading