Skip to content

[cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows#9568

Open
shwina wants to merge 9 commits into
NVIDIA:mainfrom
shwina:pr/aot-python-layer
Open

[cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows#9568
shwina wants to merge 9 commits into
NVIDIA:mainfrom
shwina:pr/aot-python-layer

Conversation

@shwina

@shwina shwina commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description

This PR adds cccl_<algo>_serialize and cccl_<algo>_deserialize APIs to all algorithms, which serialize/deserialize the <algo>_build_result_t structs into bytes. Callers can, e.g., write/read these bytes to disk to implement an ahead-of-time compilation workflow; or send bytes across the wire in a multi-node workflow.

Some notes:

  1. These changes apply only to v1, v2 build structs have a totally different composition - we'll deal with that later.
  2. The variable CCCL_C_PARALLEL_VERSION defined in aot.h represents the ABI version. Any changes in the ABI of the build structs should result in this version being bumped (currently set to 1).
  3. There are probably clever ways to cut down on code duplication between all the implementations. I decided to keep things where they are given that v1 is on its way out.
  4. I only added a test for round-tripping reduce_into. There will be a follow-up Python PR with more extensive testing.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested review from a team as code owners June 23, 2026 16:08
@shwina shwina requested review from NaderAlAwar and kkraus14 June 23, 2026 16:08
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 23, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds experimental C AoT persistence for device build results: shared blob-format helpers and a free function, serialize/deserialize declarations, implementations for all algorithms, and benchmark/test coverage for round-trips and failures.

Changes

AOT build-result persistence

Layer / File(s) Summary
Blob format and buffer API
c/parallel/include/cccl/c/aot.h, c/parallel/src/util/aot_serialize.h, c/parallel/src/aot.cpp
Defines the experimental C AOT algorithm tags and version marker, the shared blob header and buffer reader/writer helpers, type-info serialization helpers, and the cccl_aot_buffer_free export.
Public C declarations
c/parallel/include/cccl/c/{binary_search.h,for.h,histogram.h,merge_sort.h,radix_sort.h,reduce.h,scan.h,segmented_reduce.h,segmented_sort.h,three_way_partition.h,transform.h,unique_by_key.h}
Adds serialize/deserialize declarations across the public C headers, with output-buffer ownership, load sequencing, and failure-zeroing contracts.
Single-level round-trips
c/parallel/src/{for.cu,reduce.cu,scan.cu,transform.cu,three_way_partition.cu,unique_by_key.cu,segmented_reduce.cu}
Implements serialize/deserialize pairs for the single-level algorithms, writing versioned blobs and restoring payload, runtime-policy, and lowered-name fields.
Nested round-trips
c/parallel/src/{binary_search.cu,histogram.cu,merge_sort.cu,radix_sort.cu,segmented_sort.cu}
Implements nested blob handling for binary_search, histogram, merge_sort, radix_sort, and segmented_sort, including selector-op state and multiple embedded blobs.
Benchmarks and tests
c/parallel/test/test_bench_aot.cpp, c/parallel/test/test_reduce.cpp
Adds AOT benchmark coverage for reduce, scan, and merge_sort, plus reduce cubin/LTOIR round-trip and negative deserialization tests.

Suggested reviewers

  • fbusato
  • NaderAlAwar

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (9)
c/parallel/src/histogram.cu-26-26 (1)

26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/merge_sort.cu-28-28 (1)

28-28: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/radix_sort.cu-26-26 (1)

26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/src/segmented_sort.cu-35-35 (1)

35-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: use angle-bracket include syntax for the new util header.

As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

c/parallel/test/test_bench_aot.cpp-33-33 (1)

33-33: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

important: #include "test_util.h" should use angle-bracket include form in this codebase to satisfy repository header-inclusion policy.

As per coding guidelines, "All header inclusions must use the syntax <header>".

Source: Coding guidelines

python/cuda_cccl/tests/compute/test_merge_sort_aot.py-78-78 (1)

78-78: 🎯 Functional Correctness | 🟡 Minor

important: Line 78 uses np.argsort(..., stable=True), which requires NumPy 2.0.0+. Since cuda-cccl does not specify a minimum NumPy version, use kind="stable" instead, available since NumPy 1.15.0, to preserve stable ordering with broader compatibility.

c/parallel/src/segmented_reduce.cu-34-35 (1)

34-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

suggestion: Use angle-bracket syntax for these local utility includes.

The changed include block should use <util/aot_serialize.h> and <util/nvjitlink.h>. As per coding guidelines, "All header inclusions must use the syntax <header>."

Source: Coding guidelines

python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)

1175-1177: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

important: Guard empty blobs before taking &view[0].

An empty .aot file currently raises IndexError in Cython before the native deserializer can return CUDA_ERROR_INVALID_VALUE. Check view.shape[0] == 0 before indexing, or pass NULL with size 0 and let the backend produce the consistent deserialize error.

Also applies to: 1541-1543, 1725-1727, 1905-1907, 2093-2095, 2287-2289, 2456-2458, 2566-2568, 2765-2767, 2932-2934, 3124-3126, 3318-3320

python/cuda_cccl/cuda/compute/algorithms/_scan.py-407-414 (1)

407-414: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

suggestion: Give load_inclusive_scan the same init_value=None default as make_inclusive_scan and inclusive_scan. The reconstruction path already handles None, but the public loader currently raises TypeError unless callers pass init_value=None explicitly for no-init inclusive scans.

🧹 Nitpick comments (3)
c/parallel/test/test_reduce.cpp (1)

972-974: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: The test says failure leaves the build zeroed, but it only checks build.payload. Add checks for other pointer fields (for example runtime_policy and single_tile_kernel_lowered_name) so partial-deserialize regressions are caught.

python/cuda_cccl/tests/compute/test_reduce_aot.py (1)

117-119: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: This test verifies the determinism field value but does not execute loaded. Add one execution call plus output assertion so _from_serialized path selection is validated behaviorally, not only via metadata.

python/cuda_cccl/tests/compute/bench_aot.py (1)

83-127: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Add lightweight correctness assertions in each timed path (reduce sum, scan boundary values, merge_sort first/last key). Without output checks, the benchmark can report latency for a semantically broken execution path.

Also applies to: 141-178, 197-241


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49794fee-65b5-440b-809a-a43c062878f3

📥 Commits

Reviewing files that changed from the base of the PR and between f5c350b and 56a7df5.

📒 Files selected for processing (57)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/cuda/compute/__init__.py
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/algorithms/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/__init__.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

Comment thread c/parallel/include/cccl/c/transform.h
Comment thread c/parallel/src/binary_search.cu Outdated
Comment thread c/parallel/src/for.cu Outdated
Comment thread c/parallel/src/histogram.cu
Comment thread c/parallel/src/merge_sort.cu
Comment thread c/parallel/test/test_bench_aot.cpp Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_binary_search.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_reduce.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_scan.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py Outdated
@shwina shwina force-pushed the pr/aot-python-layer branch from 56a7df5 to dd3980b Compare June 23, 2026 17:22
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shwina

shwina commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@shwina shwina force-pushed the pr/aot-python-layer branch from dd3980b to ad6c8b1 Compare June 23, 2026 17:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (3)
python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)

1175-1177: 🎯 Functional Correctness | 🟡 Minor

important: Guard &view[0] indexing in deserialize methods to preserve backend error handling.

Unconditional indexing of view[0] causes an IndexError when blob is empty (b""), preventing the backend deserializer from returning its intended error. Check buf_size first and pass NULL when zero:

Diff for lines 1175–1177 (and apply to all 12 deserialize methods)
         cdef const unsigned char[::1] view = blob
-        cdef const void* buf_ptr = <const void*>&view[0]
         cdef size_t buf_size = view.shape[0]
+        cdef const void* buf_ptr = NULL
+        if buf_size != 0:
+            buf_ptr = <const void*>&view[0]

Applies to all twelve deserialize() methods at lines 1176, 1542, 1726, 1906, 2094, 2288, 2457, 2567, 2766, 2933, 3125, 3319.

Source: Path instructions

c/parallel/src/util/aot_serialize.h-13-19 (1)

13-19: 📐 Maintainability & Code Quality | 🟡 Minor

important: add direct includes for <type_traits> and <cstddef>. Line 94 uses std::is_trivially_copyable_v and lines 81, 110, 119 use size_t without including the headers that define these symbols. Per coding guidelines, direct includes are required; relying on transitive includes is not allowed.

Source: Coding guidelines

c/parallel/src/transform.cu-39-39 (1)

39-39: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

important: use angle-bracket syntax for this project header include.

#include "util/aot_serialize.h" violates the repository rule that all header inclusions use <header> syntax; use <util/aot_serialize.h>. As per coding guidelines, “All header inclusions must use the syntax <header>.”

Source: Coding guidelines

🧹 Nitpick comments (6)
c/parallel/test/test_reduce.cpp (1)

972-973: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

suggestion: Strengthen failure-state assertions in the bad-blob test. Only checking build.payload can miss regressions where other owned pointers remain non-null after deserialize failure; also assert runtime_policy and lowered-name pointers are null to fully validate the zeroed-build contract.

As per path instructions, focus on C API error/status handling and lifetime ownership for c/**/*.

Source: Path instructions

python/cuda_cccl/tests/compute/bench_aot.py (1)

47-62: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Make the benchmark order-neutral to reduce cold-start bias. Running JIT first and AoT second in every trial systematically gives AoT the second-run advantage; run both orders (or alternate order across repetitions) and report median/mean per mode.

As per path instructions, focus on Python test/example correctness for python/cuda_cccl/**/*.

Source: Path instructions

python/cuda_cccl/cuda/compute/algorithms/_transform.py (1)

52-58: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Deserialize the blob first, then rebuild iterators/op.
Line 52 and Line 135 currently do iterator conversion and make_op_adapter(op).compile(...) before blob validation. Reordering to call _bindings.DeviceUnaryTransform.deserialize(blob) / _bindings.DeviceBinaryTransform.deserialize(blob) first avoids unnecessary work on invalid blobs and keeps the AOT load path leaner for callable ops. As per path instructions, python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.

Also applies to: 135-143

Source: Path instructions

python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py (1)

74-83: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

suggestion: Fail fast on blob parse before doing op compilation.
Line 74 initializes iterators/op state before DeviceUniqueByKeyBuildResult.deserialize(blob) at Line 82. Swapping this order makes corrupted-blob handling cheaper and avoids unnecessary compile/setup work in the deserialize path. As per path instructions, python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.

Source: Path instructions

c/parallel/src/util/aot_serialize.h (1)

30-32: 🗄️ Data Integrity & Integration | 🔵 Trivial | 🏗️ Heavy lift

important: make the blob format version algorithm-specific.

kFormatVersion is shared by every algorithm even though the comment and blob header describe a per-algorithm layout version. A future layout bump for one algorithm would reject saved blobs for every other algorithm; pass the expected/written version from each algorithm or explicitly document this as a global format version.

As per path instructions, focus on C API/ABI stability, error/status handling, lifetime ownership, runtime policy handling, build-result caching, C/C++ boundary behavior, and coverage for public C parallel APIs.

Also applies to: 240-267

Source: Path instructions

c/parallel/include/cccl/c/for.h (1)

81-85: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: document out-buffer ownership and deserialize failure semantics for this public C API. This declaration block currently omits the allocator-boundary contract (cccl_aot_buffer_free) and output-state guarantees on deserialize failure, which increases misuse risk for external C callers; mirror the reduce.h contract text here (and in the other newly added algorithm headers for consistency). As per path instructions, “Focus on C API/ABI stability, error/status handling, lifetime ownership, [and] C/C++ boundary behavior.”

Source: Path instructions


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ccb178f0-abaa-425c-a224-a7ebbc1cf703

📥 Commits

Reviewing files that changed from the base of the PR and between 3726b6c and ad6c8b1.

📒 Files selected for processing (54)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py

Comment thread c/parallel/include/cccl/c/binary_search.h
Comment thread c/parallel/include/cccl/c/histogram.h
Comment thread c/parallel/include/cccl/c/merge_sort.h
Comment thread c/parallel/include/cccl/c/radix_sort.h
Comment thread c/parallel/include/cccl/c/three_way_partition.h
Comment thread c/parallel/src/unique_by_key.cu
Comment thread python/cuda_cccl/cuda/compute/algorithms/_binary_search.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_scan.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py Outdated
Comment thread python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py Outdated
@shwina shwina force-pushed the pr/aot-python-layer branch from ad6c8b1 to 08c3d03 Compare June 23, 2026 20:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
python/cuda_cccl/tests/compute/bench_aot.py (1)

47-63: 🎯 Functional Correctness | 🔵 Trivial

Consider alternating benchmark order to eliminate JIT warm-up bias. The clear_all_caches() calls only clear Python-level function caches, not the CUDA runtime's module cache. Since JIT runs first (line 52), any compiled kernels and driver state remain in memory for the AoT path (line 59), potentially lowering AoT timing. Run both order sequences (JIT→AoT and AoT→JIT) and report median, or deserialize in isolated subprocesses for true cold-start measurement.

Source: Path instructions


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a594ceb0-c175-4514-b3f6-b767133bf1dc

📥 Commits

Reviewing files that changed from the base of the PR and between ad6c8b1 and 08c3d03.

📒 Files selected for processing (54)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (1)
  • c/parallel/include/cccl/c/scan.h
🚧 Files skipped from review as they are similar to previous changes (49)
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • c/parallel/src/aot.cpp
  • c/parallel/include/cccl/c/for.h
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • c/parallel/include/cccl/c/radix_sort.h
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/src/for.cu
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • c/parallel/include/cccl/c/transform.h
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/radix_sort.cu
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/src/segmented_sort.cu
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • c/parallel/src/reduce.cu
  • c/parallel/src/unique_by_key.cu
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/binary_search.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/src/scan.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/test/test_reduce.cpp
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/histogram.cu
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx

Comment thread c/parallel/test/test_bench_aot.cpp Outdated
@github-actions

This comment has been minimized.

@shwina shwina force-pushed the pr/aot-python-layer branch from 08c3d03 to ca828a4 Compare June 24, 2026 18:13
@shwina shwina requested a review from a team as a code owner June 24, 2026 18:13
@shwina shwina requested a review from wmaxey June 24, 2026 18:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d7560161-8bed-43e8-8952-5e8ca57c4952

📥 Commits

Reviewing files that changed from the base of the PR and between 08c3d03 and ca828a4.

📒 Files selected for processing (59)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/_jit.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/op.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (3)
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
🚧 Files skipped from review as they are similar to previous changes (49)
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/src/aot.cpp
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • c/parallel/include/cccl/c/reduce.h
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • c/parallel/include/cccl/c/segmented_reduce.h
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/binary_search.h
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • c/parallel/include/cccl/c/transform.h
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • c/parallel/src/binary_search.cu
  • c/parallel/include/cccl/c/merge_sort.h
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • c/parallel/include/cccl/c/radix_sort.h
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/radix_sort.cu
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/histogram.cu
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/transform.cu

Comment thread c/parallel/include/cccl/c/for.h Outdated
Comment thread c/parallel/src/for.cu Outdated
Comment thread c/parallel/src/for.cu
Comment thread python/cuda_cccl/cuda/compute/op.py Outdated
@shwina shwina force-pushed the pr/aot-python-layer branch from ca828a4 to c7719f8 Compare June 24, 2026 18:50
@shwina

shwina commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
c/parallel/test/test_bench_aot.cpp (2)

73-89: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

important: WITH_TEMP frees _d_tmp only on the success path; a failing REQUIRE after cudaMalloc skips cudaFree, leaking device memory into subsequent tests. Use an RAII guard for the temp allocation.


122-210: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

important: build-result handles (link_build/full_build/jit_build) are cleaned up only on the success path; any failing REQUIRE between compile/load and _cleanup leaks payload/library/device resources. Wrap each handle in a scope guard so cleanup always runs.

Also applies to: 237-331, 361-455.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3294930-3ff1-4cd6-9e0d-e4afabd394c4

📥 Commits

Reviewing files that changed from the base of the PR and between ca828a4 and c7719f8.

📒 Files selected for processing (59)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx
  • python/cuda_cccl/cuda/compute/_jit.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/op.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (4)
  • c/parallel/src/aot.cpp
  • python/cuda_cccl/tests/compute/test_scan_aot.py
  • python/cuda_cccl/tests/compute/bench_aot.py
  • python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
🚧 Files skipped from review as they are similar to previous changes (53)
  • python/cuda_cccl/cuda/compute/op.py
  • python/cuda_cccl/tests/compute/test_radix_sort_aot.py
  • c/parallel/include/cccl/c/scan.h
  • python/cuda_cccl/CMakeLists.txt
  • python/cuda_cccl/tests/compute/test_binary_search_aot.py
  • c/parallel/include/cccl/c/three_way_partition.h
  • python/cuda_cccl/tests/compute/test_histogram_aot.py
  • c/parallel/include/cccl/c/segmented_sort.h
  • python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
  • python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/transform.h
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_histogram.py
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/radix_sort.h
  • python/cuda_cccl/tests/compute/test_merge_sort_aot.py
  • c/parallel/include/cccl/c/merge_sort.h
  • python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
  • python/cuda_cccl/cuda/compute/algorithms/_transform.py
  • python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
  • c/parallel/src/for.cu
  • python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
  • c/parallel/src/merge_sort.cu
  • python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
  • python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
  • python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
  • python/cuda_cccl/tests/compute/test_reduce_aot.py
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/histogram.h
  • python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
  • python/cuda_cccl/cuda/compute/algorithms/_scan.py
  • python/cuda_cccl/cuda/compute/algorithms/_reduce.py
  • c/parallel/src/scan.cu
  • python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
  • python/cuda_cccl/cuda/compute/_jit.py
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/src/transform.cu
  • c/parallel/src/segmented_sort.cu
  • python/cuda_cccl/tests/compute/test_transform_aot.py
  • c/parallel/src/binary_search.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/test/test_reduce.cpp
  • c/parallel/src/unique_by_key.cu
  • python/cuda_cccl/cuda/compute/_bindings.pyi
  • c/parallel/src/segmented_reduce.cu
  • python/cuda_cccl/cuda/compute/_bindings_impl.pyx

@shwina shwina force-pushed the pr/aot-python-layer branch from 262c509 to 29a5dfb Compare June 24, 2026 19:47
@shwina

shwina commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Addressing the minor/nitpick comments from the review summaries:

aot_serialize.h — missing <type_traits> / <cstddef>
Fixed in commit 2c68ef6 — added both headers directly.

Angle-bracket vs. quote include syntax (histogram.cu, merge_sort.cu, radix_sort.cu, segmented_reduce.cu, segmented_sort.cu, transform.cu, test_bench_aot.cpp)
False positive for this PR. The project convention is "..." for project-local headers and <...> for installed/external headers — established by pre-existing includes like "util/context.h", "util/errors.h", "util/nvjitlink.h" in the same files. The new "util/aot_serialize.h" includes follow that convention exactly. A broader style-cleanup (unifying on <> everywhere) is a separate concern.

_bindings_impl.pyx:1175-1177&view[0] guard on empty blob
False positive for the current code. The AoT deserialization code was moved from _bindings_impl.pyx into _bindings_aot_v1.pxi during this PR. That file already guards before indexing: if buf_size == 0: raise ... precedes cdef const void* buf_ptr = <const void*>&view[0] in every deserialize function.

test_merge_sort_aot.py:78np.argsort(..., stable=True) requires NumPy 2.0
False positive for this PR. stable=True is the existing project convention — it appears in 20+ pre-existing test files (test_merge_sort.py, test_radix_sort.py, test_segmented_sort.py, etc.). The new test is consistent with those.

_scan.py:407-414load_inclusive_scan missing init_value=None default
Stale comment — this function no longer exists in the current file (the file is 389 lines; the function was present in an earlier commit reviewed by CodeRabbit).

for.h:81-85 — document out-buffer ownership / deserialize semantics
Fixed in commit 262c509, which updated the cccl_device_for_deserialize doc comment to document the LTOIR→link_ltoir prerequisite and the failure semantics.

test_reduce.cpp:972-974 / test_reduce_aot.py:117-119 — strengthen failure-state assertions / add execution call
Valid test-quality improvements; deferred to a follow-up.

bench_aot.py — benchmark order neutrality, correctness assertions
Valid observations; the benchmark is an informal timing script. Deferred to a follow-up.

_transform.py:52-58 / _unique_by_key.py:74-83 — deserialize-first ordering
Valid micro-optimization; deferred to a follow-up.

aot_serialize.h:30-32 — per-algorithm kFormatVersion
Valid forward-compat observation. In practice, CCCL_VERSION (which bumps with every release) is mixed into the ABI hash alongside the algorithm name, so real-world layout breaks are caught at version boundaries. True per-algorithm versioning would be finer-grained; deferred to a follow-up.

@github-actions

This comment has been minimized.

@shwina shwina force-pushed the pr/aot-python-layer branch from bc079ac to dc94025 Compare June 25, 2026 17:22
@shwina shwina changed the title [cuda.compute]: Add .save() and .load() methods to enable ahead-of-time compilation workflows [cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows Jun 25, 2026
@github-actions

This comment has been minimized.

@shwina shwina force-pushed the pr/aot-python-layer branch from dc94025 to 99ae64e Compare June 25, 2026 20:09
@github-actions

This comment has been minimized.

// Version of the CCCL C parallel AoT blob format. Increment whenever a
// breaking change is made to the C parallel layer (build_result layouts,
// serialization wire format, etc.) independently of CCCL_VERSION.
#define CCCL_C_PARALLEL_VERSION 1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the ABI of a build result, this needs to change.

We should start thinking about semantic versioning for the C library.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you introducing this now to account for loading a build struct from a previous version?

@shwina

shwina commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbc0ef38-9a40-4993-a4bd-b157d2e01fc9

📥 Commits

Reviewing files that changed from the base of the PR and between c7719f8 and 99ae64e.

📒 Files selected for processing (29)
  • c/parallel/include/cccl/c/aot.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/src/aot.cpp
  • c/parallel/src/binary_search.cu
  • c/parallel/src/for.cu
  • c/parallel/src/histogram.cu
  • c/parallel/src/merge_sort.cu
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/reduce.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/segmented_reduce.cu
  • c/parallel/src/segmented_sort.cu
  • c/parallel/src/three_way_partition.cu
  • c/parallel/src/transform.cu
  • c/parallel/src/unique_by_key.cu
  • c/parallel/src/util/aot_serialize.h
  • c/parallel/test/test_bench_aot.cpp
  • c/parallel/test/test_reduce.cpp
✅ Files skipped from review due to trivial changes (1)
  • c/parallel/src/aot.cpp
🚧 Files skipped from review as they are similar to previous changes (13)
  • c/parallel/include/cccl/c/merge_sort.h
  • c/parallel/include/cccl/c/histogram.h
  • c/parallel/include/cccl/c/for.h
  • c/parallel/include/cccl/c/reduce.h
  • c/parallel/include/cccl/c/segmented_reduce.h
  • c/parallel/include/cccl/c/radix_sort.h
  • c/parallel/include/cccl/c/binary_search.h
  • c/parallel/include/cccl/c/three_way_partition.h
  • c/parallel/include/cccl/c/scan.h
  • c/parallel/include/cccl/c/unique_by_key.h
  • c/parallel/include/cccl/c/segmented_sort.h
  • c/parallel/include/cccl/c/transform.h
  • c/parallel/test/test_reduce.cpp

Comment thread c/parallel/src/binary_search.cu Outdated
Comment thread c/parallel/src/for.cu
Comment thread c/parallel/src/histogram.cu Outdated
Comment thread c/parallel/src/merge_sort.cu Outdated
Comment thread c/parallel/src/radix_sort.cu Outdated
Comment thread c/parallel/src/segmented_sort.cu Outdated
Comment thread c/parallel/src/three_way_partition.cu
Comment thread c/parallel/src/transform.cu
Comment thread c/parallel/src/unique_by_key.cu
Comment thread c/parallel/src/util/aot_serialize.h
@shwina shwina force-pushed the pr/aot-python-layer branch from 99ae64e to 8636ee0 Compare June 26, 2026 14:23
@github-actions

This comment has been minimized.

// Version of the CCCL C parallel AoT blob format. Increment whenever a
// breaking change is made to the C parallel layer (build_result layouts,
// serialization wire format, etc.) independently of CCCL_VERSION.
#define CCCL_C_PARALLEL_VERSION 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you introducing this now to account for loading a build struct from a previous version?

Comment thread c/parallel/src/util/aot_serialize.h Outdated
// allocated with new[] (matching cccl_aot_buffer_free, which does delete[]).
class buffer_writer
{
std::vector<char> data_;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we typically don't use the trailing underscore pattern

Comment thread c/parallel/src/util/aot_serialize.h Outdated
Comment on lines +120 to +121
const char* p_;
size_t remaining_;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we typically don't use the trailing underscore pattern

Comment thread c/parallel/src/reduce.cu Outdated
std::unique_ptr<char[]> n_reduction{r.read_cstring_dup()};
std::unique_ptr<char[]> n_nondeterministic{r.read_cstring_dup()};

build_ptr->cc = static_cast<int>(h.cc);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: this starts populating the caller-provided build result without first zeroing it or assigning through a local zero-initialized result. The public header says deserialized CUlibrary/CUkernel handles remain null until load and that failure leaves the build zeroed, but this path never clears nonserialized fields like library/single_tile_kernel/reduction_kernel, and the catch path returns without zeroing build_ptr. We should either update the comment or implement the behavior it currently describes. Same applies to all algorithms

Comment thread c/parallel/src/reduce.cu Outdated
throw std::runtime_error("aot blob: empty payload");
}

std::unique_ptr<cub::detail::reduce::policy_selector, decltype(&std::free)> policy(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: this ignores the serialized determinism value and always reconstructs runtime_policy as cub::detail::reduce::policy_selector. For CCCL_NOT_GUARANTEED, serialization stores a cub::detail::reduce_nondeterministic::policy_selector, and cccl_device_reduce_nondeterministic later casts runtime_policy back to that nondeterministic type.

Comment thread c/parallel/src/segmented_sort.cu Outdated
uint64_t code_n = r.read_pod<uint64_t>();
if (code_n > 0)
{
void* p = std::malloc(static_cast<size_t>(code_n));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: this raw allocation can leak if a later read or validation throws. For example, r.read_bytes at line 1110 can throw before p is assigned into op.code, the state-size check can throw after op.code is assigned, and the state read at line 1134 can throw after both allocations exist. Please hold code/state in local RAII owners with std::free deleters and release them into cccl_op_t only after the entire selector op has been successfully decoded


const auto key_t = read_type_info(r);
const auto offset_t = read_type_info(r);
const auto order = static_cast<cccl_sort_order_t>(r.read_pod<uint32_t>());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: in radix_sort we validate that the sort order is valid (either ASCENDING or DESCENDING) after reading it. For consistency we should do the same here

shwina and others added 9 commits June 28, 2026 21:51
…ithms

Builds on the compile/load split (NVIDIA#8484) by adding
cccl_device_<algo>_serialize / _deserialize for every C parallel
algorithm, plus a versioned on-disk blob format:

  * cccl/c/aot.h          - algorithm tags + cccl_aot_buffer_free
  * util/aot_serialize.h  - blob_header (magic + ABI hash + payload kind),
                            FNV-1a ABI hashing, bounds-checked reader/writer

A blob stores the compiled payload, runtime policy and lowered kernel
names; CUlibrary/CUkernel handles are reconstructed on load. The header's
magic / algo tag / format version / ABI hash are validated on deserialize
so stale or mismatched artifacts are rejected.

Lifetime fixes: deserialize runtime_policy uses malloc (matching cleanup);
serialize early-return paths zero *out_buf/*out_size.

Review feedback:
- Document cccl_aot_buffer_free ownership in all serialize/deserialize
  declarations (was already in reduce.h/scan.h, now in all 12 headers).
- Reject empty payloads in deserialize (crafted zero-length payload blob
  would deserialize to CUDA_SUCCESS with payload==nullptr).
- Empty byte input to Cython deserialize wrappers now raises RuntimeError
  before the unsafe &view[0] index rather than IndexError.
- assert -> ValueError for d_in_values/d_out_values mismatch in _MergeSort.

Reduce gains serialize/deserialize round-trip + bad-blob tests, and the
AoT first-execution latency benchmark (test_bench_aot.cpp) is added,
guarded to the v1 (NVRTC) backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…JIT skip

Deserialize catch handlers were calling memset(build_ptr, 0) without
first releasing any pre-existing resources in build_ptr. If the caller
passed a previously-populated result, its payload/library/kernel-name
allocations would be leaked. Add the matching cleanup call before the
memset in every *_deserialize catch handler (10 algorithms).

Also fix _MergeSort.deserialize in Python: was calling compile() instead
of compile_for_load(), which triggered a full JIT recompile on load and
defeated the purpose of AoT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add cccl_device_transform_cleanup(build_ptr) before std::memset in the
  cccl_device_transform_deserialize catch handler (same fix applied to the
  other 10 algorithms in the previous commit)
- Document in cccl_device_for_deserialize that LTOIR payloads require
  cccl_device_for_link_ltoir before cccl_device_for_load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These headers are used directly (std::is_trivially_copyable_v, size_t)
and should not rely on transitive includes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Groups of fixes, all in the C parallel layer:

Group 1 — Commit-on-success for deserialize (all 12 algos):
- Remove success-path memset(build_ptr) — contract is caller passes
  zeroed output; the memset was redundant on success and harmful if
  build_ptr carried existing owned data.
- Strip cleanup+memset from catch blocks — on failure build_ptr is
  left unchanged, avoiding destruction of the caller's valid result.
- binary_search: use a local temp result{} and commit *build_ptr = result
  after cccl_device_transform_deserialize succeeds, so the nested call
  cannot leave *build_ptr half-written on error.

Group 2 — Serialize output deterministic on failure (6 algos):
- Zero *out_buf/*out_size before the first throwing operation in
  serialize so any exception leaves the outputs null/zero rather than
  whatever the caller passed in.

Group 3 — Input validation:
- for.cu serialize: reject unsupported payload_kind and null/empty
  static_kernel_lowered_name before writing the blob header.
- for.cu deserialize: validate the read kernel name is non-empty.
- binary_search.cu deserialize: validate state_align is a non-zero
  power-of-two when state_size > 0.
- radix_sort.cu deserialize: reject unrecognized sort-order enum values.
- segmented_sort.cu deserialize: validate selector-op state blob length
  against op.size before allocating.
- transform.cu serialize+deserialize: validate runtime_policy_size
  against the two supported policy_selector layouts.

Group 4 — Test RAII (test_bench_aot.cpp):
- WITH_TEMP macro: wrap _d_tmp in a _CudaFreeGuard struct so device
  memory is released even when an intermediate REQUIRE throws.
- Build results: add BuildGuard<T,Cleanup> RAII template; wrap
  link_build / full_build / jit_build in guards for both reduce and
  scan sections so cleanup runs on REQUIRE failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egmented_sort, sort order validation

- reduce.cu: branch on determinism when allocating/reading runtime_policy
  so CCCL_NOT_GUARANTEED blobs use reduce_nondeterministic::policy_selector
  (the load path casts back to that type); also use local result{} and
  commit-on-success so unset fields (library, kernel handles) are zero
- segmented_sort.cu: hold code/state malloc in unique_ptr<void,free>
  inside deserialize_selector_op so a throw during read_bytes doesn't
  leak the allocation; add sort-order enum validation for consistency
  with radix_sort
- aot_serialize.h: remove trailing underscores from private member
  names (data_, p_, remaining_) per project style

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lize functions

Use a local zero-initialized result{} and assign *build_ptr = result at
the end of each deserialize function so that load-only fields (CUlibrary,
CUkernel handles) are always zero in the returned struct, regardless of
what the caller passed in.

Applies to: scan, transform, three_way_partition, unique_by_key,
histogram, merge_sort, segmented_reduce, radix_sort, segmented_sort.
(binary_search, for, and reduce already used this pattern.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shwina shwina force-pushed the pr/aot-python-layer branch from 9b5d575 to 172604e Compare June 29, 2026 01:51
@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 17m: Pass: 100%/20 | Total: 5h 32m | Max: 51m 55s | Hits: 82%/2034

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants