[cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows#9568
[cccl.c]: Add serialize() and deserialize() functions to enable ahead-of-time compilation workflows#9568shwina wants to merge 9 commits into
serialize() and deserialize() functions to enable ahead-of-time compilation workflows#9568Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds experimental C AoT persistence for device build results: shared blob-format helpers and a free function, serialize/deserialize declarations, implementations for all algorithms, and benchmark/test coverage for round-trips and failures. ChangesAOT build-result persistence
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 19
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (9)
c/parallel/src/histogram.cu-26-26 (1)
26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion: use angle-bracket include syntax for the new util header.
As per coding guidelines, “All header inclusions must use the syntax
<header>.”Source: Coding guidelines
c/parallel/src/merge_sort.cu-28-28 (1)
28-28: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion: use angle-bracket include syntax for the new util header.
As per coding guidelines, “All header inclusions must use the syntax
<header>.”Source: Coding guidelines
c/parallel/src/radix_sort.cu-26-26 (1)
26-26: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion: use angle-bracket include syntax for the new util header.
As per coding guidelines, “All header inclusions must use the syntax
<header>.”Source: Coding guidelines
c/parallel/src/segmented_sort.cu-35-35 (1)
35-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion: use angle-bracket include syntax for the new util header.
As per coding guidelines, “All header inclusions must use the syntax
<header>.”Source: Coding guidelines
c/parallel/test/test_bench_aot.cpp-33-33 (1)
33-33: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winimportant:
#include "test_util.h"should use angle-bracket include form in this codebase to satisfy repository header-inclusion policy.As per coding guidelines, "All header inclusions must use the syntax
<header>".Source: Coding guidelines
python/cuda_cccl/tests/compute/test_merge_sort_aot.py-78-78 (1)
78-78: 🎯 Functional Correctness | 🟡 Minorimportant: Line 78 uses
np.argsort(..., stable=True), which requires NumPy 2.0.0+. Since cuda-cccl does not specify a minimum NumPy version, usekind="stable"instead, available since NumPy 1.15.0, to preserve stable ordering with broader compatibility.c/parallel/src/segmented_reduce.cu-34-35 (1)
34-35: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winsuggestion: Use angle-bracket syntax for these local utility includes.
The changed include block should use
<util/aot_serialize.h>and<util/nvjitlink.h>. As per coding guidelines, "All header inclusions must use the syntax<header>."Source: Coding guidelines
python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)
1175-1177: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winimportant: Guard empty blobs before taking
&view[0].An empty
.aotfile currently raisesIndexErrorin Cython before the native deserializer can returnCUDA_ERROR_INVALID_VALUE. Checkview.shape[0] == 0before indexing, or passNULLwith size0and let the backend produce the consistent deserialize error.Also applies to: 1541-1543, 1725-1727, 1905-1907, 2093-2095, 2287-2289, 2456-2458, 2566-2568, 2765-2767, 2932-2934, 3124-3126, 3318-3320
python/cuda_cccl/cuda/compute/algorithms/_scan.py-407-414 (1)
407-414: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winsuggestion: Give
load_inclusive_scanthe sameinit_value=Nonedefault asmake_inclusive_scanandinclusive_scan. The reconstruction path already handlesNone, but the public loader currently raisesTypeErrorunless callers passinit_value=Noneexplicitly for no-init inclusive scans.
🧹 Nitpick comments (3)
c/parallel/test/test_reduce.cpp (1)
972-974: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winsuggestion: The test says failure leaves the build zeroed, but it only checks
build.payload. Add checks for other pointer fields (for exampleruntime_policyandsingle_tile_kernel_lowered_name) so partial-deserialize regressions are caught.python/cuda_cccl/tests/compute/test_reduce_aot.py (1)
117-119: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winsuggestion: This test verifies the determinism field value but does not execute
loaded. Add one execution call plus output assertion so_from_serializedpath selection is validated behaviorally, not only via metadata.python/cuda_cccl/tests/compute/bench_aot.py (1)
83-127: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winsuggestion: Add lightweight correctness assertions in each timed path (reduce sum, scan boundary values, merge_sort first/last key). Without output checks, the benchmark can report latency for a semantically broken execution path.
Also applies to: 141-178, 197-241
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 49794fee-65b5-440b-809a-a43c062878f3
📒 Files selected for processing (57)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpppython/cuda_cccl/cuda/compute/__init__.pypython/cuda_cccl/cuda/compute/_bindings.pyipython/cuda_cccl/cuda/compute/_bindings_impl.pyxpython/cuda_cccl/cuda/compute/algorithms/__init__.pypython/cuda_cccl/cuda/compute/algorithms/_binary_search.pypython/cuda_cccl/cuda/compute/algorithms/_histogram.pypython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_scan.pypython/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_sort/__init__.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.pypython/cuda_cccl/cuda/compute/algorithms/_three_way_partition.pypython/cuda_cccl/cuda/compute/algorithms/_transform.pypython/cuda_cccl/cuda/compute/algorithms/_unique_by_key.pypython/cuda_cccl/tests/compute/bench_aot.pypython/cuda_cccl/tests/compute/test_binary_search_aot.pypython/cuda_cccl/tests/compute/test_histogram_aot.pypython/cuda_cccl/tests/compute/test_merge_sort_aot.pypython/cuda_cccl/tests/compute/test_radix_sort_aot.pypython/cuda_cccl/tests/compute/test_reduce_aot.pypython/cuda_cccl/tests/compute/test_scan_aot.pypython/cuda_cccl/tests/compute/test_segmented_reduce_aot.pypython/cuda_cccl/tests/compute/test_segmented_sort_aot.pypython/cuda_cccl/tests/compute/test_three_way_partition_aot.pypython/cuda_cccl/tests/compute/test_transform_aot.pypython/cuda_cccl/tests/compute/test_unique_by_key_aot.py
56a7df5 to
dd3980b
Compare
|
@coderabbitai full review |
✅ Action performedFull review finished. |
dd3980b to
ad6c8b1
Compare
There was a problem hiding this comment.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (3)
python/cuda_cccl/cuda/compute/_bindings_impl.pyx-1175-1177 (1)
1175-1177: 🎯 Functional Correctness | 🟡 Minorimportant: Guard
&view[0]indexing in deserialize methods to preserve backend error handling.Unconditional indexing of
view[0]causes an IndexError whenblobis empty (b""), preventing the backend deserializer from returning its intended error. Checkbuf_sizefirst and passNULLwhen zero:Diff for lines 1175–1177 (and apply to all 12 deserialize methods)
cdef const unsigned char[::1] view = blob - cdef const void* buf_ptr = <const void*>&view[0] cdef size_t buf_size = view.shape[0] + cdef const void* buf_ptr = NULL + if buf_size != 0: + buf_ptr = <const void*>&view[0]Applies to all twelve
deserialize()methods at lines 1176, 1542, 1726, 1906, 2094, 2288, 2457, 2567, 2766, 2933, 3125, 3319.Source: Path instructions
c/parallel/src/util/aot_serialize.h-13-19 (1)
13-19: 📐 Maintainability & Code Quality | 🟡 Minorimportant: add direct includes for
<type_traits>and<cstddef>. Line 94 usesstd::is_trivially_copyable_vand lines 81, 110, 119 usesize_twithout including the headers that define these symbols. Per coding guidelines, direct includes are required; relying on transitive includes is not allowed.Source: Coding guidelines
c/parallel/src/transform.cu-39-39 (1)
39-39: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winimportant: use angle-bracket syntax for this project header include.
#include "util/aot_serialize.h"violates the repository rule that all header inclusions use<header>syntax; use<util/aot_serialize.h>. As per coding guidelines, “All header inclusions must use the syntax<header>.”Source: Coding guidelines
🧹 Nitpick comments (6)
c/parallel/test/test_reduce.cpp (1)
972-973: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winsuggestion: Strengthen failure-state assertions in the bad-blob test. Only checking
build.payloadcan miss regressions where other owned pointers remain non-null after deserialize failure; also assertruntime_policyand lowered-name pointers are null to fully validate the zeroed-build contract.As per path instructions, focus on C API error/status handling and lifetime ownership for
c/**/*.Source: Path instructions
python/cuda_cccl/tests/compute/bench_aot.py (1)
47-62: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winsuggestion: Make the benchmark order-neutral to reduce cold-start bias. Running JIT first and AoT second in every trial systematically gives AoT the second-run advantage; run both orders (or alternate order across repetitions) and report median/mean per mode.
As per path instructions, focus on Python test/example correctness for
python/cuda_cccl/**/*.Source: Path instructions
python/cuda_cccl/cuda/compute/algorithms/_transform.py (1)
52-58: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winsuggestion: Deserialize the blob first, then rebuild iterators/op.
Line 52 and Line 135 currently do iterator conversion andmake_op_adapter(op).compile(...)before blob validation. Reordering to call_bindings.DeviceUnaryTransform.deserialize(blob)/_bindings.DeviceBinaryTransform.deserialize(blob)first avoids unnecessary work on invalid blobs and keeps the AOT load path leaner for callable ops. As per path instructions,python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.Also applies to: 135-143
Source: Path instructions
python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py (1)
74-83: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winsuggestion: Fail fast on blob parse before doing op compilation.
Line 74 initializes iterators/op state beforeDeviceUniqueByKeyBuildResult.deserialize(blob)at Line 82. Swapping this order makes corrupted-blob handling cheaper and avoids unnecessary compile/setup work in the deserialize path. As per path instructions,python/cuda_cccl/**/*: “Focus on ... JIT/NVRTC/nvJitLink behavior”.Source: Path instructions
c/parallel/src/util/aot_serialize.h (1)
30-32: 🗄️ Data Integrity & Integration | 🔵 Trivial | 🏗️ Heavy liftimportant: make the blob format version algorithm-specific.
kFormatVersionis shared by every algorithm even though the comment and blob header describe a per-algorithm layout version. A future layout bump for one algorithm would reject saved blobs for every other algorithm; pass the expected/written version from each algorithm or explicitly document this as a global format version.As per path instructions, focus on C API/ABI stability, error/status handling, lifetime ownership, runtime policy handling, build-result caching, C/C++ boundary behavior, and coverage for public C parallel APIs.
Also applies to: 240-267
Source: Path instructions
c/parallel/include/cccl/c/for.h (1)
81-85: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winsuggestion: document out-buffer ownership and deserialize failure semantics for this public C API. This declaration block currently omits the allocator-boundary contract (
cccl_aot_buffer_free) and output-state guarantees on deserialize failure, which increases misuse risk for external C callers; mirror the reduce.h contract text here (and in the other newly added algorithm headers for consistency). As per path instructions, “Focus on C API/ABI stability, error/status handling, lifetime ownership, [and] C/C++ boundary behavior.”Source: Path instructions
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ccb178f0-abaa-425c-a224-a7ebbc1cf703
📒 Files selected for processing (54)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpppython/cuda_cccl/cuda/compute/_bindings.pyipython/cuda_cccl/cuda/compute/_bindings_impl.pyxpython/cuda_cccl/cuda/compute/algorithms/_binary_search.pypython/cuda_cccl/cuda/compute/algorithms/_histogram.pypython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_scan.pypython/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.pypython/cuda_cccl/cuda/compute/algorithms/_three_way_partition.pypython/cuda_cccl/cuda/compute/algorithms/_transform.pypython/cuda_cccl/cuda/compute/algorithms/_unique_by_key.pypython/cuda_cccl/tests/compute/bench_aot.pypython/cuda_cccl/tests/compute/test_binary_search_aot.pypython/cuda_cccl/tests/compute/test_histogram_aot.pypython/cuda_cccl/tests/compute/test_merge_sort_aot.pypython/cuda_cccl/tests/compute/test_radix_sort_aot.pypython/cuda_cccl/tests/compute/test_reduce_aot.pypython/cuda_cccl/tests/compute/test_scan_aot.pypython/cuda_cccl/tests/compute/test_segmented_reduce_aot.pypython/cuda_cccl/tests/compute/test_segmented_sort_aot.pypython/cuda_cccl/tests/compute/test_three_way_partition_aot.pypython/cuda_cccl/tests/compute/test_transform_aot.pypython/cuda_cccl/tests/compute/test_unique_by_key_aot.py
ad6c8b1 to
08c3d03
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/cuda_cccl/tests/compute/bench_aot.py (1)
47-63: 🎯 Functional Correctness | 🔵 TrivialConsider alternating benchmark order to eliminate JIT warm-up bias. The
clear_all_caches()calls only clear Python-level function caches, not the CUDA runtime's module cache. Since JIT runs first (line 52), any compiled kernels and driver state remain in memory for the AoT path (line 59), potentially lowering AoT timing. Run both order sequences (JIT→AoT and AoT→JIT) and report median, or deserialize in isolated subprocesses for true cold-start measurement.Source: Path instructions
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a594ceb0-c175-4514-b3f6-b767133bf1dc
📒 Files selected for processing (54)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpppython/cuda_cccl/cuda/compute/_bindings.pyipython/cuda_cccl/cuda/compute/_bindings_impl.pyxpython/cuda_cccl/cuda/compute/algorithms/_binary_search.pypython/cuda_cccl/cuda/compute/algorithms/_histogram.pypython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_scan.pypython/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.pypython/cuda_cccl/cuda/compute/algorithms/_three_way_partition.pypython/cuda_cccl/cuda/compute/algorithms/_transform.pypython/cuda_cccl/cuda/compute/algorithms/_unique_by_key.pypython/cuda_cccl/tests/compute/bench_aot.pypython/cuda_cccl/tests/compute/test_binary_search_aot.pypython/cuda_cccl/tests/compute/test_histogram_aot.pypython/cuda_cccl/tests/compute/test_merge_sort_aot.pypython/cuda_cccl/tests/compute/test_radix_sort_aot.pypython/cuda_cccl/tests/compute/test_reduce_aot.pypython/cuda_cccl/tests/compute/test_scan_aot.pypython/cuda_cccl/tests/compute/test_segmented_reduce_aot.pypython/cuda_cccl/tests/compute/test_segmented_sort_aot.pypython/cuda_cccl/tests/compute/test_three_way_partition_aot.pypython/cuda_cccl/tests/compute/test_transform_aot.pypython/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (1)
- c/parallel/include/cccl/c/scan.h
🚧 Files skipped from review as they are similar to previous changes (49)
- python/cuda_cccl/tests/compute/test_merge_sort_aot.py
- c/parallel/src/aot.cpp
- c/parallel/include/cccl/c/for.h
- python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
- python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
- python/cuda_cccl/cuda/compute/algorithms/_histogram.py
- c/parallel/include/cccl/c/segmented_sort.h
- c/parallel/include/cccl/c/binary_search.h
- c/parallel/include/cccl/c/three_way_partition.h
- python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
- c/parallel/include/cccl/c/histogram.h
- c/parallel/include/cccl/c/merge_sort.h
- python/cuda_cccl/tests/compute/test_transform_aot.py
- c/parallel/include/cccl/c/radix_sort.h
- python/cuda_cccl/tests/compute/test_reduce_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
- c/parallel/include/cccl/c/reduce.h
- c/parallel/src/for.cu
- python/cuda_cccl/cuda/compute/algorithms/_transform.py
- python/cuda_cccl/tests/compute/test_binary_search_aot.py
- c/parallel/include/cccl/c/transform.h
- python/cuda_cccl/tests/compute/test_histogram_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
- python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
- python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
- c/parallel/include/cccl/c/unique_by_key.h
- c/parallel/src/radix_sort.cu
- python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
- python/cuda_cccl/cuda/compute/algorithms/_scan.py
- c/parallel/include/cccl/c/segmented_reduce.h
- c/parallel/src/segmented_sort.cu
- python/cuda_cccl/cuda/compute/_bindings.pyi
- python/cuda_cccl/cuda/compute/algorithms/_reduce.py
- c/parallel/src/reduce.cu
- c/parallel/src/unique_by_key.cu
- python/cuda_cccl/tests/compute/test_scan_aot.py
- c/parallel/src/merge_sort.cu
- c/parallel/src/binary_search.cu
- c/parallel/src/util/aot_serialize.h
- c/parallel/src/scan.cu
- c/parallel/src/transform.cu
- c/parallel/src/segmented_reduce.cu
- c/parallel/test/test_reduce.cpp
- c/parallel/src/three_way_partition.cu
- c/parallel/src/histogram.cu
- python/cuda_cccl/cuda/compute/_bindings_impl.pyx
This comment has been minimized.
This comment has been minimized.
08c3d03 to
ca828a4
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7560161-8bed-43e8-8952-5e8ca57c4952
📒 Files selected for processing (59)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpppython/cuda_cccl/CMakeLists.txtpython/cuda_cccl/cuda/compute/_bindings.pyipython/cuda_cccl/cuda/compute/_bindings_aot_v1.pxipython/cuda_cccl/cuda/compute/_bindings_aot_v2.pxipython/cuda_cccl/cuda/compute/_bindings_impl.pyxpython/cuda_cccl/cuda/compute/_jit.pypython/cuda_cccl/cuda/compute/algorithms/_binary_search.pypython/cuda_cccl/cuda/compute/algorithms/_histogram.pypython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_scan.pypython/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.pypython/cuda_cccl/cuda/compute/algorithms/_three_way_partition.pypython/cuda_cccl/cuda/compute/algorithms/_transform.pypython/cuda_cccl/cuda/compute/algorithms/_unique_by_key.pypython/cuda_cccl/cuda/compute/op.pypython/cuda_cccl/tests/compute/bench_aot.pypython/cuda_cccl/tests/compute/test_binary_search_aot.pypython/cuda_cccl/tests/compute/test_histogram_aot.pypython/cuda_cccl/tests/compute/test_merge_sort_aot.pypython/cuda_cccl/tests/compute/test_radix_sort_aot.pypython/cuda_cccl/tests/compute/test_reduce_aot.pypython/cuda_cccl/tests/compute/test_scan_aot.pypython/cuda_cccl/tests/compute/test_segmented_reduce_aot.pypython/cuda_cccl/tests/compute/test_segmented_sort_aot.pypython/cuda_cccl/tests/compute/test_three_way_partition_aot.pypython/cuda_cccl/tests/compute/test_transform_aot.pypython/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (3)
- python/cuda_cccl/CMakeLists.txt
- python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
- python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
🚧 Files skipped from review as they are similar to previous changes (49)
- c/parallel/include/cccl/c/three_way_partition.h
- c/parallel/src/aot.cpp
- python/cuda_cccl/cuda/compute/algorithms/_histogram.py
- python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
- c/parallel/include/cccl/c/reduce.h
- python/cuda_cccl/tests/compute/test_histogram_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
- c/parallel/include/cccl/c/segmented_reduce.h
- python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
- python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
- c/parallel/include/cccl/c/histogram.h
- c/parallel/include/cccl/c/binary_search.h
- python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
- c/parallel/include/cccl/c/segmented_sort.h
- c/parallel/include/cccl/c/unique_by_key.h
- python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
- c/parallel/include/cccl/c/transform.h
- python/cuda_cccl/tests/compute/test_binary_search_aot.py
- python/cuda_cccl/tests/compute/test_transform_aot.py
- c/parallel/src/binary_search.cu
- c/parallel/include/cccl/c/merge_sort.h
- python/cuda_cccl/tests/compute/test_radix_sort_aot.py
- c/parallel/include/cccl/c/radix_sort.h
- python/cuda_cccl/tests/compute/test_merge_sort_aot.py
- python/cuda_cccl/tests/compute/bench_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_reduce.py
- python/cuda_cccl/tests/compute/test_scan_aot.py
- python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
- python/cuda_cccl/tests/compute/test_reduce_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_transform.py
- c/parallel/src/segmented_reduce.cu
- c/parallel/src/scan.cu
- c/parallel/src/reduce.cu
- c/parallel/src/three_way_partition.cu
- c/parallel/src/radix_sort.cu
- python/cuda_cccl/cuda/compute/_bindings.pyi
- c/parallel/include/cccl/c/scan.h
- c/parallel/src/segmented_sort.cu
- c/parallel/src/histogram.cu
- c/parallel/test/test_reduce.cpp
- python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
- c/parallel/src/unique_by_key.cu
- c/parallel/src/util/aot_serialize.h
- python/cuda_cccl/cuda/compute/algorithms/_scan.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
- python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
- c/parallel/src/merge_sort.cu
- c/parallel/src/transform.cu
ca828a4 to
c7719f8
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
♻️ Duplicate comments (2)
c/parallel/test/test_bench_aot.cpp (2)
73-89: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winimportant:
WITH_TEMPfrees_d_tmponly on the success path; a failingREQUIREaftercudaMallocskipscudaFree, leaking device memory into subsequent tests. Use an RAII guard for the temp allocation.
122-210: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftimportant: build-result handles (
link_build/full_build/jit_build) are cleaned up only on the success path; any failingREQUIREbetween compile/load and_cleanupleaks payload/library/device resources. Wrap each handle in a scope guard so cleanup always runs.Also applies to: 237-331, 361-455.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c3294930-3ff1-4cd6-9e0d-e4afabd394c4
📒 Files selected for processing (59)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpppython/cuda_cccl/CMakeLists.txtpython/cuda_cccl/cuda/compute/_bindings.pyipython/cuda_cccl/cuda/compute/_bindings_aot_v1.pxipython/cuda_cccl/cuda/compute/_bindings_aot_v2.pxipython/cuda_cccl/cuda/compute/_bindings_impl.pyxpython/cuda_cccl/cuda/compute/_jit.pypython/cuda_cccl/cuda/compute/algorithms/_binary_search.pypython/cuda_cccl/cuda/compute/algorithms/_histogram.pypython/cuda_cccl/cuda/compute/algorithms/_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_scan.pypython/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.pypython/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.pypython/cuda_cccl/cuda/compute/algorithms/_three_way_partition.pypython/cuda_cccl/cuda/compute/algorithms/_transform.pypython/cuda_cccl/cuda/compute/algorithms/_unique_by_key.pypython/cuda_cccl/cuda/compute/op.pypython/cuda_cccl/tests/compute/bench_aot.pypython/cuda_cccl/tests/compute/test_binary_search_aot.pypython/cuda_cccl/tests/compute/test_histogram_aot.pypython/cuda_cccl/tests/compute/test_merge_sort_aot.pypython/cuda_cccl/tests/compute/test_radix_sort_aot.pypython/cuda_cccl/tests/compute/test_reduce_aot.pypython/cuda_cccl/tests/compute/test_scan_aot.pypython/cuda_cccl/tests/compute/test_segmented_reduce_aot.pypython/cuda_cccl/tests/compute/test_segmented_sort_aot.pypython/cuda_cccl/tests/compute/test_three_way_partition_aot.pypython/cuda_cccl/tests/compute/test_transform_aot.pypython/cuda_cccl/tests/compute/test_unique_by_key_aot.py
✅ Files skipped from review due to trivial changes (4)
- c/parallel/src/aot.cpp
- python/cuda_cccl/tests/compute/test_scan_aot.py
- python/cuda_cccl/tests/compute/bench_aot.py
- python/cuda_cccl/cuda/compute/_bindings_aot_v1.pxi
🚧 Files skipped from review as they are similar to previous changes (53)
- python/cuda_cccl/cuda/compute/op.py
- python/cuda_cccl/tests/compute/test_radix_sort_aot.py
- c/parallel/include/cccl/c/scan.h
- python/cuda_cccl/CMakeLists.txt
- python/cuda_cccl/tests/compute/test_binary_search_aot.py
- c/parallel/include/cccl/c/three_way_partition.h
- python/cuda_cccl/tests/compute/test_histogram_aot.py
- c/parallel/include/cccl/c/segmented_sort.h
- python/cuda_cccl/cuda/compute/algorithms/_unique_by_key.py
- python/cuda_cccl/cuda/compute/_bindings_aot_v2.pxi
- c/parallel/include/cccl/c/unique_by_key.h
- c/parallel/include/cccl/c/segmented_reduce.h
- c/parallel/include/cccl/c/transform.h
- python/cuda_cccl/cuda/compute/algorithms/_sort/_radix_sort.py
- python/cuda_cccl/cuda/compute/algorithms/_histogram.py
- c/parallel/include/cccl/c/binary_search.h
- c/parallel/include/cccl/c/for.h
- c/parallel/include/cccl/c/radix_sort.h
- python/cuda_cccl/tests/compute/test_merge_sort_aot.py
- c/parallel/include/cccl/c/merge_sort.h
- python/cuda_cccl/cuda/compute/algorithms/_three_way_partition.py
- python/cuda_cccl/cuda/compute/algorithms/_transform.py
- python/cuda_cccl/tests/compute/test_segmented_sort_aot.py
- c/parallel/src/for.cu
- python/cuda_cccl/tests/compute/test_segmented_reduce_aot.py
- c/parallel/src/merge_sort.cu
- python/cuda_cccl/cuda/compute/algorithms/_segmented_reduce.py
- python/cuda_cccl/tests/compute/test_three_way_partition_aot.py
- python/cuda_cccl/cuda/compute/algorithms/_binary_search.py
- python/cuda_cccl/cuda/compute/algorithms/_sort/_segmented_sort.py
- python/cuda_cccl/tests/compute/test_reduce_aot.py
- c/parallel/include/cccl/c/reduce.h
- c/parallel/include/cccl/c/histogram.h
- python/cuda_cccl/cuda/compute/algorithms/_sort/_merge_sort.py
- python/cuda_cccl/cuda/compute/algorithms/_scan.py
- python/cuda_cccl/cuda/compute/algorithms/_reduce.py
- c/parallel/src/scan.cu
- python/cuda_cccl/tests/compute/test_unique_by_key_aot.py
- python/cuda_cccl/cuda/compute/_jit.py
- c/parallel/src/util/aot_serialize.h
- c/parallel/src/transform.cu
- c/parallel/src/segmented_sort.cu
- python/cuda_cccl/tests/compute/test_transform_aot.py
- c/parallel/src/binary_search.cu
- c/parallel/src/reduce.cu
- c/parallel/src/radix_sort.cu
- c/parallel/src/histogram.cu
- c/parallel/src/three_way_partition.cu
- c/parallel/test/test_reduce.cpp
- c/parallel/src/unique_by_key.cu
- python/cuda_cccl/cuda/compute/_bindings.pyi
- c/parallel/src/segmented_reduce.cu
- python/cuda_cccl/cuda/compute/_bindings_impl.pyx
262c509 to
29a5dfb
Compare
|
Addressing the minor/nitpick comments from the review summaries:
Angle-bracket vs. quote include syntax (
|
This comment has been minimized.
This comment has been minimized.
bc079ac to
dc94025
Compare
.save() and .load() methods to enable ahead-of-time compilation workflowsserialize() and deserialize() functions to enable ahead-of-time compilation workflows
This comment has been minimized.
This comment has been minimized.
dc94025 to
99ae64e
Compare
This comment has been minimized.
This comment has been minimized.
| // Version of the CCCL C parallel AoT blob format. Increment whenever a | ||
| // breaking change is made to the C parallel layer (build_result layouts, | ||
| // serialization wire format, etc.) independently of CCCL_VERSION. | ||
| #define CCCL_C_PARALLEL_VERSION 1 |
There was a problem hiding this comment.
If we change the ABI of a build result, this needs to change.
We should start thinking about semantic versioning for the C library.
There was a problem hiding this comment.
Are you introducing this now to account for loading a build struct from a previous version?
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 13
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cbc0ef38-9a40-4993-a4bd-b157d2e01fc9
📒 Files selected for processing (29)
c/parallel/include/cccl/c/aot.hc/parallel/include/cccl/c/binary_search.hc/parallel/include/cccl/c/for.hc/parallel/include/cccl/c/histogram.hc/parallel/include/cccl/c/merge_sort.hc/parallel/include/cccl/c/radix_sort.hc/parallel/include/cccl/c/reduce.hc/parallel/include/cccl/c/scan.hc/parallel/include/cccl/c/segmented_reduce.hc/parallel/include/cccl/c/segmented_sort.hc/parallel/include/cccl/c/three_way_partition.hc/parallel/include/cccl/c/transform.hc/parallel/include/cccl/c/unique_by_key.hc/parallel/src/aot.cppc/parallel/src/binary_search.cuc/parallel/src/for.cuc/parallel/src/histogram.cuc/parallel/src/merge_sort.cuc/parallel/src/radix_sort.cuc/parallel/src/reduce.cuc/parallel/src/scan.cuc/parallel/src/segmented_reduce.cuc/parallel/src/segmented_sort.cuc/parallel/src/three_way_partition.cuc/parallel/src/transform.cuc/parallel/src/unique_by_key.cuc/parallel/src/util/aot_serialize.hc/parallel/test/test_bench_aot.cppc/parallel/test/test_reduce.cpp
✅ Files skipped from review due to trivial changes (1)
- c/parallel/src/aot.cpp
🚧 Files skipped from review as they are similar to previous changes (13)
- c/parallel/include/cccl/c/merge_sort.h
- c/parallel/include/cccl/c/histogram.h
- c/parallel/include/cccl/c/for.h
- c/parallel/include/cccl/c/reduce.h
- c/parallel/include/cccl/c/segmented_reduce.h
- c/parallel/include/cccl/c/radix_sort.h
- c/parallel/include/cccl/c/binary_search.h
- c/parallel/include/cccl/c/three_way_partition.h
- c/parallel/include/cccl/c/scan.h
- c/parallel/include/cccl/c/unique_by_key.h
- c/parallel/include/cccl/c/segmented_sort.h
- c/parallel/include/cccl/c/transform.h
- c/parallel/test/test_reduce.cpp
99ae64e to
8636ee0
Compare
This comment has been minimized.
This comment has been minimized.
| // Version of the CCCL C parallel AoT blob format. Increment whenever a | ||
| // breaking change is made to the C parallel layer (build_result layouts, | ||
| // serialization wire format, etc.) independently of CCCL_VERSION. | ||
| #define CCCL_C_PARALLEL_VERSION 1 |
There was a problem hiding this comment.
Are you introducing this now to account for loading a build struct from a previous version?
| // allocated with new[] (matching cccl_aot_buffer_free, which does delete[]). | ||
| class buffer_writer | ||
| { | ||
| std::vector<char> data_; |
There was a problem hiding this comment.
Nit: we typically don't use the trailing underscore pattern
| const char* p_; | ||
| size_t remaining_; |
There was a problem hiding this comment.
Nit: we typically don't use the trailing underscore pattern
| std::unique_ptr<char[]> n_reduction{r.read_cstring_dup()}; | ||
| std::unique_ptr<char[]> n_nondeterministic{r.read_cstring_dup()}; | ||
|
|
||
| build_ptr->cc = static_cast<int>(h.cc); |
There was a problem hiding this comment.
Important: this starts populating the caller-provided build result without first zeroing it or assigning through a local zero-initialized result. The public header says deserialized CUlibrary/CUkernel handles remain null until load and that failure leaves the build zeroed, but this path never clears nonserialized fields like library/single_tile_kernel/reduction_kernel, and the catch path returns without zeroing build_ptr. We should either update the comment or implement the behavior it currently describes. Same applies to all algorithms
| throw std::runtime_error("aot blob: empty payload"); | ||
| } | ||
|
|
||
| std::unique_ptr<cub::detail::reduce::policy_selector, decltype(&std::free)> policy( |
There was a problem hiding this comment.
Critical: this ignores the serialized determinism value and always reconstructs runtime_policy as cub::detail::reduce::policy_selector. For CCCL_NOT_GUARANTEED, serialization stores a cub::detail::reduce_nondeterministic::policy_selector, and cccl_device_reduce_nondeterministic later casts runtime_policy back to that nondeterministic type.
| uint64_t code_n = r.read_pod<uint64_t>(); | ||
| if (code_n > 0) | ||
| { | ||
| void* p = std::malloc(static_cast<size_t>(code_n)); |
There was a problem hiding this comment.
Important: this raw allocation can leak if a later read or validation throws. For example, r.read_bytes at line 1110 can throw before p is assigned into op.code, the state-size check can throw after op.code is assigned, and the state read at line 1134 can throw after both allocations exist. Please hold code/state in local RAII owners with std::free deleters and release them into cccl_op_t only after the entire selector op has been successfully decoded
|
|
||
| const auto key_t = read_type_info(r); | ||
| const auto offset_t = read_type_info(r); | ||
| const auto order = static_cast<cccl_sort_order_t>(r.read_pod<uint32_t>()); |
There was a problem hiding this comment.
Suggestion: in radix_sort we validate that the sort order is valid (either ASCENDING or DESCENDING) after reading it. For consistency we should do the same here
…ithms Builds on the compile/load split (NVIDIA#8484) by adding cccl_device_<algo>_serialize / _deserialize for every C parallel algorithm, plus a versioned on-disk blob format: * cccl/c/aot.h - algorithm tags + cccl_aot_buffer_free * util/aot_serialize.h - blob_header (magic + ABI hash + payload kind), FNV-1a ABI hashing, bounds-checked reader/writer A blob stores the compiled payload, runtime policy and lowered kernel names; CUlibrary/CUkernel handles are reconstructed on load. The header's magic / algo tag / format version / ABI hash are validated on deserialize so stale or mismatched artifacts are rejected. Lifetime fixes: deserialize runtime_policy uses malloc (matching cleanup); serialize early-return paths zero *out_buf/*out_size. Review feedback: - Document cccl_aot_buffer_free ownership in all serialize/deserialize declarations (was already in reduce.h/scan.h, now in all 12 headers). - Reject empty payloads in deserialize (crafted zero-length payload blob would deserialize to CUDA_SUCCESS with payload==nullptr). - Empty byte input to Cython deserialize wrappers now raises RuntimeError before the unsafe &view[0] index rather than IndexError. - assert -> ValueError for d_in_values/d_out_values mismatch in _MergeSort. Reduce gains serialize/deserialize round-trip + bad-blob tests, and the AoT first-execution latency benchmark (test_bench_aot.cpp) is added, guarded to the v1 (NVRTC) backend. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…JIT skip Deserialize catch handlers were calling memset(build_ptr, 0) without first releasing any pre-existing resources in build_ptr. If the caller passed a previously-populated result, its payload/library/kernel-name allocations would be leaked. Add the matching cleanup call before the memset in every *_deserialize catch handler (10 algorithms). Also fix _MergeSort.deserialize in Python: was calling compile() instead of compile_for_load(), which triggered a full JIT recompile on load and defeated the purpose of AoT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add cccl_device_transform_cleanup(build_ptr) before std::memset in the cccl_device_transform_deserialize catch handler (same fix applied to the other 10 algorithms in the previous commit) - Document in cccl_device_for_deserialize that LTOIR payloads require cccl_device_for_link_ltoir before cccl_device_for_load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These headers are used directly (std::is_trivially_copyable_v, size_t) and should not rely on transitive includes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Groups of fixes, all in the C parallel layer:
Group 1 — Commit-on-success for deserialize (all 12 algos):
- Remove success-path memset(build_ptr) — contract is caller passes
zeroed output; the memset was redundant on success and harmful if
build_ptr carried existing owned data.
- Strip cleanup+memset from catch blocks — on failure build_ptr is
left unchanged, avoiding destruction of the caller's valid result.
- binary_search: use a local temp result{} and commit *build_ptr = result
after cccl_device_transform_deserialize succeeds, so the nested call
cannot leave *build_ptr half-written on error.
Group 2 — Serialize output deterministic on failure (6 algos):
- Zero *out_buf/*out_size before the first throwing operation in
serialize so any exception leaves the outputs null/zero rather than
whatever the caller passed in.
Group 3 — Input validation:
- for.cu serialize: reject unsupported payload_kind and null/empty
static_kernel_lowered_name before writing the blob header.
- for.cu deserialize: validate the read kernel name is non-empty.
- binary_search.cu deserialize: validate state_align is a non-zero
power-of-two when state_size > 0.
- radix_sort.cu deserialize: reject unrecognized sort-order enum values.
- segmented_sort.cu deserialize: validate selector-op state blob length
against op.size before allocating.
- transform.cu serialize+deserialize: validate runtime_policy_size
against the two supported policy_selector layouts.
Group 4 — Test RAII (test_bench_aot.cpp):
- WITH_TEMP macro: wrap _d_tmp in a _CudaFreeGuard struct so device
memory is released even when an intermediate REQUIRE throws.
- Build results: add BuildGuard<T,Cleanup> RAII template; wrap
link_build / full_build / jit_build in guards for both reduce and
scan sections so cleanup runs on REQUIRE failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egmented_sort, sort order validation
- reduce.cu: branch on determinism when allocating/reading runtime_policy
so CCCL_NOT_GUARANTEED blobs use reduce_nondeterministic::policy_selector
(the load path casts back to that type); also use local result{} and
commit-on-success so unset fields (library, kernel handles) are zero
- segmented_sort.cu: hold code/state malloc in unique_ptr<void,free>
inside deserialize_selector_op so a throw during read_bytes doesn't
leak the allocation; add sort-order enum validation for consistency
with radix_sort
- aot_serialize.h: remove trailing underscores from private member
names (data_, p_, remaining_) per project style
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lize functions
Use a local zero-initialized result{} and assign *build_ptr = result at
the end of each deserialize function so that load-only fields (CUlibrary,
CUkernel handles) are always zero in the returned struct, regardless of
what the caller passed in.
Applies to: scan, transform, three_way_partition, unique_by_key,
histogram, merge_sort, segmented_reduce, radix_sort, segmented_sort.
(binary_search, for, and reduce already used this pattern.)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9b5d575 to
172604e
Compare
🥳 CI Workflow Results🟩 Finished in 1h 17m: Pass: 100%/20 | Total: 5h 32m | Max: 51m 55s | Hits: 82%/2034See results here. |
Description
This PR adds
cccl_<algo>_serializeandcccl_<algo>_deserializeAPIs to all algorithms, which serialize/deserialize the<algo>_build_result_tstructs into bytes. Callers can, e.g., write/read these bytes to disk to implement an ahead-of-time compilation workflow; or send bytes across the wire in a multi-node workflow.Some notes:
v1,v2build structs have a totally different composition - we'll deal with that later.CCCL_C_PARALLEL_VERSIONdefined inaot.hrepresents the ABI version. Any changes in the ABI of the build structs should result in this version being bumped (currently set to1).v1is on its way out.reduce_into. There will be a follow-up Python PR with more extensive testing.Checklist