Add tensorrt_multi_device_inference operator (multi-GPU TensorRT Multi-Device) #1631

Draft
pkisfaludi-nv wants to merge 2 commits into nvidia-holoscan:main from pkisfaludi-nv:feature/tensorrt-multi-device-inference

@pkisfaludi-nv

Summary

Adds a self-contained tensorrt_multi_device_inference operator (plus a minimal sample app) that runs a single TensorRT engine sharded across ≥2 GPUs via TensorRT Multi-Device (NCCL DistCollective + IExecutionContext::setCommunicator), so one operator drives N GPUs. It targets models that are too large for one GPU or that benefit from tensor parallelism. (TRT-28040)

It wraps a hardware-validated MultiDeviceTrt core: ncclCommInitAll → per-rank deserialize → concurrent setCommunicator → host-bounce input replication → fan-out enqueueV3 → rank-0 output. It does not depend on the SDK's HoloInfer/InferenceOp.
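
To make the control flow above concrete, here is a minimal host-only sketch of the replicate, fan-out, and rank-0-output steps. It is not the MultiDeviceTrt core and makes no NCCL or TensorRT calls: `RankFn` and `run_fan_out` are hypothetical names, and each "rank" is an ordinary std::thread standing in for the per-GPU enqueue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one rank's work. In the real operator this would
// be the per-GPU engine execution; here it is any host-side callable.
using RankFn =
    std::function<std::vector<float>(int rank, const std::vector<float>& input)>;

// Replicates the host input to every rank, runs all ranks concurrently,
// and returns only rank 0's output (mirroring the "rank-0 output" step).
std::vector<float> run_fan_out(int num_ranks,
                               const std::vector<float>& host_input,
                               const RankFn& fn) {
  assert(num_ranks >= 1);
  const auto n = static_cast<std::size_t>(num_ranks);

  // Input replication: every rank gets its own copy of the host buffer.
  std::vector<std::vector<float>> rank_inputs(n, host_input);
  std::vector<std::vector<float>> rank_outputs(n);

  // Fan-out: one worker per rank, all started before any is joined.
  std::vector<std::thread> workers;
  workers.reserve(n);
  for (int r = 0; r < num_ranks; ++r) {
    workers.emplace_back([&, r] { rank_outputs[r] = fn(r, rank_inputs[r]); });
  }
  for (auto& w : workers) {
    w.join();
  }

  // Only rank 0's result is handed downstream.
  return rank_outputs[0];
}
```

Each worker writes to its own slot in `rank_outputs`, so the sketch needs no locking. A real implementation would additionally have to synchronize the per-device streams before reading the rank-0 buffer.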

What's added

  • operators/tensorrt_multi_device_inference/ — operator (TensorRtMultiDeviceInferenceOp) + the reused MD core + metadata.json + CMakeLists.txt + README.md; registered in operators/CMakeLists.txt.
  • applications/multi_device_inference/ — a minimal source → MD inference → checksum sink C++ demo + config + metadata + README; registered in applications/CMakeLists.txt.

Requirements (please review)

  • TensorRT ≥ 11.0 (Multi-Device is GA in TensorRT 11) and NCCL — these are not in the stock Holoscan/HoloHub container (TRT 10). Open question for maintainers: how would you prefer to provide a TRT-11 + NCCL build environment for this operator (a dedicated Dockerfile stage, a CI image, or gating)? I did not modify the shared Dockerfile per AGENTS.md.
  • ≥ 2 homogeneous GPUs (SM80+); engine(s) sharded offline.

Validation status

  • Draft. The Multi-Device runtime core (multidevice.cpp) is validated on 2× NVIDIA B200 (TensorRT 11.1): a tensor-parallel MLP sharded across 2 GPUs matched the 1-GPU reference (max_rel 1.24e-05).
  • The HoloHub operator/app wrapper has not been built in CI yet (needs the TRT-11 + NCCL container above) — clang-format and metadata structure pass locally. Marking draft until the build environment question is resolved and CI is green.
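
For reviewers who want to reproduce a comparison like the max_rel figure above, the following is one possible way to compute a maximum relative error between the 1-GPU reference and the sharded output. The PR does not state the exact formula, so the definition used here (element-wise |test - ref| / max(|ref|, eps)), the helper name `max_rel_error`, and the `eps` guard are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest element-wise relative error of `test`
// measured against `ref`. `eps` is an assumed guard that avoids dividing
// by zero when a reference value is (near) zero.
double max_rel_error(const std::vector<float>& ref,
                     const std::vector<float>& test,
                     double eps = 1e-12) {
  assert(ref.size() == test.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    const double r = ref[i];
    const double t = test[i];
    const double rel = std::abs(t - r) / std::max(std::abs(r), eps);
    worst = std::max(worst, rel);
  }
  return worst;
}
```

Accumulating in double keeps the metric itself from adding float rounding error on top of the difference being measured.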

DCO signed-off. Companion Holoscan-SDK MR (HoloInfer-internal variant): TRT-28040 / holoscan-sdk!4577.

🤖 Generated with Claude Code

A self-contained HoloHub operator that runs a single TensorRT engine sharded
across >=2 GPUs via TensorRT Multi-Device (NCCL DistCollective + setCommunicator),
so one operator drives N GPUs (TRT-28040). Wraps the hardware-validated
MultiDeviceTrt core (ncclCommInitAll -> per-rank deserialize -> concurrent
setCommunicator -> host-bounce input replication -> fan-out enqueueV3).

- operators/tensorrt_multi_device_inference/: operator + reused MD core +
  metadata.json + CMakeLists + README; registered in operators/CMakeLists.txt.
- applications/multi_device_inference/: minimal source -> MD inference -> checksum
  sink demo (cpp), config, metadata, README; registered in applications/CMakeLists.txt.

Requires TensorRT >= 11 (Multi-Device GA), NCCL, and >= 2 homogeneous GPUs (SM80+).
MD core validated on 2x B200 (TensorRT 11.1): TP-sharded MLP across 2 GPUs vs the
1-GPU reference, max_rel 1.24e-05.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>