feat: reconnect on map(udf) failures#3488

Draft
adarsh0728 wants to merge 5 commits into main from map-reconnect
adarsh0728 commented Jun 30, 2026

Member

What this PR does / why we need it

  • Adds reconnect/redrive handling for map UDF failures so numa does not exit
  • Records map UDF failures through critical_error_total{reason="map_runtime_error"}
  • Persists runtime errors with udf as the fallback container name when the SDK error does not include one
  • Removes unit tests that attempted to model map panic recovery with in-process SDK servers, since actual panic recovery requires Kubernetes to restart the UDF sidecar. These scenarios will instead be covered by e2e tests as part of "e2e tests to cover UDF failure scenarios" (#3371)
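
For reviewers, the metric and fallback-name behaviour in the bullets above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `CriticalErrorCounter`, `RuntimeErrorRecord`, and `record_map_failure` are hypothetical names, the counter is a plain `HashMap` standing in for the real labelled metric, and treating an empty container name as missing is an assumption.

```rust
use std::collections::HashMap;

/// Fallback container name, as stated in the PR description.
const DEFAULT_CONTAINER: &str = "udf";
/// Label value used for map UDF failures, as stated in the PR description.
const MAP_RUNTIME_ERROR: &str = "map_runtime_error";

/// Hypothetical stand-in for a labelled counter such as `critical_error_total`.
#[derive(Default)]
struct CriticalErrorCounter {
    by_reason: HashMap<String, u64>,
}

impl CriticalErrorCounter {
    /// Increment the counter for the given `reason` label.
    fn inc(&mut self, reason: &str) {
        *self.by_reason.entry(reason.to_string()).or_insert(0) += 1;
    }

    /// Read the current value for a `reason` label (0 if never incremented).
    fn get(&self, reason: &str) -> u64 {
        self.by_reason.get(reason).copied().unwrap_or(0)
    }
}

/// Hypothetical shape of a persisted runtime error.
#[derive(Debug, PartialEq)]
struct RuntimeErrorRecord {
    container: String,
    message: String,
}

/// Count the failure and build the record, falling back to "udf" when the
/// SDK error carries no container name (assumption: blank counts as missing).
fn record_map_failure(
    counter: &mut CriticalErrorCounter,
    sdk_container: Option<&str>,
    message: &str,
) -> RuntimeErrorRecord {
    counter.inc(MAP_RUNTIME_ERROR);
    let container = sdk_container
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .unwrap_or(DEFAULT_CONTAINER);
    RuntimeErrorRecord {
        container: container.to_string(),
        message: message.to_string(),
    }
}
```

Calling `record_map_failure(&mut counter, None, "...")` yields a record whose container is `udf`, while an error that names its container keeps that name; both calls increment the `map_runtime_error` count.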

Main changes:

  • Adds reconnect config support for map UDF clients.
  • Handles unary map send/stream failures through reconnect and redrive.
  • Handles batch map send failures, EOT send failures, and partial EOT as redrivable errors.
  • Handles stream map request/response stream failures through reconnect and message redrive.
  • Adds common map redrive helpers for metrics, runtime error persistence, and reconnect client creation.
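
The reconnect-and-redrive flow listed above can be pictured with the simplified, synchronous sketch below. It is illustrative only: the real map clients are asynchronous gRPC clients, and `MapClient`, `MapError`, `ReconnectConfig`, `map_with_redrive`, and `FlakyClient` are hypothetical names that are not taken from this PR. The idea shown is that a failed send swaps in a freshly created client and resends the same message until the attempt budget is spent.

```rust
/// Hypothetical error type: a failure that is safe to redrive.
#[derive(Debug, PartialEq)]
enum MapError {
    Disconnected,
}

/// Hypothetical minimal view of a map UDF client.
trait MapClient {
    fn map(&mut self, msg: &str) -> Result<String, MapError>;
}

/// Hypothetical reconnect settings: how many reconnects are allowed.
struct ReconnectConfig {
    max_attempts: u32,
}

/// Send `msg`; on failure, replace the client via `reconnect` and redrive
/// the same message, giving up once `max_attempts` reconnects were used.
fn map_with_redrive<C, F>(
    client: &mut C,
    mut reconnect: F,
    cfg: &ReconnectConfig,
    msg: &str,
) -> Result<String, MapError>
where
    C: MapClient,
    F: FnMut() -> C,
{
    let mut attempts = 0;
    loop {
        match client.map(msg) {
            Ok(out) => return Ok(out),
            // Budget exhausted: surface the last error to the caller.
            Err(err) if attempts >= cfg.max_attempts => return Err(err),
            Err(_) => {
                attempts += 1;
                // Replace the broken client, then loop to resend `msg`.
                *client = reconnect();
            }
        }
    }
}

/// Demo client that fails a fixed number of calls before succeeding.
struct FlakyClient {
    failures_left: u32,
}

impl MapClient for FlakyClient {
    fn map(&mut self, msg: &str) -> Result<String, MapError> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err(MapError::Disconnected)
        } else {
            Ok(msg.to_uppercase())
        }
    }
}
```

With a `FlakyClient` that fails once and a `reconnect` closure that returns a healthy client, `map_with_redrive` reconnects a single time and returns the mapped message. Taking the client factory as a closure keeps the loop independent of how a connection is actually established.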

Related issues

Part of #3367 #3368

Testing

WIP

codecov Bot commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 73.99166% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.87%. Comparing base (948bf6b) to head (732f29d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/numaflow-core/src/mapper/map/stream.rs | 64.86% | 65 Missing ⚠️ |
| rust/numaflow-core/src/mapper/map/unary.rs | 60.93% | 50 Missing ⚠️ |
| rust/numaflow-core/src/mapper/map.rs | 59.57% | 38 Missing ⚠️ |
| rust/numaflow-core/src/mapper/map/batch.rs | 87.03% | 28 Missing ⚠️ |
| rust/numaflow-monitor/src/runtime.rs | 93.93% | 4 Missing ⚠️ |
| rust/numaflow-core/src/metrics/mod.rs | 0.00% | 1 Missing ⚠️ |
| rust/numaflow-core/src/source/user_defined.rs | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3488      +/-   ##
==========================================
- Coverage   83.11%   82.87%   -0.25%     
==========================================
  Files         308      308              
  Lines       80914    80954      +40     
==========================================
- Hits        67252    67089     -163     
- Misses      13097    13300     +203     
  Partials      565      565              

Signed-off-by: adarsh0728 <gooneriitk@gmail.com>
Signed-off-by: adarsh0728 <gooneriitk@gmail.com>
…w will be covered as part of e2e tests

Signed-off-by: adarsh0728 <gooneriitk@gmail.com>
Signed-off-by: adarsh0728 <gooneriitk@gmail.com>
Signed-off-by: adarsh0728 <gooneriitk@gmail.com>