Fix bundle finalizer callbacks for SDF #39075
Summary of Changes (Gemini Code Assist)
This pull request addresses an issue with bundle finalizer callbacks in the Dataflow streaming worker. Finalize IDs are now captured during the commit process and queued for application, so that bundle finalization logic is triggered once work completes successfully.
Code Review
This pull request updates the CompleteCommit class to include finalizeIds extracted from the commit request. The StreamingDataflowWorker is updated to queue these applied finalize IDs upon a successful commit status. Additionally, StreamingApplianceWorkCommitter now passes the finalize IDs when creating a CompleteCommit. A trivial modification was also made to trigger the Dataflow Streaming post-commit tests. There are no review comments, so I have no feedback to provide.
Assigning reviewers:
R: @shunping added as fallback since no labels match configuration
The PR bot will only process comments in the main thread (not review comments).
R: @arunpandianp could you please take a look? FYI, the tests have become very flaky in recent weeks.
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
@aIbrahiim could you please explain a bit about what caused the flaky tests?
Yes sure, just updated the PR |
R: @acrites
I currently prefer #39092, which just disables the tests for now, since SDF support in the Dataflow classic streaming runner is not complete yet.
Fixes: #38710
Successful run: https://github.com/apache/beam/actions/runs/28039649218/job/83002285984?pr=39075
These tests were failing:
SplittableDoFnTest.testBundleFinalizationOccursOnBoundedSplittableDoFn
SplittableDoFnTest.testBundleFinalizationOccursOnUnboundedSplittableDoFn
They often timed out after 900 seconds with the job still in the RUNNING state.
The tests use a Splittable DoFn that registers a BundleFinalizer callback and keeps checkpointing with resume() until that callback runs. If the callback never runs, the DoFn never outputs, the pipeline never finishes, and the test hits the timeout.
The problem was in the Dataflow legacy streaming worker. Bundle-finalizer callbacks were only run when Windmill sent the finalize IDs back on a later work item (via source_state.finalize_ids or applied_finalize_ids in GetWorkResponse). Windmill documents this as best effort: the commit can succeed, but the IDs may never come back.
For Splittable DoFns with many resume() checkpoints, each step does a new commit and then waits for the callback on the next work item. On real Dataflow those IDs often don't come back, so the callbacks never run and the job hangs.
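The hang described above can be illustrated with a toy model. This is only a sketch in plain Java: `FinalizerHangSketch`, `runUntilFinalized`, `echoFinalizeIds`, and `maxBundles` are hypothetical names made up for illustration, and nothing here is Beam or Windmill code. The bundle limit stands in for the 900-second test timeout.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the hang; none of these names are Beam or Windmill APIs. */
class FinalizerHangSketch {

  /**
   * Simulates an SDF that registers one finalizer callback per bundle and resumes
   * until some callback has run. Returns the bundle number at which the DoFn
   * finishes, or -1 if maxBundles is reached first (the analogue of the timeout).
   *
   * @param echoFinalizeIds whether the backend echoes finalize IDs on the next work item
   */
  static int runUntilFinalized(boolean echoFinalizeIds, int maxBundles) {
    Map<Long, Runnable> pendingCallbacks = new HashMap<>();
    Queue<Long> echoedIds = new ArrayDeque<>();
    boolean[] finalized = {false};

    for (int bundle = 1; bundle <= maxBundles; bundle++) {
      // Start of a work item: run callbacks for any IDs the backend echoed back.
      while (!echoedIds.isEmpty()) {
        Runnable callback = pendingCallbacks.remove(echoedIds.poll());
        if (callback != null) {
          callback.run();
        }
      }
      if (finalized[0]) {
        return bundle; // the DoFn sees the flag, outputs, and stops
      }
      // Otherwise register a callback for this bundle and checkpoint (resume).
      long finalizeId = bundle;
      pendingCallbacks.put(finalizeId, () -> finalized[0] = true);
      // The commit succeeds either way; only the echo is best effort.
      if (echoFinalizeIds) {
        echoedIds.add(finalizeId);
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println("echoed:  " + runUntilFinalized(true, 1000));
    System.out.println("dropped: " + runUntilFinalized(false, 1000));
  }
}
```

With the echo enabled the loop finishes on the second bundle. With the echo dropped it never finishes and hits the limit, which matches the failure mode of the two tests.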
This got worse after bundle finalizer support was added for the streaming legacy worker:
#37723 added bundle finalizer support to the non-portable Dataflow worker. The streaming ValidatesRunner tests were initially excluded because they were still failing.
#37954 moved applied_finalize_ids handling and re-enabled the UsesBundleFinalizer tests for legacy streaming ValidatesRunner. That helped in some cases, but SDF tests could still hang on real Dataflow when finalize IDs were not echoed back.
#38287 was my previous PR, where I tried to stabilize SplittableDoFnTest on the test side (thread safety, timeouts), but that did not fix the worker behavior.
So when Windmill acknowledges a successful commit (CommitStatus.OK), we now run the bundle finalizer callbacks right away, using the finalize_ids from that commit. A successful commit ACK means the state was persisted, so this is the same idea as applied_finalize_ids, but without relying on a later best-effort GetWork.
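A minimal sketch of that idea, assuming a simplified shape for the commit result. The `CommitAckFinalizerSketch` class, its two-value `CommitStatus` enum, and the `register` and `onCommitComplete` methods are stand-ins invented for illustration, not the worker's actual classes or signatures. The sketch also runs callbacks inline for brevity, whereas the review summary says the worker queues the applied finalize IDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of running finalizers on commit ACK; names are illustrative only. */
class CommitAckFinalizerSketch {

  /** Stand-in for the commit status; only the values this sketch needs. */
  enum CommitStatus { OK, ABORTED }

  /** Commit result that carries the finalize IDs taken from the commit request. */
  static final class CompleteCommit {
    final CommitStatus status;
    final List<Long> finalizeIds;

    CompleteCommit(CommitStatus status, List<Long> finalizeIds) {
      this.status = status;
      this.finalizeIds = Collections.unmodifiableList(new ArrayList<>(finalizeIds));
    }
  }

  private final Map<Long, Runnable> callbacks = new ConcurrentHashMap<>();

  /** Remembers a callback until the commit that carries its finalize ID is acknowledged. */
  void register(long finalizeId, Runnable callback) {
    callbacks.put(finalizeId, callback);
  }

  /** Runs and removes the callbacks of a successful commit; returns the IDs that ran. */
  List<Long> onCommitComplete(CompleteCommit commit) {
    List<Long> ran = new ArrayList<>();
    if (commit.status != CommitStatus.OK) {
      return ran; // state was not persisted, so nothing is finalized
    }
    for (long id : commit.finalizeIds) {
      // remove() ensures each callback runs at most once in this sketch.
      Runnable callback = callbacks.remove(id);
      if (callback != null) {
        callback.run();
        ran.add(id);
      }
    }
    return ran;
  }

  public static void main(String[] args) {
    CommitAckFinalizerSketch worker = new CommitAckFinalizerSketch();
    worker.register(7L, () -> System.out.println("finalized bundle 7"));
    worker.onCommitComplete(
        new CompleteCommit(CommitStatus.OK, Collections.singletonList(7L)));
  }
}
```

Removing the callback before running it means a second ACK for the same ID does nothing. How the real worker deduplicates against IDs that are also echoed on a later work item is not shown here.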