[fix](be) Avoid finalized pipeline task submit crash#64899
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29197 ms |
TPC-DS: Total hot run time: 171345 ms |
ClickBench: Total hot run time: 25.07 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one test coverage gap in the new scheduler lifecycle protection.
Critical checkpoint conclusions:
- Goal: the code adds a submit gate to avoid late HybridTaskScheduler::submit() calling is_blockable() after terminal task cleanup starts. The production path is small and focused, but the added test only proves post-finalize rejection.
- Concurrency/lifecycle: the relevant race is dependency wake-up submit versus terminal close/finalize. The new lock/gate is aimed at HybridTaskScheduler::submit(); no deadlock or normal scheduling-order regression was found in the inspected paths, but close-time behavior needs direct coverage.
- Config/compatibility/persistence/data writes: no new config, wire/storage compatibility, transaction, or persistence behavior is involved.
- Parallel paths: reviewed wake_up, dependency set_ready/is_blocked_by, self-resubmit, RevokableTask, and blockable join/aggregate/spill paths; no separate correctness issue found.
- Tests: TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT should also cover the close-time gate before finalize, otherwise the close() change can regress while this test still passes.
Validation:
- git diff --check for the four GitHub PR files passed.
- build-support/check-format.sh could not run to completion because the runner clang-format is not v16.
- ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT failed during setup because thirdparty/installed/bin/protoc is missing.
User focus: no additional user-provided review focus was supplied.
Subagent conclusions: optimizer-rewrite reported no candidates. tests-session-config reported TSC-1, which became the single inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for this same final comment set.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task. HybridTaskScheduler::submit() synchronously calls PipelineTask::is_blockable() before enqueueing the task, and is_blockable() reads _sink and _operators. After close/finalize starts releasing terminal task resources, a late submit can therefore dereference cleared task resources and crash.
This change adds a small submit gate on PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, and HybridTaskScheduler::submit() checks the gate under the same lock before calling is_blockable(). The lock only covers the blockable check and does not extend to the actual queue submit.
### Release note
None
### Check List (For Author)
- Test: Unit Test / Static check
- ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT
- build-support/check-format.sh
- CLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASAN
- git diff --check
- Behavior changed: No
- Does this need documentation: No
0e148ec to
1dd3207
Compare
|
/review |
There was a problem hiding this comment.
Automated review summary for current head 1dd3207f04a7311ddec655990fca58d68b0b0b9d.
Opinion: I found no additional blocking issue to raise on the current head. The existing inline thread about the finalized-only test coverage should not be duplicated; the current head adds the close-before-finalize hybrid-submit coverage that thread requested.
Critical checkpoint conclusions:
- Goal and test proof: the PR prevents
HybridTaskScheduler::submit()from callingPipelineTask::is_blockable()after terminal close/finalize begins. The gate is closed fromclose(close_sink=true)andfinalize(), and hybrid submit checks it under_blockable_check_lockbefore reading_sink/_operators. The new close-before-finalize and finalized-task BE tests cover the intended submit rejection cases. - Scope: the change is small and focused on BE pipeline task lifecycle/scheduler handling plus BE unit coverage. I saw no new config, persistence, FE/BE protocol, storage-format, or transaction surface.
- Concurrency and lifecycle:
_blockable_check_lockserializes the accept-submit gate withHybridTaskScheduler::submit(), while the existing_dependency_lifecycle_lockstill protects forced dependency unblocking against close/finalize resource clearing. In the production close path,_runningremains true throughclose_task()/finalize(), so a task queued just before the gate closes is requeued until finalization and then skipped. - Parallel paths: the hybrid scheduler applies the gate before choosing its internal blocking/simple queue.
RevokableTask::is_blockable()remains a separate virtual path and does not dereference the underlying task's sink/operators. - Tests and validation:
git diff --checkis clean for the changed files. I did not run BE UT/build because this runner lacksthirdparty/installedandthirdparty/installed/bin/protoc; Doris format validation is also blocked here because the availableclang-formatis 18.1.3 while the repo requires clang-format 16. - User focus: no additional user-provided focus points were present.
Subagent conclusions: optimizer-rewrite reported no new valuable findings after checking that optimizer/Nereids risks are out of scope and reviewing analogous lifecycle/parallel submit paths. tests-session-config reported no new valuable findings after checking the added tests, duplicate context, and basic whitespace/tooling state. Final convergence round 1 ended with both subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger and empty proposed inline comment set.
|
run buildall |
TPC-H: Total hot run time: 29431 ms |
TPC-DS: Total hot run time: 172007 ms |
ClickBench: Total hot run time: 25.15 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task.
HybridTaskScheduler::submit()synchronously callsPipelineTask::is_blockable()before enqueueing, andis_blockable()reads_sinkand_operators. After terminal close/finalize starts releasing those resources, a late submit can dereference cleared task resources and crash.This PR adds a small submit gate on
PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, andHybridTaskScheduler::submit()checks it under the same lock before callingis_blockable(). The lock only covers the blockable check; it does not cover the actual queue submit.Release note
None
Check List (For Author)
./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMITbuild-support/check-format.shCLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASANgit diff --check