[enhancement](Multi-stage lm) Multi-Stage Predicate Lazy Materialization #64891
Open
nooneuse wants to merge 10 commits into
added 10 commits
June 24, 2026 21:25
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Contributor, Author
run buildall
Contributor
FE UT Coverage Report: Increment line coverage
Contributor
TPC-H: Total hot run time: 29079 ms
Contributor
TPC-DS: Total hot run time: 171950 ms
Contributor
ClickBench: Total hot run time: 25.39 s
Contributor
BE Regression && UT Coverage Report: Increment line coverage; Increment coverage report
Contributor
FE Regression Coverage Report: Increment line coverage
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
- [...] SegmentIterator) to reduce unnecessary predicate-column reads and predicate evaluation cost.
Release note
- [...] SegmentIterator, splitting predicate evaluation into Stage1/Stage2 and exposing profile counters (e.g. PredicateLMStage1InputRows, PredicateLMStage2ByRowIdsBatches, PredicateLMStage2ByAllRowsBatches).
- enable_multi_stage_predicate_lm (bool): enable/disable multi-stage predicate LM.
- predicate_lm_stage1_cols (string): optionally specify Stage1 predicate columns.
- predicate_lm_stage1_survival_ratio_threshold (double): threshold to choose the Stage2 strategy (by-rowids vs by-all-rows).
- predicate_lm_stage1_cols parsing and scoping:
  - col / table.col / db.table.col to target a specific scan in multi-table queries.
  - Invalid or non-matching tokens in predicate_lm_stage1_cols are ignored instead of failing the query, avoiding multi-table query failures caused by schema differences.
  - db_name for accurate db.table.col matching without changing table_name formatting; FE populates db_name and BE consumes it.
Detailed Description
Multi-Stage Predicate Lazy Materialization
1. Overview (Summary)
Multi-Stage Predicate Lazy Materialization (multi-stage predicate LM) is a storage-layer scan optimization.
It splits “predicate column reading + predicate evaluation” into two stages (Stage1/Stage2). By “reading only a small subset of predicate columns first for coarse filtering, then evaluating the remaining predicates on the surviving rows”, it reduces unnecessary column reads and computation overhead.
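The two-stage flow described above can be illustrated with a small in-memory model. This is only a sketch in Python, not the BE's C++ implementation in SegmentIterator: the function `two_stage_filter`, the dictionary-of-lists column layout, and the sample data are all hypothetical, and every predicate is assumed to be ANDed.

```python
from typing import Callable, Dict, List

Column = List[int]
Predicate = Callable[[int], bool]


def two_stage_filter(
    columns: Dict[str, Column],
    stage1_preds: Dict[str, Predicate],
    stage2_preds: Dict[str, Predicate],
) -> List[int]:
    """Return the row ids that satisfy every predicate (all predicates are ANDed)."""
    num_rows = len(next(iter(columns.values())))

    # Stage1: touch only the Stage1 predicate columns and filter coarsely.
    survivors = list(range(num_rows))
    for col, pred in stage1_preds.items():
        data = columns[col]
        survivors = [rid for rid in survivors if pred(data[rid])]

    # Stage2: evaluate the remaining predicates on the surviving rows only.
    for col, pred in stage2_preds.items():
        if not survivors:
            break  # nothing survived, so the later columns need not be touched
        data = columns[col]
        survivors = [rid for rid in survivors if pred(data[rid])]
    return survivors


# Hypothetical data: "a" is the cheap, selective Stage1 column.
columns = {
    "a": [1, 5, 1, 7, 1],
    "b": [10, 20, 30, 40, 50],
}
rows = two_stage_filter(
    columns,
    stage1_preds={"a": lambda v: v == 1},
    stage2_preds={"b": lambda v: v > 15},
)
print(rows)  # [2, 4]
```

In this model, column `b` is only consulted for the three rows that pass the predicate on `a`, which is the saving the feature aims for when Stage1 is selective.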
At the current stage, this feature is configured manually and is intended to behave like a hint that influences execution behavior. In a future PR, the FE will use statistics to choose suitable columns automatically.
Key goals:
Behavior at a Glance
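Stage2 has two strategies:
- stage2-by-rowids: read the late predicate columns only for the rows that survived Stage1 (preferred when the Stage1 survival ratio is low).
- stage2-by-all-rows: read the late predicate columns for all rows (preferred when the Stage1 survival ratio is high).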
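The choice between the two Stage2 strategies follows the threshold rule given in section 2.2.3 (survival_ratio <= threshold prefers by-rowids, otherwise by-all-rows). A minimal Python sketch of that rule is below; the function name `choose_stage2_strategy`, the definition of the survival ratio as Stage1 output rows divided by Stage1 input rows, and the threshold values are assumptions made for illustration, not the BE implementation.

```python
def choose_stage2_strategy(stage1_input_rows: int,
                           stage1_output_rows: int,
                           threshold: float) -> str:
    """Pick a Stage2 strategy from the Stage1 survival ratio.

    Assumption: survival_ratio = Stage1 output rows / Stage1 input rows.
    """
    if stage1_input_rows <= 0:
        raise ValueError("Stage1 must have seen at least one input row")
    survival_ratio = stage1_output_rows / stage1_input_rows
    if survival_ratio <= threshold:
        # Few survivors: fetch the late predicate columns by row id.
        return "stage2-by-rowids"
    # Many survivors: read the late predicate columns for all rows.
    return "stage2-by-all-rows"


# Hypothetical threshold of 0.5, chosen only for the example.
print(choose_stage2_strategy(1000, 20, 0.5))   # stage2-by-rowids
print(choose_stage2_strategy(1000, 900, 0.5))  # stage2-by-all-rows
```

The two inputs correspond to what the profile counters PredicateLMStage1InputRows and PredicateLMStage1OutputRows report, so the same arithmetic can be used to sanity-check which strategy a profile should show.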
2. Usage and Configuration (with SQL Cases)
2.1 Prerequisites
- set enable_profile=true;
- Use EXPLAIN / PROFILE / show profile to inspect metrics.
2.2 Configuration Options
2.2.1 enable_multi_stage_predicate_lm
set enable_multi_stage_predicate_lm = true;
2.2.2 predicate_lm_stage1_cols
- a
- a,b
- a ,``b``, a (whitespace/backticks/duplicates are allowed)
- table.col
- db.table.col
- table / db.table refers to the real base table name, not a SQL alias.
- If a scoped token does not match the current scan (e.g. the scan is on t1 but the config includes t2.a), that token will be ignored.
- If predicate_lm_stage1_cols is empty (or becomes empty after ignoring invalid / non-matching tokens):
  - The feature is not applied to that scan (as if enable_multi_stage_predicate_lm=false for that scan). In this case, Stage2 will not happen.
  - Set predicate_lm_stage1_cols with at least one valid column for the target scan.
set predicate_lm_stage1_cols = 'a';
set predicate_lm_stage1_cols = ' a ,``b``, a ';
set predicate_lm_stage1_cols = 'lineitem.l_shipdate';
set predicate_lm_stage1_cols = 'tpch.lineitem.l_shipdate';
2.2.3 predicate_lm_stage1_survival_ratio_threshold
- survival_ratio <= threshold → prefer stage2-by-rowids
- survival_ratio > threshold → prefer stage2-by-all-rows
2.3 Recommended Validation (via Profile Metrics)
In the SegmentIterator block of the profile, focus on:
- PredicateLMStage1InputRows: number of input rows to Stage1
- PredicateLMStage1OutputRows: number of output rows from Stage1 (rows surviving Stage1 filtering)
- PredicateLMStage2ByRowIdsBatches: number of batches where Stage2 was triggered in by-rowids mode
- PredicateLMStage2ByAllRowsBatches: number of batches where Stage2 was triggered in by-all-rows mode
- PredicateLMStage2RowsRead: total rows read by Stage2 (semantics differ between by-rowids and by-all-rows)
How to tell whether multi-stage predicate LM is enabled and effective:
- PredicateLMStage1InputRows > 0 indicates the scan entered the Stage1 path
- PredicateLMStage2ByRowIdsBatches > 0 or PredicateLMStage2ByAllRowsBatches > 0 indicates Stage2 actually happened
2.4 SQL Case Examples (Typical Trigger Paths)
The following examples use table tbl_multi_stage_predicate_lm(k,a,b).
Case A: Baseline (Feature Off)
Case B: Feature On + Stage2-by-rowids
Goal: Stage1 has a low survival ratio, so Stage2 reads late predicate columns by rowids.
Expected observation:
PredicateLMStage2ByRowIdsBatches > 0
Case C: Feature On + Stage2-by-all-rows
Goal: Stage1 has a high survival ratio, so Stage2 reads late predicate columns for all rows.
Expected observation:
PredicateLMStage2ByAllRowsBatches > 0
Case D: Scoped to Table / DB.Table (Recommended for Multi-Table Queries)
Case E: Invalid Column / Mismatched Scope Will Not Fail (Ignored)
Rationale: To avoid multi-table queries failing due to schema differences, invalid column names or mismatched scoped tokens will be ignored.
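The parsing, scoping, and ignore-on-mismatch rules for predicate_lm_stage1_cols (section 2.2.2 and Case E) can be sketched per scan as follows. This is a hypothetical Python model rather than the FE/BE code: the function `resolve_stage1_cols`, its parameters, and the choice of exact, case-sensitive name matching are assumptions made for illustration.

```python
from typing import List, Set


def resolve_stage1_cols(spec: str, db: str, table: str, schema: Set[str]) -> List[str]:
    """Resolve a predicate_lm_stage1_cols string for one scan of db.table.

    Returns the Stage1 columns that apply to this scan. An empty result
    models the case where the feature is not applied to the scan.
    Assumption: names are compared exactly (case-sensitive).
    """
    resolved: List[str] = []
    for raw in spec.split(","):
        # Strip whitespace and backticks around each dot-separated part.
        parts = [p.strip().strip("`").strip() for p in raw.split(".")]
        if not all(parts) or len(parts) > 3:
            continue  # empty or malformed token: ignored, never an error
        col, scope = parts[-1], parts[:-1]
        if len(scope) == 1 and scope[0] != table:
            continue  # table.col aimed at a different table
        if len(scope) == 2 and scope != [db, table]:
            continue  # db.table.col aimed at a different db or table
        if col not in schema:
            continue  # column does not exist in this scan's table
        if col not in resolved:
            resolved.append(col)  # duplicates are allowed and collapsed
    return resolved


schema = {"k", "a", "b"}
print(resolve_stage1_cols(" a ,``b``, a ", "db1", "t1", schema))  # ['a', 'b']
print(resolve_stage1_cols("t2.a, no_such_col", "db1", "t1", schema))  # []
```

Returning an empty list instead of raising mirrors the stated rationale: a token that is valid for one table in a multi-table query should not make the scan of another table fail.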
Note:
3. Applicable Scenarios
This feature is best suited for scenarios below (the more conditions are met, the more likely you will see gains):
Recommendations before enabling broadly:
- Set enable_profile=true and verify that the SegmentIterator block metrics PredicateLMStage1* / PredicateLMStage2* are actually hit.
Typical example:
WHERE a = const AND (b = const OR c IN (...)) AND d > const ...
where a is highly selective and cheap, making it a good Stage1 candidate.
4. Risks and Notes
4.1 Performance Regression Risk (Misconfiguration)
Recommendations:
4.2 Random Read Risk (Misconfiguration)
- survival_ratio_threshold exists: when the survival ratio is high, prefer all-rows reads to avoid random I/O.
Recommendations:
4.3 Multi-Table Query Considerations (Stage1 Column Scope)
- An unscoped column name (e.g. a) may affect multiple scans at once (if those tables share the same column name).
- Use table.col / db.table.col to scope the configuration to the target table and avoid unintended effects.
- predicate_lm_stage1_cols does not take effect on the external scan.
4.4 Silent Misconfiguration Risk
Recommendations:
- Set predicate_lm_stage1_cols with at least one valid column for the target scan.
5. Performance Data
WIP
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)