Skip to content

[enhancement](Multi-stage lm) Multi-Stage Predicate Lazy Materialization#64891

Open
nooneuse wants to merge 10 commits into
apache:masterfrom
nooneuse:multi_stage_predicate_lm
Open

[enhancement](Multi-stage lm) Multi-Stage Predicate Lazy Materialization#64891
nooneuse wants to merge 10 commits into
apache:masterfrom
nooneuse:multi_stage_predicate_lm

Conversation

@nooneuse

@nooneuse nooneuse commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

At the current stage, this feature is manually configured. It is intended to behave like a hint to influence execution behavior. In the future PR, the FE will leverage statistics to automatically choose suitable columns.

  • This PR introduces Multi-Stage Predicate Lazy Materialization (multi-stage predicate LM) (Inspired by the ClickHouse Prewhere function concept.) in the BE storage scan path (SegmentIterator) to reduce unnecessary predicate-column reads and predicate evaluation cost.
  • In many OLAP queries with multiple conjuncts, evaluating all predicates requires reading multiple predicate columns up-front. When some predicate columns are expensive to read/decode, this can cause avoidable I/O and CPU overhead even if many rows could be filtered out early.
  • Multi-stage predicate LM splits predicate evaluation into two phases:
    • Stage1 evaluates a selected subset of predicate columns first for coarse filtering.
    • Stage2 evaluates the remaining predicate columns only on the surviving rows, using either:
      • by-rowids (selective Stage1 → read late predicate columns by rowids to reduce rows read), or
      • by-all-rows (high survival ratio → scan all rows for late predicate columns to avoid excessive random reads).
  • The feature is manually controlled via session variables / per-statement overrides (hint-like behavior), enabling users to experiment and validate benefits with profile metrics before any future cost-based automation.

Release note

  • Added multi-stage predicate lazy materialization in BE SegmentIterator, splitting predicate evaluation into Stage1/Stage2 and exposing profile counters (e.g. PredicateLMStage1InputRows, PredicateLMStage2ByRowIdsBatches, PredicateLMStage2ByAllRowsBatches).
  • Added/exposed session variables:
    • enable_multi_stage_predicate_lm (bool): enable/disable multi-stage predicate LM.
    • predicate_lm_stage1_cols (string): optionally specify Stage1 predicate columns.
    • predicate_lm_stage1_survival_ratio_threshold (double): threshold to choose Stage2 strategy (by-rowids vs by-all-rows).
  • Enhanced predicate_lm_stage1_cols parsing and scoping:
    • Tolerates whitespace/backticks/duplicates.
    • Supports scoped identifiers col / table.col / db.table.col to target a specific scan in multi-table queries.
    • Uses scan context (db/table name; rollup suffix handling) for accurate scoped matching.
  • Improved robustness and safer defaults:
    • Unknown columns or non-matching scoped tokens in predicate_lm_stage1_cols are ignored instead of failing the query, avoiding multi-table query failures caused by schema differences.
    • When there is no runtime filter and the effective Stage1 column list is empty, the scan falls back to single-stage predicate evaluation (equivalent to multi-stage disabled for that scan) to avoid unpredictable “pick an arbitrary predicate column as Stage1” behavior and potential regressions.
  • Extended scan-node thrift payload to carry db_name for accurate db.table.col matching without changing table_name formatting; FE populates db_name and BE consumes it.
  • Updated BE unit tests, regression tests, and documentation accordingly.

Detailed Description

Multi-Stage Predicate Lazy Materialization

1. Overview (Summary)

Multi-Stage Predicate Lazy Materialization (multi-stage predicate LM) is a storage-layer scan optimization.
It splits “predicate column reading + predicate evaluation” into two stages (Stage1/Stage2). By “reading only a small subset of predicate columns first for coarse filtering, then evaluating the remaining predicates on the surviving rows”, it reduces unnecessary column reads and computation overhead.

At the current stage, this feature is manually configured. It is intended to behave like a hint to influence execution behavior. In the future PR, the FE will leverage statistics to automatically choose suitable columns.

Key goals:

  • Reduce total predicate-column I/O (especially when some columns are expensive to read/deserialize, or there are many predicate columns)
  • Reduce predicate evaluation cost (vectorized evaluation / short-circuit evaluation)
  • Reduce I/O in cases where predicates are selective

Behavior at a Glance

  • Stage1: Read the configured stage1 predicate columns (plus any required delete-condition columns, runtime filters, etc.) and perform the first round of filtering.
  • Stage2: Perform the second round of filtering using the “remaining predicate columns” that are not included in Stage1 (late predicate columns).
    Stage2 has two strategies:
    • by-rowids: If the survival ratio after Stage1 is low, read Stage2 predicate columns only for the rowids produced by Stage1 (random/rowid-based reads) to minimize the number of rows read.
    • by-all-rows: If the survival ratio after Stage1 is high, read Stage2 predicate columns for all rows (sequential/index-order reads) to avoid the seek overhead caused by many random rowid reads.

2. Usage and Configuration (with SQL Cases)

2.1 Prerequisites

  • Note: This feature works on the BE storage scan path (SegmentIterator) and is typically controlled by FE session variables forwarded to BE.
  • For stronger validation or observability, you can enable profiling:
    • set enable_profile=true;
    • Use EXPLAIN / PROFILE / show profile to inspect metrics.

2.2 Configuration Options

2.2.1 enable_multi_stage_predicate_lm

  • Type: bool
  • Default: false
  • Purpose: Enable/disable multi-stage predicate LM.
  • Example:
    • set enable_multi_stage_predicate_lm = true;

2.2.2 predicate_lm_stage1_cols

  • Type: string
  • Default: empty string (meaning: Stage1 columns are not explicitly specified)
  • Purpose: Specify which columns’ predicates should be evaluated in Stage1 (coarse filtering).
  • Notes:
    • The value is a comma-separated column list string, e.g.:
      • a
      • a,b
      • a ,``b``, a (whitespace/backticks/duplicates are allowed)
    • It also supports scoping to a specific table / db.table to precisely select which scan operator should apply the stage1 column configuration in multi-table queries:
      • table.col
      • db.table.col
    • Important:
      • Here table / db.table refers to the real base table name, not a SQL alias.
      • If a scoped token does not match the current scan (e.g., the scan is on t1 but the config includes t2.a), that token will be ignored.
      • If a specified column does not exist (or does not exist in the current scan table schema), it will not fail; the token will be ignored (to avoid multi-table queries failing due to schema differences).
    • Default behavior when predicate_lm_stage1_cols is empty (or becomes empty after ignoring invalid / non-matching tokens):
      • If runtime filter predicate columns are available for the scan, SegmentIterator will use those runtime filter columns as Stage1.
      • If there is no runtime filter column available, the implementation falls back to single-stage predicate evaluation (equivalent to enable_multi_stage_predicate_lm=false for that scan). In this case, Stage2 will not happen.
      • If you want multi-stage behavior for queries without runtime filters, explicitly configure predicate_lm_stage1_cols with at least one valid column for the target scan.
  • Examples:
    • set predicate_lm_stage1_cols = 'a';
    • set predicate_lm_stage1_cols = ' a ,``b``, a ';
    • set predicate_lm_stage1_cols = 'lineitem.l_shipdate';
    • set predicate_lm_stage1_cols = 'tpch.lineitem.l_shipdate';

2.2.3 predicate_lm_stage1_survival_ratio_threshold

  • Type: double
  • Default: 0.8 (storage-side default threshold)
  • Purpose: The Stage1 survival ratio threshold used to decide the Stage2 strategy:
    • survival_ratio <= threshold → prefer stage2-by-rowids
    • survival_ratio > threshold → prefer stage2-by-all-rows

2.3 Recommended Validation (via Profile Metrics)

In the SegmentIterator block of the profile, focus on:

  • PredicateLMStage1InputRows: number of input rows to Stage1
  • PredicateLMStage1OutputRows: number of output rows from Stage1 (rows surviving Stage1 filtering)
  • PredicateLMStage2ByRowIdsBatches: number of batches where Stage2 was triggered in by-rowids mode
  • PredicateLMStage2ByAllRowsBatches: number of batches where Stage2 was triggered in by-all-rows mode
  • PredicateLMStage2RowsRead: total rows read by Stage2 (semantics differ between by-rowids vs by-all-rows)

How to tell whether multi-stage predicate LM is enabled and effective:

  • PredicateLMStage1InputRows > 0 indicates the scan entered the Stage1 path
  • PredicateLMStage2ByRowIdsBatches > 0 or PredicateLMStage2ByAllRowsBatches > 0 indicates Stage2 actually happened
  • If both Stage2 batch counters stay at 0, it means Stage2 did not happen (either because it was not needed, or because the scan fell back to single-stage behavior)

2.4 SQL Case Examples (Typical Trigger Paths)

The following examples use table tbl_multi_stage_predicate_lm(k,a,b).

Case A: Baseline (Feature Off)

set enable_multi_stage_predicate_lm = false;
set predicate_lm_stage1_cols = '';

select count(*) from tbl_multi_stage_predicate_lm where a = 1 and b = 2;

Case B: Feature On + Stage2-by-rowids

Goal: Stage1 has a low survival ratio, so Stage2 reads late predicate columns by rowids.

set enable_profile=true;
set enable_multi_stage_predicate_lm = true;
set predicate_lm_stage1_cols = 'a';

-- a=1 is selective, Stage1 survival ratio is low, prefer stage2-by-rowids
select /* rowids_case */ count(*)
from tbl_multi_stage_predicate_lm
where a = 1 and b = 2;

Expected observation:

  • PredicateLMStage2ByRowIdsBatches > 0

Case C: Feature On + Stage2-by-all-rows

Goal: Stage1 has a high survival ratio, so Stage2 reads late predicate columns for all rows.

set enable_profile=true;
set enable_multi_stage_predicate_lm = true;
set predicate_lm_stage1_cols = 'a';

-- Stage1 has a high survival ratio (e.g. 95%), prefer stage2-by-all-rows
select /* allrows_case */ count(*)
from tbl_multi_stage_predicate_lm
where a < 19 and b = 2;

Expected observation:

  • PredicateLMStage2ByAllRowsBatches > 0

Case D: Scoped to Table / DB.Table (Recommended for Multi-Table Queries)

-- Only applies to column `col` of table `table_name`
set predicate_lm_stage1_cols = 'table_name.col';

-- Only applies to column `col` of table `db_name.table_name`
set predicate_lm_stage1_cols = 'db_name.table_name.col';

Case E: Invalid Column / Mismatched Scope Will Not Fail (Ignored)

Rationale: To avoid multi-table queries failing due to schema differences, invalid column names or mismatched scoped tokens will be ignored.

-- Non-existing column: no error; token will be ignored
set predicate_lm_stage1_cols = 'not_exist';

-- Scoped to another table: no error; token will be ignored
set predicate_lm_stage1_cols = 'other_table.a';

select count(*) from tbl_multi_stage_predicate_lm where a = 1 and b = 2;

Note:

  • If the effective Stage1 column list becomes empty after ignoring invalid / non-matching tokens, and there is no runtime filter column available, the scan will fall back to single-stage predicate evaluation (Stage2 will not happen). Use profile metrics to confirm whether Stage2 is triggered.

3. Applicable Scenarios

This feature is best suited for scenarios below (the more conditions are met, the more likely you will see gains):

  • Many predicate columns (many AND conjuncts), and some of those columns are expensive to read/decode
  • You can choose one or more “cheap and highly selective” columns as Stage1 (e.g., highly selective equality predicates, low-cost types)
  • Wide tables with many non-predicate columns, and lazy materialization is enabled/beneficial (filter first, then read non-predicate columns)
  • Queries have meaningful filtering (not a full scan or a high-survival scan with little filtering)
  • Runtime filters can serve as fast Stage1 filters (depending on implementation strategy)

Recommendations before enabling broadly:

  • Enable enable_profile=true and verify that the SegmentIterator block metrics PredicateLMStage1* / PredicateLMStage2* are actually hit.
  • Run a one-time A/B profile comparison on critical queries to confirm the effect.

Typical example:

  • WHERE a = const AND (b = const OR c IN (...)) AND d > const ...
    where a is highly selective and cheap, making it a good Stage1 candidate.

4. Risks and Notes

4.1 Performance Regression Risk (Misconfiguration)

  • If you put “low-selectivity / high-cost” columns into Stage1, Stage1 read/eval cost may increase and offset gains (or even regress).
  • If Stage1 survival ratio stays high but Stage2 is still frequently triggered, total read volume may approach or exceed the baseline (especially with stage2-by-all-rows).

Recommendations:

  • Prefer high-selectivity, low read-cost columns in Stage1.
  • Use A/B profiling on common query patterns before finalizing the stage1 column strategy.

4.2 Random Read Risk (Misconfiguration)

  • When stage2-by-rowids is chosen, rowid-based reads may increase seeks and can be unfriendly under certain column/page layouts.
  • This is why survival_ratio_threshold exists: when survival ratio is high, prefer all-rows reads to avoid random I/O.

Recommendations:

  • Tune the threshold reasonably and validate with real data distributions.

4.3 Multi-Table Query Considerations (Stage1 Column Scope)

  • If a query scans multiple internal OLAP tables:
    • Using an unqualified column name (e.g. a) may affect multiple scans at once (if those tables share the same column name).
    • Prefer table.col / db.table.col to scope the configuration to the target table and avoid unintended effects.
  • If a query joins with External Catalog tables:
    • The external scan path does not support multi-stage predicate LM, so predicate_lm_stage1_cols does not take effect on the external scan.

4.4 Silent Misconfiguration Risk

  • Invalid column names or non-matching scoped tokens are ignored (no error). This avoids query failures but may also cause the configuration to have no effect.
  • If the effective Stage1 column list becomes empty and there is no runtime filter column available, the scan may fall back to single-stage behavior.

Recommendations:

  • Enable profiling and verify Stage1/Stage2 metrics to confirm the feature is effective for the target query.
  • For queries without runtime filters, explicitly configure predicate_lm_stage1_cols with at least one valid column for the target scan.

5. Performance Data

WIP

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@nooneuse

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 71.43% (10/14) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 29079 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f1d5dea50fa34b5bf71cf76b246aa618c90c1931, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17640	3998	4018	3998
q2	2052	336	189	189
q3	10324	1424	816	816
q4	4686	468	339	339
q5	7501	859	569	569
q6	193	173	134	134
q7	819	849	630	630
q8	9381	1561	1571	1561
q9	5626	4531	4548	4531
q10	6837	1774	1569	1569
q11	445	282	239	239
q12	626	415	288	288
q13	18084	3351	2749	2749
q14	263	260	243	243
q15	q16	778	788	707	707
q17	1011	1013	924	924
q18	7081	5821	5603	5603
q19	1212	1206	1044	1044
q20	533	405	263	263
q21	5435	2631	2382	2382
q22	437	360	301	301
Total cold run time: 100964 ms
Total hot run time: 29079 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4323	4232	4267	4232
q2	320	349	219	219
q3	4580	5081	4428	4428
q4	2058	2155	1377	1377
q5	4479	4341	4334	4334
q6	231	178	126	126
q7	1752	1608	1683	1608
q8	2705	2228	2204	2204
q9	8345	8428	7994	7994
q10	4785	4749	4246	4246
q11	563	405	373	373
q12	730	778	552	552
q13	3226	3579	3013	3013
q14	310	307	276	276
q15	q16	711	735	644	644
q17	1350	1334	1332	1332
q18	7933	7233	7241	7233
q19	1164	1129	1129	1129
q20	2206	2193	1987	1987
q21	5261	4552	4500	4500
q22	538	454	405	405
Total cold run time: 57570 ms
Total hot run time: 52212 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 171950 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f1d5dea50fa34b5bf71cf76b246aa618c90c1931, data reload: false

query5	4366	640	493	493
query6	456	201	179	179
query7	4822	512	313	313
query8	352	185	171	171
query9	8780	4060	4051	4051
query10	483	321	285	285
query11	5936	2342	2126	2126
query12	160	103	100	100
query13	1289	615	434	434
query14	6294	5331	4983	4983
query14_1	4320	4270	4289	4270
query15	221	207	182	182
query16	1058	483	478	478
query17	1142	752	608	608
query18	2716	493	358	358
query19	222	199	157	157
query20	118	113	107	107
query21	215	140	122	122
query22	13735	13582	13371	13371
query23	17428	16442	16106	16106
query23_1	16356	16161	16241	16161
query24	7455	1789	1271	1271
query24_1	1327	1324	1342	1324
query25	575	467	391	391
query26	1297	314	177	177
query27	2638	577	328	328
query28	4439	2040	2016	2016
query29	1106	625	500	500
query30	311	241	205	205
query31	1113	1085	947	947
query32	166	62	64	62
query33	550	336	262	262
query34	1190	1132	653	653
query35	770	776	669	669
query36	1399	1405	1262	1262
query37	156	107	95	95
query38	1889	1733	1641	1641
query39	939	917	894	894
query39_1	885	863	861	861
query40	218	122	103	103
query41	64	63	63	63
query42	89	87	87	87
query43	320	317	276	276
query44	1414	799	768	768
query45	216	190	180	180
query46	1074	1256	748	748
query47	2360	2341	2284	2284
query48	416	418	262	262
query49	597	444	315	315
query50	1052	368	259	259
query51	4449	4364	4407	4364
query52	82	80	70	70
query53	253	254	188	188
query54	275	218	195	195
query55	74	72	66	66
query56	245	223	212	212
query57	1456	1392	1319	1319
query58	264	219	211	211
query59	1603	1649	1443	1443
query60	282	251	228	228
query61	152	146	148	146
query62	693	664	585	585
query63	232	191	186	186
query64	2503	763	602	602
query65	4867	4799	4799	4799
query66	1748	463	350	350
query67	28892	28702	28679	28679
query68	3228	1509	917	917
query69	417	306	270	270
query70	1033	943	936	936
query71	315	245	213	213
query72	2970	2598	2348	2348
query73	851	745	433	433
query74	5094	4971	4770	4770
query75	2583	2549	2167	2167
query76	2309	1204	782	782
query77	349	385	279	279
query78	12469	12619	11710	11710
query79	1531	1176	786	786
query80	1291	470	384	384
query81	502	275	238	238
query82	762	159	124	124
query83	354	276	247	247
query84	269	143	114	114
query85	940	519	411	411
query86	436	301	278	278
query87	1841	1835	1770	1770
query88	3707	2764	2748	2748
query89	435	390	331	331
query90	1947	180	181	180
query91	176	159	136	136
query92	66	61	57	57
query93	1558	1436	911	911
query94	762	363	281	281
query95	683	481	358	358
query96	1066	839	348	348
query97	2685	2697	2541	2541
query98	218	208	198	198
query99	1195	1211	1033	1033
Total cold run time: 258987 ms
Total hot run time: 171950 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
ClickBench: Total hot run time: 25.39 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f1d5dea50fa34b5bf71cf76b246aa618c90c1931, data reload: false

query1	0.01	0.00	0.00
query2	0.10	0.05	0.05
query3	0.29	0.14	0.13
query4	1.61	0.13	0.14
query5	0.25	0.21	0.22
query6	1.31	1.06	1.03
query7	0.05	0.01	0.01
query8	0.06	0.04	0.04
query9	0.39	0.34	0.31
query10	0.55	0.56	0.54
query11	0.20	0.15	0.14
query12	0.19	0.15	0.15
query13	0.47	0.47	0.48
query14	0.99	1.01	1.01
query15	0.62	0.60	0.59
query16	0.30	0.31	0.33
query17	1.09	1.10	1.13
query18	0.23	0.21	0.21
query19	2.08	1.94	2.00
query20	0.01	0.01	0.02
query21	15.44	0.20	0.14
query22	4.93	0.05	0.06
query23	16.13	0.32	0.13
query24	3.00	0.44	0.37
query25	0.13	0.05	0.05
query26	0.76	0.21	0.15
query27	0.04	0.04	0.05
query28	3.50	0.91	0.55
query29	12.55	4.45	3.57
query30	0.28	0.15	0.16
query31	2.78	0.60	0.31
query32	3.22	0.60	0.48
query33	3.23	3.19	3.19
query34	15.66	4.21	3.51
query35	3.49	3.51	3.56
query36	0.59	0.43	0.43
query37	0.10	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.08	0.03	0.04
query42	0.04	0.03	0.03
query43	0.06	0.03	0.03
Total cold run time: 97.09 s
Total hot run time: 25.39 s

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 87.39% (478/547) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.16% (28491/38418)
Line Coverage 58.02% (310409/535010)
Region Coverage 54.75% (259422/473802)
Branch Coverage 56.10% (112756/201003)

@hello-stephen

Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 71.43% (10/14) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants