[fix](be) Fix file cache queue evict size metrics #64897

Open
deardeng wants to merge 1 commit into apache:master from deardeng:fix-file-cache-metrics

Conversation

@deardeng

Collaborator

Problem Summary: File cache queue evict size metrics were constructed with literal array indexes, while increments use FileCacheType enum values. Because FileCacheType maps DISPOSABLE to 0 and INDEX to 2, file_cache_index_queue_evict_size actually counted disposable queue evictions, and file_cache_disposable_queue_evict_size counted index queue evictions. This change initializes the metrics array with explicit FileCacheType indexes so each bvar name matches the queue type that increments it.
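The mismatch described above can be sketched in isolation. This is a minimal illustration, not the code in be/src/io/cache/block_file_cache.cpp: `Metric`, `Metrics`, `to_index`, `make_positional`, `make_by_enum`, and `record_evict` are hypothetical stand-ins for the bvar-based members, the positional order in `make_positional` is an assumed layout that reproduces the reported symptom, and the `normal` and `ttl` metric names are assumed by analogy with the two names given in the summary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Enum values as reported in this PR (DISPOSABLE = 0, INDEX = 2).
enum class FileCacheType : std::size_t { DISPOSABLE = 0, NORMAL = 1, INDEX = 2, TTL = 3 };

// Hypothetical stand-in for a bvar counter: a name plus a running total.
struct Metric {
    std::string name;
    std::int64_t value;
};

using Metrics = std::array<Metric, 4>;

constexpr std::size_t to_index(FileCacheType t) { return static_cast<std::size_t>(t); }

// Before (assumed layout): names are listed positionally, in an order
// that does not follow the enum values.
Metrics make_positional() {
    return Metrics{{
        Metric{"file_cache_index_queue_evict_size", 0},       // slot 0 is DISPOSABLE
        Metric{"file_cache_normal_queue_evict_size", 0},      // slot 1 is NORMAL
        Metric{"file_cache_disposable_queue_evict_size", 0},  // slot 2 is INDEX
        Metric{"file_cache_ttl_queue_evict_size", 0},         // slot 3 is TTL
    }};
}

// After: each slot is selected by the enum value, so the name and the
// increment path always agree, whatever order the lines are written in.
Metrics make_by_enum() {
    Metrics m{};
    m[to_index(FileCacheType::INDEX)] = Metric{"file_cache_index_queue_evict_size", 0};
    m[to_index(FileCacheType::NORMAL)] = Metric{"file_cache_normal_queue_evict_size", 0};
    m[to_index(FileCacheType::DISPOSABLE)] = Metric{"file_cache_disposable_queue_evict_size", 0};
    m[to_index(FileCacheType::TTL)] = Metric{"file_cache_ttl_queue_evict_size", 0};
    return m;
}

// The increment path indexes by enum value in both versions.
void record_evict(Metrics& m, FileCacheType type, std::int64_t bytes) {
    m[to_index(type)].value += bytes;
}

// Sanity check: with enum-based construction, a slot carries its own queue's name.
void check_fixed_layout() {
    Metrics m = make_by_enum();
    assert(m[to_index(FileCacheType::INDEX)].name == "file_cache_index_queue_evict_size");
    assert(m[to_index(FileCacheType::DISPOSABLE)].name ==
           "file_cache_disposable_queue_evict_size");
}
```

With the positional layout, `record_evict(m, FileCacheType::DISPOSABLE, n)` adds to the counter named `file_cache_index_queue_evict_size`; with the enum-based layout the same call adds to `file_cache_disposable_queue_evict_size`.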

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng

Collaborator Author

run buildall

@deardeng

Collaborator Author

/review

@hello-stephen

Contributor
TPC-H: Total hot run time: 29730 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e2e5c8676eaa115f6d5da27cbd248c3e5f66ce47, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17801	4124	4050	4050
q2	2033	313	192	192
q3	10311	1409	826	826
q4	4680	478	339	339
q5	7526	873	585	585
q6	177	171	134	134
q7	810	855	637	637
q8	9386	1624	1673	1624
q9	5514	4535	4515	4515
q10	6749	1809	1578	1578
q11	435	281	247	247
q12	636	422	308	308
q13	18106	3461	2778	2778
q14	270	270	241	241
q15	q16	785	777	713	713
q17	1060	933	1009	933
q18	7203	5842	5806	5806
q19	1186	1299	1138	1138
q20	506	399	272	272
q21	5674	2708	2512	2512
q22	429	359	302	302
Total cold run time: 101277 ms
Total hot run time: 29730 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4351	4249	4288	4249
q2	315	354	218	218
q3	4593	5023	4403	4403
q4	2116	2197	1404	1404
q5	4468	4307	4273	4273
q6	241	179	130	130
q7	1712	1928	1858	1858
q8	2586	2251	2218	2218
q9	8366	8383	8222	8222
q10	4864	4855	4289	4289
q11	595	427	386	386
q12	786	808	566	566
q13	3275	3583	2953	2953
q14	293	304	260	260
q15	q16	709	731	650	650
q17	1352	1329	1452	1329
q18	7801	7333	7294	7294
q19	1180	1137	1134	1134
q20	2231	2226	1952	1952
q21	5315	4629	4496	4496
q22	522	469	427	427
Total cold run time: 57671 ms
Total hot run time: 52711 ms

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: File cache queue evict size metrics were constructed with literal array indexes, while increments use FileCacheType enum values. Because FileCacheType maps DISPOSABLE to 0 and INDEX to 2, file_cache_index_queue_evict_size actually counted disposable queue evictions, and file_cache_disposable_queue_evict_size counted index queue evictions. This change initializes the metrics array with explicit FileCacheType indexes so each bvar name matches the queue type that increments it.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran Doris commit precheck.
    - Ran git diff --check and git diff --cached --check.
    - Attempted build-support/check-format.sh, but the available clang-format is version 20/18 and the script requires version 16.
    - Attempted build-support/run-clang-tidy.sh --build-dir be/build_Release, but the current environment reports missing system header stddef.h and pre-existing clang-tidy errors in included/existing code.
- Behavior changed: Yes. File cache queue evict size bvar metrics now report the queue matching their metric names.
- Does this need documentation: No

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171798 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e2e5c8676eaa115f6d5da27cbd248c3e5f66ce47, data reload: false

query5	4316	623	489	489
query6	443	188	181	181
query7	4813	562	313	313
query8	335	183	161	161
query9	8770	4049	4036	4036
query10	453	324	261	261
query11	5874	2376	2170	2170
query12	162	102	105	102
query13	1284	600	442	442
query14	6277	5304	4984	4984
query14_1	4333	4306	4296	4296
query15	213	206	180	180
query16	1007	476	454	454
query17	978	735	603	603
query18	2447	479	353	353
query19	211	199	154	154
query20	132	114	111	111
query21	223	141	116	116
query22	13913	13602	13480	13480
query23	17534	16628	16210	16210
query23_1	16304	16257	16285	16257
query24	7524	1786	1296	1296
query24_1	1306	1312	1279	1279
query25	583	471	396	396
query26	1290	322	180	180
query27	2657	559	361	361
query28	4429	2056	2028	2028
query29	1087	632	515	515
query30	305	244	197	197
query31	1120	1092	962	962
query32	131	62	61	61
query33	548	331	285	285
query34	1169	1138	657	657
query35	769	788	689	689
query36	1395	1357	1189	1189
query37	157	110	139	110
query38	1897	1725	1659	1659
query39	921	913	892	892
query39_1	899	880	900	880
query40	219	118	101	101
query41	65	63	73	63
query42	96	86	86	86
query43	320	321	279	279
query44	1419	776	783	776
query45	208	185	176	176
query46	1056	1153	729	729
query47	2395	2345	2253	2253
query48	392	392	306	306
query49	584	427	314	314
query50	1004	376	257	257
query51	4457	4416	4344	4344
query52	81	79	69	69
query53	244	257	200	200
query54	266	220	201	201
query55	74	72	67	67
query56	250	227	212	212
query57	1427	1401	1321	1321
query58	246	212	197	197
query59	1532	1610	1440	1440
query60	278	245	237	237
query61	149	153	149	149
query62	695	645	584	584
query63	227	191	197	191
query64	2494	747	596	596
query65	4883	4766	4787	4766
query66	1776	512	328	328
query67	28965	28985	28724	28724
query68	3018	1560	922	922
query69	418	298	262	262
query70	1043	955	954	954
query71	298	233	208	208
query72	3053	2607	2349	2349
query73	859	802	442	442
query74	5136	4980	4782	4782
query75	2579	2548	2219	2219
query76	2322	1189	771	771
query77	349	383	295	295
query78	12467	12543	11882	11882
query79	1427	1133	748	748
query80	1257	468	391	391
query81	519	285	239	239
query82	924	161	126	126
query83	341	272	246	246
query84	310	146	116	116
query85	914	528	412	412
query86	418	296	285	285
query87	1858	1847	1771	1771
query88	3707	2826	2787	2787
query89	435	387	331	331
query90	1928	171	180	171
query91	173	163	132	132
query92	63	59	57	57
query93	1571	1431	858	858
query94	809	355	316	316
query95	660	462	365	365
query96	1029	775	338	338
query97	2681	2686	2618	2618
query98	212	207	218	207
query99	1168	1155	1024	1024
Total cold run time: 258108 ms
Total hot run time: 171798 ms

@deardeng force-pushed the fix-file-cache-metrics branch from e2e5c86 to c875352 on June 26, 2026 10:55

@deardeng

Collaborator Author

run buildall

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.21 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e2e5c8676eaa115f6d5da27cbd248c3e5f66ce47, data reload: false

query1	0.01	0.01	0.00
query2	0.10	0.05	0.05
query3	0.25	0.13	0.13
query4	1.60	0.13	0.14
query5	0.24	0.24	0.21
query6	1.26	1.08	1.11
query7	0.04	0.00	0.00
query8	0.06	0.04	0.04
query9	0.37	0.33	0.30
query10	0.59	0.58	0.55
query11	0.19	0.14	0.14
query12	0.18	0.14	0.14
query13	0.46	0.47	0.50
query14	1.00	1.01	1.01
query15	0.62	0.60	0.62
query16	0.33	0.31	0.31
query17	1.07	1.13	1.04
query18	0.23	0.21	0.21
query19	2.02	1.98	1.90
query20	0.02	0.01	0.01
query21	15.46	0.24	0.13
query22	4.84	0.05	0.06
query23	16.13	0.30	0.12
query24	3.00	0.41	0.34
query25	0.11	0.06	0.04
query26	0.73	0.20	0.16
query27	0.05	0.04	0.04
query28	3.48	0.95	0.53
query29	12.64	4.34	3.44
query30	0.28	0.15	0.16
query31	2.77	0.60	0.32
query32	3.22	0.60	0.49
query33	3.15	3.23	3.24
query34	15.45	4.16	3.55
query35	3.54	3.50	3.52
query36	0.56	0.47	0.43
query37	0.09	0.06	0.07
query38	0.05	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.15	0.16
query41	0.08	0.03	0.03
query42	0.03	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.56 s
Total hot run time: 25.21 s

@github-actions (bot) left a comment

Contributor

Reviewed the live PR diff for be/src/io/cache/block_file_cache.cpp against base 2893331370b0f4502acfbf83d71e8d11e13ce624 and head c87535241e61b26ce1f4ee3cf2521b369deed1cd.

The change is focused on file-cache eviction bvar observability: _queue_evict_size_metrics is now constructed with the same FileCacheType indexes used by the remove path, and the helper used at increment time is semantically the same enum-to-index mapping. The four enum values remain DISPOSABLE=0, NORMAL=1, INDEX=2, and TTL=3, so the metric names now line up with the queue whose evictions are counted. I also checked adjacent eviction-by-size/time, self-LRU, LRU recorder, and stats paths and did not find a remaining parallel metric-label issue.
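One way to keep names and queues aligned in the future is to derive the name from the enum value and check the layout at compile time. The sketch below is only an illustration of that idea and is not part of this PR: `to_index`, `kAllTypes`, `evict_metric_name`, and `enum_layout_is_dense` are hypothetical helpers, the `normal` and `ttl` metric names are assumed by analogy, and the code assumes a C++17 compiler.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Enum values as listed in the review above.
enum class FileCacheType : std::size_t { DISPOSABLE = 0, NORMAL = 1, INDEX = 2, TTL = 3 };

constexpr std::size_t to_index(FileCacheType t) { return static_cast<std::size_t>(t); }

// Hypothetical list of all queue types, written in enum order.
constexpr std::array<FileCacheType, 4> kAllTypes = {
    FileCacheType::DISPOSABLE, FileCacheType::NORMAL,
    FileCacheType::INDEX, FileCacheType::TTL};

// Hypothetical helper: the metric name is a function of the queue type,
// so no positional list can drift out of step with the enum.
constexpr std::string_view evict_metric_name(FileCacheType t) {
    switch (t) {
    case FileCacheType::DISPOSABLE: return "file_cache_disposable_queue_evict_size";
    case FileCacheType::NORMAL:     return "file_cache_normal_queue_evict_size";
    case FileCacheType::INDEX:      return "file_cache_index_queue_evict_size";
    case FileCacheType::TTL:        return "file_cache_ttl_queue_evict_size";
    }
    return "";
}

// True when the enum values are exactly 0..3, i.e. usable as array indexes.
constexpr bool enum_layout_is_dense() {
    for (std::size_t i = 0; i < kAllTypes.size(); ++i) {
        if (to_index(kAllTypes[i]) != i) return false;
    }
    return true;
}

// Fail the build if someone reorders or renumbers the enum.
static_assert(enum_layout_is_dense(), "FileCacheType values must be 0..3");
static_assert(evict_metric_name(FileCacheType::INDEX) ==
                  "file_cache_index_queue_evict_size",
              "INDEX queue must map to the index metric name");
```

The checks run at compile time, so a future change to the enum values would surface as a build error rather than as silently swapped metrics.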

Critical checkpoint conclusions: the PR accomplishes its stated observability fix; the change is small and localized; no new concurrency, lifecycle, config, compatibility, FE/BE protocol, persistence, data write, or optimizer/rewrite behavior is introduced; existing cache tests cover the relevant queue/eviction paths, though I could not run BE tests in this runner because thirdparty/installed is absent. git diff --check on the live PR range passed.

Subagent conclusions: optimizer-rewrite found no applicable optimizer/rewrite surface and no candidates. tests-session-config found no test, session/config, compatibility, or style candidates. Convergence round 1 ended with both live subagents returning NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.

User focus: no additional user-provided review focus was supplied.

@github-actions

Contributor

Codex automated review failed and did not complete.

Error: Codex completed, but no new pull request review was submitted for the current head SHA.
Workflow run: https://github.com/apache/doris/actions/runs/28230991874

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen

Contributor
TPC-H: Total hot run time: 28562 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c87535241e61b26ce1f4ee3cf2521b369deed1cd, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17859	4004	3952	3952
q2	2010	309	190	190
q3	10304	1396	812	812
q4	4685	464	340	340
q5	7499	871	596	596
q6	184	166	136	136
q7	774	838	625	625
q8	9892	1623	1600	1600
q9	6034	4499	4471	4471
q10	6768	1778	1501	1501
q11	435	276	241	241
q12	661	421	286	286
q13	18105	3322	2676	2676
q14	285	271	235	235
q15	q16	788	773	700	700
q17	2365	1243	689	689
q18	6908	5862	5536	5536
q19	1684	1357	1030	1030
q20	474	397	265	265
q21	5639	2615	2377	2377
q22	433	360	304	304
Total cold run time: 103786 ms
Total hot run time: 28562 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4383	4228	4236	4228
q2	316	346	224	224
q3	4593	4949	4392	4392
q4	2062	2125	1378	1378
q5	4405	4270	4300	4270
q6	237	176	130	130
q7	1731	2074	1639	1639
q8	2599	2150	2124	2124
q9	8064	8105	8104	8104
q10	4820	4728	4291	4291
q11	556	405	531	405
q12	747	737	563	563
q13	3210	3496	2999	2999
q14	290	314	291	291
q15	q16	736	735	654	654
q17	1344	1341	1317	1317
q18	7849	7209	6938	6938
q19	1120	1090	1108	1090
q20	2264	2220	1961	1961
q21	5234	4598	4455	4455
q22	506	468	402	402
Total cold run time: 57066 ms
Total hot run time: 51855 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171473 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c87535241e61b26ce1f4ee3cf2521b369deed1cd, data reload: false

query5	4307	627	489	489
query6	450	192	165	165
query7	4867	582	283	283
query8	333	181	173	173
query9	8764	4012	3990	3990
query10	432	315	261	261
query11	5803	2348	2146	2146
query12	159	101	99	99
query13	1245	601	442	442
query14	6263	5284	4932	4932
query14_1	4282	4285	4264	4264
query15	215	201	189	189
query16	991	403	441	403
query17	906	700	564	564
query18	2417	474	334	334
query19	192	184	146	146
query20	105	108	104	104
query21	215	134	118	118
query22	13617	13590	13602	13590
query23	17215	16603	16116	16116
query23_1	16413	16254	16247	16247
query24	7371	1748	1281	1281
query24_1	1335	1291	1286	1286
query25	532	441	371	371
query26	997	309	164	164
query27	2670	569	358	358
query28	4470	1999	2006	1999
query29	1036	655	522	522
query30	305	234	201	201
query31	1120	1076	945	945
query32	113	64	63	63
query33	537	320	262	262
query34	1201	1201	651	651
query35	762	786	707	707
query36	1375	1364	1227	1227
query37	155	108	95	95
query38	1888	1707	1677	1677
query39	921	907	899	899
query39_1	881	873	876	873
query40	219	128	105	105
query41	71	69	70	69
query42	92	89	88	88
query43	323	321	274	274
query44	1447	802	770	770
query45	201	197	178	178
query46	1134	1229	760	760
query47	2355	2334	2229	2229
query48	431	445	315	315
query49	593	430	317	317
query50	1050	363	263	263
query51	4435	4624	4369	4369
query52	85	82	71	71
query53	263	267	192	192
query54	282	231	209	209
query55	77	72	66	66
query56	242	230	223	223
query57	1442	1405	1323	1323
query58	250	222	224	222
query59	1556	1644	1433	1433
query60	294	260	237	237
query61	176	174	173	173
query62	699	661	589	589
query63	235	195	195	195
query64	2132	843	612	612
query65	4882	4814	4772	4772
query66	1752	452	342	342
query67	28853	28894	28688	28688
query68	3179	1636	973	973
query69	413	287	262	262
query70	1094	976	971	971
query71	289	232	211	211
query72	2957	2623	2275	2275
query73	855	826	451	451
query74	5148	4984	4744	4744
query75	2620	2527	2181	2181
query76	2331	1210	780	780
query77	348	373	285	285
query78	12389	12521	11922	11922
query79	1498	1134	788	788
query80	1231	447	399	399
query81	521	274	238	238
query82	592	150	123	123
query83	324	273	245	245
query84	307	144	114	114
query85	888	511	405	405
query86	423	294	282	282
query87	1850	1847	1764	1764
query88	3697	2823	2772	2772
query89	440	376	339	339
query90	1871	182	176	176
query91	170	154	128	128
query92	61	63	54	54
query93	1693	1454	939	939
query94	697	361	292	292
query95	659	463	338	338
query96	1012	795	352	352
query97	2689	2726	2564	2564
query98	213	209	197	197
query99	1177	1176	1019	1019
Total cold run time: 256243 ms
Total hot run time: 171473 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.23 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c87535241e61b26ce1f4ee3cf2521b369deed1cd, data reload: false

query1	0.00	0.00	0.00
query2	0.10	0.04	0.05
query3	0.25	0.14	0.14
query4	1.61	0.14	0.15
query5	0.26	0.22	0.23
query6	1.24	1.06	1.06
query7	0.03	0.01	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.33
query10	0.59	0.58	0.55
query11	0.19	0.14	0.13
query12	0.18	0.16	0.15
query13	0.46	0.45	0.47
query14	1.00	1.02	1.01
query15	0.61	0.58	0.59
query16	0.32	0.34	0.32
query17	1.11	1.11	1.11
query18	0.24	0.21	0.21
query19	1.99	1.97	1.97
query20	0.02	0.02	0.01
query21	15.45	0.22	0.14
query22	4.93	0.06	0.05
query23	16.15	0.31	0.12
query24	2.92	0.40	0.32
query25	0.13	0.06	0.04
query26	0.74	0.20	0.16
query27	0.04	0.03	0.04
query28	3.49	0.91	0.54
query29	12.50	4.28	3.43
query30	0.27	0.15	0.18
query31	2.78	0.60	0.31
query32	3.22	0.60	0.49
query33	3.23	3.33	3.17
query34	15.57	4.23	3.52
query35	3.51	3.53	3.51
query36	0.57	0.42	0.43
query37	0.09	0.07	0.06
query38	0.05	0.05	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.16
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.66 s
Total hot run time: 25.23 s

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (7/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.19% (28494/38408)
Line Coverage 58.04% (310256/534511)
Region Coverage 54.72% (259080/473476)
Branch Coverage 56.07% (112574/200781)
