Refact reader branch #64904

Draft
Gabriel39 wants to merge 57 commits into apache:master from Gabriel39:refact_reader_branch

@Gabriel39

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Gabriel39 and others added 30 commits June 26, 2026 09:43
### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: Refactor the file table reader stack around the format_v2 reader implementation. This includes the new file reader abstractions, parquet reader components, table reader adapters for Hive/Iceberg/Paimon/JDBC, ColumnMapper filter and projection handling, expression clone support used by file-local filter rewrites, Iceberg row lineage materialization, and related BE unit tests and design notes. This commit is the squashed result of rebasing the current branch onto master.

### Release note

None

### Check List (For Author)

- Test: No need to test (history rewrite only in this step; no code changes beyond the already rebased branch content)
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

This PR refactors the new parquet reader complex-column path around
schema projection and reader creation.

It clarifies `ParquetColumnReaderFactory` recursion, keeps schema
projection as a single public helper, normalizes file-local
`ColumnDefinition` children to Doris semantic children, and folds the
Parquet MAP `key_value/entry` wrapper into the MAP schema node during
parquet schema construction. With that shape, MAP and LIST both expose
direct semantic children to the table/file reader boundary, and
ParquetReader no longer needs a semantic-to-physical projection
translation layer for MAP.

### Release note

None

### Check List

- Test: Manual test
    - Ran `git diff --check`.
    - Tried `./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.ReadProjectedMapStructValueChildren:ColumnMapperScanRequestTest.MapValueStructProjectionPrunesValueChildren"`, but the local toolchain failed before the tests ran with `ld: library 'c++' not found`.
- Behavior changed: No
- Does this need documentation: No
…ns (apache#64480)

### What problem does this PR solve?

Localize slot-rooted struct element predicates through ColumnMapping so
renamed nested fields rewrite both selector names and projected file
child return types. Keep computed complex parents and evolved
MAP_KEYS-style filters at the table layer instead of generating unsafe
file-local casts.
    
Rebuild complex scan projections and rematerialize struct children in
table type order, add debug-only block sanity checks with contextual
errors, preserve function-call clone state, and handle nullable Iceberg
delete columns.
    
Move ColumnMapper coverage into column_mapper_test and add tests for
complex child projection, nested predicate projection, map/array
fallback, and file-local conjunct localization boundaries.
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63893

Problem Summary: Refresh regression expected outputs for the new Parquet INT96 timestamp semantics. The new reader decodes INT96 timestamps without applying the session timezone offset, so affected Asia/Shanghai-based expectations move 8 hours earlier. This commit updates only the expected result files for the INT96 offset cases in p0 and external suites, while leaving unrelated timestamp failures unchanged.
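For readers unfamiliar with the INT96 layout, the following is a minimal sketch of decoding a Parquet INT96 timestamp without applying any session timezone offset. The function names are hypothetical and are not taken from the Doris reader; the sketch assumes a little-endian host. The second helper only illustrates why an offset-applying decoder and an offset-free decoder differ by exactly 8 hours for Asia/Shanghai.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Julian day number of 1970-01-01, the Unix epoch.
constexpr int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr int64_t kMicrosPerDay = 86400LL * 1000000LL;

// Decode a 12-byte Parquet INT96 timestamp into microseconds since the Unix
// epoch. Layout: 8 bytes of nanoseconds-of-day followed by 4 bytes of Julian
// day number, both little-endian. No timezone offset is applied.
inline int64_t int96_to_unix_micros(const uint8_t* bytes) {
    assert(bytes != nullptr);
    int64_t nanos_of_day = 0;
    int32_t julian_day = 0;
    std::memcpy(&nanos_of_day, bytes, sizeof(nanos_of_day));
    std::memcpy(&julian_day, bytes + 8, sizeof(julian_day));
    return (static_cast<int64_t>(julian_day) - kJulianDayOfUnixEpoch) * kMicrosPerDay +
           nanos_of_day / 1000;
}

// Hypothetical helper: shift a decoded value by a fixed UTC offset, which is
// what an offset-applying decoder would add on top of the raw value.
inline int64_t apply_utc_offset(int64_t micros, int32_t offset_seconds) {
    return micros + static_cast<int64_t>(offset_seconds) * 1000000LL;
}
```

With an offset of `8 * 3600` seconds, `apply_utc_offset` moves a value 8 hours later than the raw decode, which matches the direction of the expected-output change described above.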

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Verified staged diffs are limited to regression expected-output files.
    - Verified 634 changed lines are exactly timestamp minus 8 hours, with paimon c2 handled as a selective LTZ/INT96 column update.
    - Full regression test not run.
- Behavior changed: No, test expectations only
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63893

Problem Summary: The new Parquet reader rejected MAP schemas whose key field is optional with `Unsupported nullable parquet MAP key`, while the old reader only logged a warning and continued. Some external Parquet writers can emit optional MAP keys, and the v2 schema builder already preserves definition levels and exposes MAP key/value types as nullable.

This change removes the schema-level hard rejection for optional MAP keys while keeping the existing structural MAP layout checks.

### Release note

Allow the new Parquet reader to read external Parquet MAP columns with optional key fields.

### Check List (For Author)

- Test: Manual test
    - `build-support/clang-format.sh be/src/format_v2/parquet/parquet_column_schema.cpp`
    - `git diff --check`
- Behavior changed: Yes, the new Parquet reader no longer rejects optional MAP key schemas.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63893

Problem Summary: RuntimeFilterExpr is a wrapper around the concrete runtime filter predicate, but its normal column execution path returned `Not implement RuntimeFilterExpr::execute_column_impl`. Partition pruning for external tables can evaluate runtime filter wrappers as ordinary expressions, so Hive/Iceberg/Paimon runtime filter partition pruning failed before evaluating the wrapped predicate.

This change delegates RuntimeFilterExpr::execute_column_impl to the wrapped implementation. The scan filter path still uses execute_filter, preserving the existing selectivity counters and runtime-filter-specific row filtering behavior.
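The delegation pattern described here can be sketched roughly as follows. All types below (`Status`, `Expr`, `GreaterThan`, `RuntimeFilterWrapper`) are simplified, hypothetical stand-ins and do not reflect the real Doris class signatures; the point is only that the wrapper forwards ordinary column execution to the predicate it wraps instead of returning an error.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the real Doris classes.
struct Status {
    bool ok = true;
};

struct Expr {
    virtual ~Expr() = default;
    // Evaluate the expression over `input`, writing one flag per row.
    virtual Status execute_column(const std::vector<int>& input,
                                  std::vector<bool>* out) const = 0;
};

// A concrete predicate: value > bound.
struct GreaterThan : Expr {
    explicit GreaterThan(int bound) : bound_(bound) {}
    Status execute_column(const std::vector<int>& input,
                          std::vector<bool>* out) const override {
        assert(out != nullptr);
        out->clear();
        for (int v : input) out->push_back(v > bound_);
        return Status{};
    }
    int bound_;
};

// Wrapper around the concrete predicate. Rather than reporting
// "not implemented", the ordinary execution path forwards to the wrapped
// expression so callers that treat it as a plain expression still work.
struct RuntimeFilterWrapper : Expr {
    explicit RuntimeFilterWrapper(std::shared_ptr<Expr> impl) : impl_(std::move(impl)) {}
    Status execute_column(const std::vector<int>& input,
                          std::vector<bool>* out) const override {
        return impl_->execute_column(input, out);
    }
    std::shared_ptr<Expr> impl_;
};
```

A separate filter-specific entry point (the `execute_filter` path mentioned above) would keep its own counters; it is omitted from this sketch.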

### Release note

Fix runtime filter wrapper expression execution for external table partition pruning.

### Check List (For Author)

- Test: Manual test
    - `build-support/clang-format.sh be/src/exprs/runtime_filter_expr.cpp`
    - `git diff --check`
- Behavior changed: Yes, runtime filter wrappers can now be evaluated through the normal expression execution path.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: The new parquet reader resolved decimal columns with precision greater than 38 to TYPE_DECIMAL256, but then marked those columns as unsupported for the scalar record reader. This made external parquet scans fail with errors such as "Current parquet scalar reader does not support column amount" even though the old parquet reader and the decoded decimal serde already support Decimal256 for the common parquet decimal physical carriers. Remove the extra precision-based block so Decimal256 columns can use the existing record reader and decoded serde path, while unsupported physical types remain rejected.
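As an illustration of the precision boundary involved, here is a small sketch that picks a decimal carrier from the declared precision. The enum and function names are hypothetical; the thresholds 9, 18, 38, and 76 are assumed to be the maximum precisions of the Doris Decimal32, Decimal64, Decimal128, and Decimal256 types.

```cpp
#include <cassert>
#include <optional>

// Hypothetical enum naming the four decimal carriers.
enum class DecimalKind { Decimal32, Decimal64, Decimal128, Decimal256 };

// Return the smallest carrier that can hold `precision` decimal digits, or
// std::nullopt when the precision is outside the supported range.
inline std::optional<DecimalKind> decimal_kind_for_precision(int precision) {
    if (precision < 1 || precision > 76) return std::nullopt;
    if (precision <= 9) return DecimalKind::Decimal32;
    if (precision <= 18) return DecimalKind::Decimal64;
    if (precision <= 38) return DecimalKind::Decimal128;
    // Precision 39..76: the range that was previously rejected by the
    // extra precision-based check in the scalar record reader.
    return DecimalKind::Decimal256;
}
```

Under these assumptions, precision 39 resolves to `Decimal256` rather than being treated as unsupported, while anything above 76 is still rejected.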

### Release note

Support Decimal256 parquet columns in the new parquet reader.

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh be/src/format_v2/parquet/parquet_type.cpp
    - Ran git diff --check
- Behavior changed: Yes (new parquet reader now accepts parquet decimal columns with precision greater than 38 when they fit Doris Decimal256)
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: The new parquet reader created the FilteredRowsByLazyRead profile counter but did not pass it to the scan scheduler or update it after predicate filtering. As a result lazy materialization profile tests saw zero filtered lazy-read rows even when non-predicate columns were read lazily. Pass the counter into ParquetScanScheduler and update it by the number of rows filtered by conjuncts when non-predicate columns are present.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh be/src/format_v2/parquet/parquet_scan.h be/src/format_v2/parquet/parquet_scan.cpp be/src/format_v2/parquet/parquet_reader.cpp
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: Clean up format_v2 code style after the new parquet reader changes. This change fixes local tidy-style issues in the parquet selection vector by using std::cmp_greater/std::cmp_greater_equal for mixed signed and unsigned comparisons and designated initializers for RowRange construction. It also removes unused include directives from format_v2 implementation files. The format_v2 directory was checked with the repository clang-format script, and the include cleanup was validated by compiling the Fedora DEBUG Format target.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh be/src/format_v2
    - Ran git diff --check
    - Ran targeted clang-tidy on Fedora with checks modernize-use-integer-sign-comparison and modernize-use-designated-initializers for be/src/format_v2/parquet/selection_vector.h using the DEBUG compile database without PCH
    - Ran clang-include-cleaner on selected format_v2 implementation files to collect remove candidates
    - Ran /home/socrates/ldb_toolchain/bin/ninja Format in /home/socrates/code/doris/be/build_DEBUG on Fedora
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: TeamCity external regression build 970191 still had several expected output files using old timestamp values. The new parquet timestamp semantics return the corrected values for the affected external table cases, including Hive, Iceberg, Paimon, and TVF parquet result files. This commit refreshes the corresponding regression expected outputs from the observed CI results and keeps unrelated non-timestamp failures untouched.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Compared TeamCity build 970191 failure details with the updated expected output files. Full regression test was not rerun locally.
- Behavior changed: No
- Does this need documentation: No
Struct type should use name mode with nested types
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet profile definitions and wiring were split across ParquetReader, ParquetScan, and column reader headers. This made ParquetReader own counter initialization, pruning counter updates, and scheduler sub-profile assembly directly even though parquet_profile.h already existed for profile-related types. This change centralizes the new parquet RuntimeProfile counter ownership in parquet_profile.h/.cpp and keeps ParquetReader responsible only for invoking the profile helper methods.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh for touched files.
    - Ran git diff --check.
    - Tried ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*', but local CMake compiler detection failed before building Doris because /opt/homebrew/opt/llvm@16/bin/clang++ could not link a simple program: ld: library 'c++' not found.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Align format_v2 implementation namespaces with the format_v2 ownership boundary. Parquet, Hive, Paimon, Iceberg, and JDBC implementations now live under doris::format subnamespaces, while shared format_v2 expression helpers live under doris::format. Call sites and tests were updated to use the new namespace layout.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh on the modified BE files
    - Ran git diff --check
    - Ran namespace residue scans for old doris::parquet/hive/paimon/jdbc/iceberg namespaces and duplicate format::format references
    - Attempted targeted BE UT with ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*:ParquetColumnReaderTest.*:TableReaderTest.*:CastTest.*:DeletePredicateTest.*:EqualityDeletePredicateTest.*', but local CMake compiler detection failed before Doris code compiled because /opt/homebrew/opt/llvm@16/bin/clang++ could not link libc++: ld: library 'c++' not found
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: The new parquet reader reports timestamp values with the updated INT96 timestamp interpretation for existing external parquet coverage. This commit updates the affected regression expected outputs from the latest TeamCity P0 and external regression real outputs. Doris parquet export/write cases with suspicious timestamp offsets are intentionally excluded because those require separate writer-side analysis.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Validated modified expected rows against TeamCity builds 970619 and 970620 failure logs, and ran `git diff --check`.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63893

Problem Summary: The new parquet reader did not map TIMESTAMP(NANOS) logical columns to a supported Doris timestamp type, and DATETIMEV2 decoded INT64 timestamp values only handled millis and micros. As a result Hive parquet timestamp nanos data was materialized as NULL instead of the expected timestamp values. This change maps parquet timestamp nanos to DATETIMEV2(6), decodes nanos by truncating to microseconds, and adds decoded-value coverage for DATETIMEV2 nanos. It also refreshes the external TVF group4 expected output for a parquet file containing BC timestamp values that Doris cannot represent, where the new reader correctly returns NULL for those rows.
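A minimal sketch of reducing nanosecond timestamps to microsecond precision follows. The summary above says only that nanos are truncated to microseconds; it does not say how values before the epoch are rounded, so both common variants are shown and neither is claimed to be what the reader does. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Drop sub-microsecond digits by integer division, which rounds toward zero.
inline int64_t nanos_to_micros_truncate(int64_t nanos) {
    return nanos / 1000;
}

// Drop sub-microsecond digits by rounding toward negative infinity, so
// pre-epoch (negative) values move to the earlier microsecond.
inline int64_t nanos_to_micros_floor(int64_t nanos) {
    int64_t micros = nanos / 1000;
    if (nanos % 1000 < 0) --micros;
    return micros;
}
```

The two variants agree for every non-negative input and differ only for negative inputs that are not a whole number of microseconds.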

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran `git diff --check`.
    - Verified the relevant parquet files with DuckDB to confirm timestamp nanos and BC timestamp source values.
    - Attempted `./run-be-ut.sh --run '--filter=DataTypeSerDeDecodedValuesTest.*'`, but local CMake failed before compiling tests because the macOS toolchain cannot link a simple C++ program: `ld: library 'c++' not found`.
- Behavior changed: Yes. The new parquet reader now reads TIMESTAMP(NANOS) values as DATETIMEV2(6) instead of producing NULL through unsupported conversion.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Refine the new Parquet reader row group pruning flow so scan range filtering is applied before more expensive statistics, dictionary, bloom filter, and page index pruning. Also document the Parquet reader, scan scheduler, statistics pruning, and nested column reader APIs, and update affected namespace references in BE tests.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh on touched BE C++ files and git diff --check locally.
    - Started BE UT validation on Fedora with NewParquetReaderTest.* and ParquetBloomFilterPruningTest.*; fixed compile issues found during validation. The full rerun was interrupted before completion by a follow-up history cleanup request.
- Behavior changed: No
- Does this need documentation: No
Rewrite comments for the entry-point and foundational modules:

parquet_reader.h:
- Class-level doc: role boundary, lifecycle (init→get_schema→open→get_block→close)
- TableReader calling relationship explained
- Each method and field annotated

parquet_type.h:
- ParquetExtraTypeInfo: each variant documented
- ParquetTypeDescriptor: full field-by-field descriptions
- Three-level resolution priority (logical→converted→physical) explained
- resolve_parquet_type / supports_record_reader / decoded_value_kind docs

parquet_column_schema.h:
- Class-level doc: design decisions (wrapper folding, nullable, Dremel levels)
- All fields grouped into sections (identifier / type / levels / children)
- Each field annotated with its role and valid domain (PRIMITIVE vs complex)

parquet_column_schema.cpp:
- SchemaBuildContext fields annotated

parquet_file_context.cpp:
- DorisRandomAccessFile adapter class documented

parquet_profile.h:
- All Profile structs with section-based Chinese comments
- Counter groups organized (RG pruning / page skip / batch read / column reader /
  file ops / decompress & cache / decode / others)

Co-Authored-By: Claude <noreply@anthropic.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new Parquet scanner does not implement condition cache yet, so the parquet condition cache regression case can fail when the session uses FileScannerV2. Force this case to use the old file scanner path before enabling condition cache and profile checks.

### Release note

None

### Check List (For Author)

- Test: No need to test
    - Regression-only session variable adjustment for an existing case.
- Behavior changed: No
- Does this need documentation: No
Gabriel39 and others added 12 commits June 26, 2026 09:43
Problem Summary: The TimeStampTz protobuf unit test constructed the type
with auto, so the static type was std::shared_ptr<DataTypeTimeStampTz>.
Calling to_protobuf(PTypeDesc*) through that concrete type failed
because the three-argument override hides the base-class one-argument
wrapper. Use DataTypePtr so the test exercises the IDataType public
interface that owns the one-argument wrapper.
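The name-hiding behavior behind this fix can be reproduced with a small sketch. The class shapes and the extra `precision`/`scale` parameters are hypothetical stand-ins; only the overload structure (a one-argument base wrapper plus a three-argument virtual) follows the description above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the protobuf descriptor.
struct PTypeDesc {
    int node_count = 0;
};

struct IDataType {
    virtual ~IDataType() = default;
    // Public one-argument wrapper that forwards to the virtual overload.
    void to_protobuf(PTypeDesc* desc) const { to_protobuf(desc, 0, 0); }
    virtual void to_protobuf(PTypeDesc* desc, int /*precision*/, int /*scale*/) const {
        desc->node_count += 1;
    }
};

struct DataTypeTimeStampTz : IDataType {
    // Declaring `to_protobuf` here hides every base-class overload with the
    // same name, including the one-argument wrapper, for lookups that start
    // from the derived static type.
    void to_protobuf(PTypeDesc* desc, int /*precision*/, int /*scale*/) const override {
        desc->node_count += 2;
    }
};

// Pointer to the interface type; name lookup starts in IDataType, so the
// one-argument wrapper is visible and dispatches virtually.
using DataTypePtr = std::shared_ptr<const IDataType>;

// This would NOT compile, because the wrapper is hidden:
//   auto t = std::make_shared<DataTypeTimeStampTz>();
//   t->to_protobuf(&desc);
```

An alternative to changing the static type is a `using IDataType::to_protobuf;` declaration in the derived class, which brings the hidden overload back into scope.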
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Bare repeated parquet fields can encode an empty repeated parent with a low definition-level shape slot. The nested LIST and MAP builders filtered out every slot whose definition level was below the repeated ancestor level before accounting for parent rows, so an empty repeated primitive field could report zero output rows for a row group that should produce one row. Preserve low-definition slots that start a parent row while still skipping unrelated nested slots, and keep MAP scalar value alignment consistent with the key stream.
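To make the row-accounting issue concrete, here is a simplified sketch for a single-level repeated field. It is not the Doris LIST/MAP builder: the `LevelSlot` struct and function name are hypothetical, and deeper nesting (the "unrelated nested slots" mentioned above) is not modeled. The key point is that a slot with repetition level 0 starts a parent row even when its definition level is too low to carry an element.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One (repetition level, definition level) pair from the column stream.
struct LevelSlot {
    int16_t rep_level;
    int16_t def_level;
};

// Build list offsets (one entry per row, plus a leading 0) for a column whose
// repeated ancestor has definition level `repeated_def_level`.
inline std::vector<int64_t> build_list_offsets(const std::vector<LevelSlot>& slots,
                                               int16_t repeated_def_level) {
    std::vector<int64_t> offsets{0};
    for (const LevelSlot& slot : slots) {
        if (slot.rep_level == 0) {
            // A new parent row begins here, even if the list is empty.
            // Filtering on def_level before this step would drop the row.
            offsets.push_back(offsets.back());
        }
        if (slot.def_level >= repeated_def_level) {
            // The slot carries an actual element of the current row.
            ++offsets.back();
        }
    }
    return offsets;
}
```

For a bare repeated primitive, an empty parent is encoded as one slot with levels `(0, 0)`; the sketch yields offsets `{0, 0}`, that is, one row with zero elements rather than zero rows.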

### Release note

None

### Check List (For Author)

- Test: Regression test / Unit Test
    - Regression test: on Fedora, ran external_table_p0/tvf/test_hdfs_parquet_group2 after rebuilding DEBUG BE; the original failing test_10 proto-struct-with-array.parquet returned the expected row with repeatedprimitive=[]. The full suite then failed later at unrelated timestamp-nanos expected output.
    - Unit Test: added ParquetColumnReaderControlTest cases for empty bare repeated primitive LIST rows and empty MAP rows. Background Fedora UT compilation was started but not confirmed complete because SSH to Fedora timed out.
- Behavior changed: Yes. Empty repeated parquet LIST/MAP parent rows are preserved instead of being dropped.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The new Parquet reader passed small-width logical integer annotations such as INT(8, false) and INT(16, false) to numeric SerDe as plain INT32 values. Physical values that should be interpreted with Parquet logical bit-width cast semantics could then be preserved or nulled by Doris target-type range checks instead of matching Arrow and the old reader behavior. This change carries logical integer bit width and signedness through DecodedColumnView, applies the logical cast before Doris materialization, and uses the same metadata when converting Parquet statistics min/max fields.
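A rough sketch of applying a logical bit-width cast to an INT32 physical value before range checks is shown below. The function name and signature are hypothetical, and the sketch assumes the usual two's-complement narrowing behavior (guaranteed from C++20 onward); it is meant to illustrate the idea, not to mirror the `DecodedColumnView` code.

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret an INT32 physical value according to a Parquet logical
// annotation INT(bit_width, is_signed) and widen the result to int64_t.
inline int64_t apply_logical_int_cast(int32_t physical, int bit_width, bool is_signed) {
    if (bit_width == 8) {
        if (is_signed) return static_cast<int64_t>(static_cast<int8_t>(physical));
        return static_cast<int64_t>(static_cast<uint8_t>(physical));
    }
    if (bit_width == 16) {
        if (is_signed) return static_cast<int64_t>(static_cast<int16_t>(physical));
        return static_cast<int64_t>(static_cast<uint16_t>(physical));
    }
    if (bit_width == 32 && !is_signed) {
        return static_cast<int64_t>(static_cast<uint32_t>(physical));
    }
    // INT(32, true) and anything unrecognized: keep the physical value.
    return static_cast<int64_t>(physical);
}
```

For example, a physical value of `-1` annotated as `INT(8, false)` becomes 255 after the cast, so a later target-type range check sees the logical value rather than the raw INT32.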

### Release note

Fix Parquet logical integer decoding for small-width signed and unsigned annotations in the new reader.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added decoded-value unit tests for signed and unsigned Parquet logical integer casts, including the field path used by statistics.
    - Ran git diff --check.
    - Attempted ./run-be-ut.sh --run --filter='DataTypeSerDeDecodedValuesTest.ReadUnsignedLogicalIntegersCastsPhysicalValues:DataTypeSerDeDecodedValuesTest.ReadSignedLogicalIntegersCastsPhysicalValues:DataTypeSerDeDecodedValuesTest.ReadFieldLogicalIntegerCastsPhysicalValue', but local macOS toolchain failed during CMake compiler detection with ld: library 'c++' not found before building Doris tests.
- Behavior changed: Yes. New Parquet reader now matches Parquet logical integer bit-width cast semantics for annotated integers.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63781, apache#64671

Problem Summary: File scanner v2 did not carry the same fixes as the
existing file scanner path. Predicate rows filtered inside v2 file
readers were still reported through scanner load counters unless the
scanner was a real load source, and Hive TEXTFILE empty physical lines
were still skipped unless read_csv_empty_line_as_null was enabled. This
change gates v2 load counter reporting with the same FILE_STREAM
exception used by FileScanner and adds a delimited text hook so Hive
Text v2 treats empty physical lines as records while CSV keeps the old
default behavior.
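The empty-line hook can be pictured with a small sketch. The `EmptyLinePolicy` enum and `split_records` function are hypothetical; they only illustrate the difference between skipping empty physical lines (the CSV default) and treating them as records (Hive TEXTFILE).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical policy selected by the format-specific reader.
enum class EmptyLinePolicy { Skip, AsRecord };

// Split `data` on '\n' and decide per policy whether an empty physical line
// produces a record. A trailing newline does not create an extra record.
inline std::vector<std::string> split_records(const std::string& data,
                                              EmptyLinePolicy policy) {
    std::vector<std::string> records;
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = data.find('\n', start);
        if (end == std::string::npos) end = data.size();
        std::string line = data.substr(start, end - start);
        if (!line.empty() || policy == EmptyLinePolicy::AsRecord) {
            records.push_back(line);
        }
        start = end + 1;
    }
    return records;
}
```

For the input `"a\n\nb\n"`, the `AsRecord` policy yields three records (the middle one empty), while `Skip` yields two.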

### Release note

Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line
handling.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown.
    - Ran git diff --check.
    - Ran clang-format v16 through build-support/run_clang_format.py for changed files.
    - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built.
    - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available.
- Behavior changed: Yes. File scanner v2 now matches v1 load counter
gating and Hive TEXTFILE empty-line semantics.
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 reads Parquet through Arrow, so the old
vparquet page cache path is not used. Repeated scans still go through
the Doris file reader for serialized Parquet column chunk data even when
the Parquet page cache option is enabled. This change registers the
selected Parquet column chunk byte ranges after row-group planning and
lets the Arrow RandomAccessFile adapter reuse StoragePageCache for reads
inside those ranges. Footer and metadata reads happen before range
registration and are intentionally excluded.
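The range-registration idea can be sketched as follows. `ByteRange` and `CacheableRanges` are hypothetical names, and the real adapter presumably does more (cache keys, partial hits, eviction); this only shows the containment check that decides whether a read may be served through the page cache.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A contiguous byte span inside the file.
struct ByteRange {
    int64_t offset = 0;
    int64_t length = 0;
};

// Hypothetical registry of column chunk byte ranges selected by row-group
// planning. Reads fully inside a registered range are cache candidates.
class CacheableRanges {
public:
    void register_range(ByteRange range) {
        assert(range.offset >= 0 && range.length > 0);
        ranges_.push_back(range);
    }

    // True when [offset, offset + length) lies entirely within one range.
    bool covers(int64_t offset, int64_t length) const {
        for (const ByteRange& r : ranges_) {
            if (offset >= r.offset && offset + length <= r.offset + r.length) {
                return true;
            }
        }
        return false;
    }

private:
    std::vector<ByteRange> ranges_;
};
```

Because footer and metadata reads happen before any range is registered, `covers` returns false for them and they bypass the cache, consistent with the exclusion described above.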

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check.
    - Ran build-support/run_clang_format.py with clang-format 16 on modified BE files.
    - Could not compile with existing be/cmake-build-debug-dev-perf because CMakeCache.txt was generated for /mnt/disk3/gabriel/Workspace/dev1/doris and the configured ninja path is not available in this worktree.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: FileScannerV2 did not support Doris Native files. This
change adds a native v2 FileReader implementation instead of wrapping
the legacy NativeReader. The reader validates the Native header, reads
serialized PBlock payloads, caches and replays the first block for
schema probing, exposes nullable file-local schema, projects requested
columns, and applies file-local filters. Shared materialized-column
filtering is also used by JSON and delimited text readers so predicate
accounting stays consistent. WAL is intentionally not implemented on the
v2 path because current group commit WAL scans are load scans and
FileScanOperator only selects FileScannerV2 when src_tuple_id does not
resolve to an input tuple.

### Release note

None

### Check List (For Author)

- Test:
    - Style check: build-support/check-format.sh
    - Unit Test: not run locally because sandbox execution cannot write .git/modules for submodule setup and cannot download datasketches-cpp; the attempted run-be-ut command failed before compiling tests.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: FileScannerV2 could not read Remote Doris Arrow Flight splits because FORMAT_ARROW was not routed to a v2 table reader and no v2-native Remote Doris reader existed. This change adds a Remote Doris TableReader/FileReader implementation for FileScannerV2 that opens Arrow Flight streams directly, builds the file-local schema from planned file slots, materializes Arrow RecordBatch data by column name into the v2 file-local block, applies localized filters through the v2 materialized-reader helper, validates protocol mismatches, and closes Flight resources. FORMAT_ARROW is enabled in FileScannerV2 only for table_format_type=remote_doris so ordinary Arrow stream files stay on the existing path.

### Release note

Support Remote Doris scans in FileScannerV2 when FileScannerV2 is enabled.

### Check List (For Author)

- Test: Manual test
    - BE unit test: attempted PARALLEL=1 ./run-be-ut.sh --run --filter='FileScannerV2Test.*:RemoteDorisV2ReaderTest.*', but the sandbox could not update .git/modules/contrib/datasketches-cpp and network fallback to github.com was unavailable; escalated retries timed out in approval review.
    - Manual test: python3 build-support/run_clang_format.py --clang-format-executable /usr/local/opt/llvm@16/bin/clang-format --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none <modified files>
- Behavior changed: Yes. Remote Doris FORMAT_ARROW scan ranges can be routed to FileScannerV2.
- Does this need documentation: No
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39

Contributor Author

run buildall

@Gabriel39 Gabriel39 marked this pull request as draft June 26, 2026 13:48
@hello-stephen

Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 77.38% (1895/2449) |
| Line Coverage | 64.45% (33999/52753) |
| Region Coverage | 64.86% (17494/26973) |
| Branch Coverage | 54.06% (9377/17344) |

@hello-stephen

Contributor
TPC-H: Total hot run time: 29346 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4ec377cdd4adcdcabe27109438b0074f27b236dd, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17297	4087	4014	4014
q2	1995	310	185	185
q3	10248	1409	832	832
q4	4675	466	337	337
q5	7609	844	578	578
q6	184	165	135	135
q7	769	840	625	625
q8	9388	1708	1528	1528
q9	5536	4471	4456	4456
q10	6748	1782	1517	1517
q11	435	284	237	237
q12	627	417	298	298
q13	18880	3346	2836	2836
q14	271	272	257	257
q15	q16	801	794	727	727
q17	963	938	1006	938
q18	7047	6044	5703	5703
q19	1171	1214	1134	1134
q20	496	399	274	274
q21	5715	2681	2435	2435
q22	436	355	300	300
Total cold run time: 101291 ms
Total hot run time: 29346 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4368	4285	4273	4273
q2	320	338	220	220
q3	4600	4917	4379	4379
q4	2055	2138	1378	1378
q5	4403	4299	4292	4292
q6	235	172	128	128
q7	1744	1615	1891	1615
q8	2528	2210	2099	2099
q9	8233	8052	8105	8052
q10	4830	4761	4313	4313
q11	584	415	378	378
q12	747	753	539	539
q13	3306	3524	2993	2993
q14	326	323	270	270
q15	q16	745	763	632	632
q17	1365	1329	1375	1329
q18	8158	7461	7077	7077
q19	1139	1087	1133	1087
q20	2277	2255	1983	1983
q21	5252	4559	4469	4469
q22	519	455	424	424
Total cold run time: 57734 ms
Total hot run time: 51930 ms
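
The two totals in each round can be recomputed from the rows. The following is a minimal sketch, assuming the four numeric columns are the cold run, two hot runs, and the best hot run, that the cold total sums the first column, and that the hot total sums the smaller of the two hot runs. `summarize` and `SAMPLE` are hypothetical names used for illustration and are not part of the Doris benchmark tooling.

```python
# Hypothetical helper (not part of the Doris tools): recompute the
# "Total cold run time" and "Total hot run time" lines from report rows.
SAMPLE = """\
q1	17297	4087	4014	4014
q2	1995	310	185	185
q17	963	938	1006	938
"""


def summarize(report: str) -> tuple[int, int]:
    """Return (cold_total_ms, hot_total_ms) for whitespace-separated rows."""
    cold_total = 0
    hot_total = 0
    for line in report.splitlines():
        fields = line.split()
        timings = fields[-4:]
        # Skip separators, totals, and any row without four trailing integers.
        if len(fields) < 5 or not all(t.isdigit() for t in timings):
            continue
        cold, hot1, hot2, _best = (int(t) for t in timings)
        cold_total += cold
        # Assumption: the hot total uses the faster of the two hot runs.
        hot_total += min(hot1, hot2)
    return cold_total, hot_total


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Reading the timings from the end of each row lets the sketch tolerate rows that carry more than one label, such as the `q15	q16` row above.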

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171893 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4ec377cdd4adcdcabe27109438b0074f27b236dd, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4318	635	480	480
query6	451	197	173	173
query7	4825	550	297	297
query8	350	184	169	169
query9	8780	4104	4117	4104
query10	451	309	284	284
query11	5697	2382	2129	2129
query12	151	103	124	103
query13	1226	603	431	431
query14	6268	5346	4930	4930
query14_1	4332	4283	4309	4283
query15	225	201	177	177
query16	998	457	431	431
query17	941	714	580	580
query18	2428	477	337	337
query19	200	186	139	139
query20	111	107	102	102
query21	218	147	116	116
query22	13671	13612	13494	13494
query23	17415	16470	16118	16118
query23_1	16310	16254	16475	16254
query24	7338	1837	1312	1312
query24_1	1344	1324	1329	1324
query25	572	472	412	412
query26	1343	315	173	173
query27	2651	546	336	336
query28	4425	2031	2029	2029
query29	1095	657	504	504
query30	316	239	196	196
query31	1133	1107	954	954
query32	109	62	61	61
query33	532	328	258	258
query34	1169	1163	673	673
query35	791	778	690	690
query36	1377	1353	1178	1178
query37	158	112	97	97
query38	1909	1728	1682	1682
query39	946	942	902	902
query39_1	899	874	875	874
query40	217	129	108	108
query41	73	69	71	69
query42	93	96	90	90
query43	338	338	292	292
query44	1508	791	788	788
query45	210	196	185	185
query46	1103	1248	722	722
query47	2311	2318	2182	2182
query48	406	413	305	305
query49	578	455	316	316
query50	987	370	261	261
query51	4562	4451	4294	4294
query52	83	82	69	69
query53	253	265	191	191
query54	268	231	214	214
query55	76	70	69	69
query56	248	237	220	220
query57	1446	1384	1308	1308
query58	236	204	217	204
query59	1611	1672	1440	1440
query60	284	243	224	224
query61	158	150	151	150
query62	696	654	584	584
query63	225	192	196	192
query64	2518	763	637	637
query65	4883	4819	4771	4771
query66	1786	466	351	351
query67	29190	28889	28853	28853
query68	3209	1526	983	983
query69	398	312	274	274
query70	1062	941	979	941
query71	291	241	216	216
query72	3133	2678	2386	2386
query73	840	754	420	420
query74	5152	4953	4763	4763
query75	2614	2554	2185	2185
query76	2242	1286	826	826
query77	358	398	294	294
query78	12430	12395	11841	11841
query79	1428	1242	804	804
query80	600	469	381	381
query81	457	283	244	244
query82	564	159	123	123
query83	351	280	247	247
query84	306	150	117	117
query85	869	561	442	442
query86	376	318	311	311
query87	1890	1820	1782	1782
query88	3743	2804	2768	2768
query89	449	401	337	337
query90	1861	180	183	180
query91	166	164	134	134
query92	66	61	60	60
query93	1519	1584	865	865
query94	538	366	285	285
query95	678	384	353	353
query96	1122	751	330	330
query97	2707	2686	2574	2574
query98	227	208	202	202
query99	1185	1151	1056	1056
Total cold run time: 257124 ms
Total hot run time: 171893 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.98 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4ec377cdd4adcdcabe27109438b0074f27b236dd, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.01	0.01	0.01
query2	0.13	0.08	0.08
query3	0.37	0.24	0.24
query4	1.60	0.24	0.25
query5	0.33	0.31	0.31
query6	1.15	0.67	0.67
query7	0.04	0.01	0.00
query8	0.11	0.08	0.08
query9	0.50	0.38	0.39
query10	0.58	0.58	0.57
query11	0.32	0.19	0.18
query12	0.32	0.19	0.19
query13	0.53	0.54	0.52
query14	0.93	0.93	0.92
query15	0.69	0.60	0.60
query16	0.38	0.39	0.39
query17	0.99	1.01	1.02
query18	0.30	0.30	0.29
query19	1.94	1.83	1.82
query20	0.02	0.02	0.01
query21	15.39	0.38	0.32
query22	4.78	0.13	0.14
query23	15.81	0.50	0.30
query24	2.41	0.62	0.43
query25	0.15	0.10	0.10
query26	0.75	0.27	0.22
query27	0.11	0.10	0.10
query28	3.49	0.94	0.54
query29	12.46	4.45	3.52
query30	0.37	0.26	0.25
query31	2.76	0.63	0.34
query32	3.24	0.62	0.48
query33	2.92	2.98	2.99
query34	15.82	4.10	3.37
query35	3.28	3.29	3.29
query36	0.63	0.52	0.51
query37	0.12	0.09	0.10
query38	0.08	0.07	0.07
query39	0.08	0.06	0.07
query40	0.20	0.17	0.17
query41	0.13	0.09	0.08
query42	0.09	0.06	0.06
query43	0.08	0.06	0.07
Total cold run time: 96.39 s
Total hot run time: 25.98 s

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 66.67% (4/6) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 50.00% (3/6) 🎉
Increment coverage report
Complete coverage report
