[improvement](regression) Use Spark thrift JDBC for external SQL helpers #64886

Open
zgxme wants to merge 11 commits into apache:master from zgxme:spark-sql-0626

Conversation

@zgxme commented Jun 26, 2026

Contributor

What problem does this PR solve?

This PR speeds up the regression cases that use the external SQL helpers by running their Spark SQL through Spark ThriftServer over JDBC instead of docker exec and spark-sql.

Local validation shows the following cache-related cases are significantly faster after this change:

  • test_iceberg_table_cache: 3m20s -> 30s
  • test_paimon_table_meta_cache: 14m59s -> 40s

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

zgxme added 5 commits June 26, 2026 15:01
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63719

Problem Summary: The regression Spark Iceberg and Paimon helpers executed SQL through docker exec and spark-sql, which required local Docker access and repeatedly started Spark SQL clients. This change follows the Spark Iceberg JDBC helper approach from PR apache#63719 and routes Spark Iceberg/Paimon helper execution through Spark ThriftServer with Hive JDBC. Multi-statement execution now reuses one JDBC connection.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - mvn -q -DskipTests compile under regression-test/framework
    - git diff --check -- framework/src/main/groovy/org/apache/doris/regression/suite/Suite.groovy
- Behavior changed: Yes. spark_iceberg, spark_iceberg_multi, and spark_paimon now execute through Spark ThriftServer JDBC instead of docker exec spark-sql.
- Does this need documentation: No
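
The multi-statement behavior described in this commit can be sketched as follows. This is a minimal illustration in Java, not the code in Suite.groovy: the class name `SparkSqlBatch` and both method names are hypothetical, and the statement splitter is deliberately naive. It shows the general idea of opening one JDBC session and running every statement of a script on it.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the implementation in Suite.groovy.
final class SparkSqlBatch {
    private SparkSqlBatch() {}

    // Naive split on ';'. It does not handle semicolons inside string
    // literals or comments, which a real helper would need to consider.
    static List<String> splitStatements(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String sql = part.trim();
            if (!sql.isEmpty()) {
                statements.add(sql);
            }
        }
        return statements;
    }

    // Runs every statement on the caller's connection, so session setup
    // is paid once per script rather than once per statement.
    static int runAll(Connection conn, String script) throws SQLException {
        int executed = 0;
        try (Statement stmt = conn.createStatement()) {
            for (String sql : splitStatements(script)) {
                stmt.execute(sql);
                executed++;
            }
        }
        return executed;
    }
}
```

The caller owns the `Connection`, so the same session can serve several scripts before it is closed.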
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Spark Iceberg helpers opened a new Hive JDBC connection for every spark_iceberg/spark_paimon call. This added repeated Spark ThriftServer session setup overhead in suites that issue many Spark SQL statements. The framework now keeps a Spark Iceberg JDBC connection in SuiteContext thread-local state, creates it on first use, reuses it for later calls in the same suite context thread, and closes it with other context thread-local resources.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: mvn package -B -DskipTests=true -Dmaven.javadoc.skip=true in regression-test/framework; git diff --check
- Behavior changed: Yes. Spark Iceberg/Paimon helper SQL reuses a SuiteContext-local Spark JDBC connection instead of opening one per call.
- Does this need documentation: No
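
The create-on-first-use, reuse, and close lifecycle described in this commit can be sketched roughly as below. The class `ThreadLocalResource` and its methods are hypothetical names chosen for illustration; the actual SuiteContext fields and method names are not shown in this PR description and may differ.

```java
import java.util.function.Supplier;

// Hypothetical holder for illustration; SuiteContext's real members may differ.
final class ThreadLocalResource<T extends AutoCloseable> {
    private final ThreadLocal<T> slot = new ThreadLocal<>();
    private final Supplier<T> factory;

    ThreadLocalResource(Supplier<T> factory) {
        this.factory = factory;
    }

    // Creates the resource on first use in the calling thread, then reuses it.
    T get() {
        T current = slot.get();
        if (current == null) {
            current = factory.get();
            slot.set(current);
        }
        return current;
    }

    // Closes and forgets the calling thread's resource, if one was created.
    // A later get() in the same thread creates a fresh resource.
    void closeCurrent() {
        T current = slot.get();
        if (current == null) {
            return;
        }
        slot.remove();
        try {
            current.close();
        } catch (Exception e) {
            throw new IllegalStateException("failed to close thread-local resource", e);
        }
    }
}
```

For a JDBC connection, the factory would wrap the call that opens the connection and rethrow `SQLException` as an unchecked exception, since `Supplier` cannot throw checked exceptions.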
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The Iceberg docker entrypoint started Spark master and worker before Spark ThriftServer, but the thriftserver command did not specify a Spark master. Without an explicit master, Spark can fall back to local execution, so the standalone master and worker may not be used by Hive JDBC queries. This change starts Spark ThriftServer with --master spark://doris--spark-iceberg:7077 while keeping the Derby system home JVM option unchanged.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
- Behavior changed: Yes. Iceberg Spark ThriftServer now explicitly runs against the standalone Spark master in the docker environment.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The Iceberg Spark docker environment relied on Spark defaults for ThriftServer and spark-sql resource sizing. Those defaults can use too many CPU cores while leaving executor and driver heap at small defaults, and the default shuffle partition count is high for local regression data. This change caps the Spark app at 8 cores, uses 4-core executors with 8g heap, gives the driver 4g heap, disables dynamic allocation explicitly, and reduces default shuffle/parallelism settings for local regression stability.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: git diff --check -- docker/thirdparties/docker-compose/iceberg/spark-defaults.conf
- Behavior changed: Yes. Iceberg Spark docker jobs now use explicit resource and parallelism defaults.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The Iceberg docker entrypoint started Spark ThriftServer before running the preinstalled Spark SQL setup scripts. After moving ThriftServer onto the standalone master, that idle ThriftServer app can reserve executor resources while setup scripts are still running. The ThriftServer also did not receive Iceberg/Paimon SQL extensions, while regression helpers execute Spark SQL through Hive JDBC. This change runs the setup scripts first, then starts ThriftServer with Iceberg and Paimon extensions, and waits for Hive JDBC readiness before marking the container healthy.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl; /bin/sh -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl; git diff --check -- docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
- Behavior changed: Yes. Iceberg Spark ThriftServer starts after preinstalled data setup and waits for JDBC readiness before /mnt/SUCCESS.
- Does this need documentation: No
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Spark 4 thriftserver rejects the previous noSasl JDBC URL and then fails to open sessions against the default Iceberg namespace because demo.default is not created. This makes the Iceberg docker startup loop on the thriftserver readiness check and prevents regression Spark Iceberg JDBC helpers from connecting. Create the default Iceberg namespace before starting thriftserver, use the normal HiveServer2 JDBC URL without auth=noSasl, and fail readiness with useful logs instead of looping forever.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran bash -n and /bin/sh -n for docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
    - Ran git diff --check for modified files
    - Ran mvn package -B -DskipTests=true -Dmaven.javadoc.skip=true in regression-test/framework
- Behavior changed: No
- Does this need documentation: No
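
The change from an endless readiness loop to a bounded check that fails with context can be illustrated as below. The PR makes this change in the shell entrypoint, so the following Java is only a sketch of the same idea; `ReadinessCheck`, `awaitReady`, and the probe are hypothetical names, and the attempt count and sleep interval are parameters the caller would choose.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration of a bounded readiness wait; the PR implements
// its check in entrypoint.sh.tpl, not in Java.
final class ReadinessCheck {
    private ReadinessCheck() {}

    // Calls the probe up to maxAttempts times. Returns the attempt number
    // that succeeded, or throws with the last probe failure as the cause.
    static int awaitReady(Callable<Boolean> probe, int maxAttempts, long sleepMillis) {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(probe.call())) {
                    return attempt;
                }
            } catch (Exception e) {
                lastFailure = e; // kept so the final error explains why
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop waiting if the thread is interrupted
                }
            }
        }
        throw new IllegalStateException(
                "service not ready; gave up after at most " + maxAttempts + " attempts",
                lastFailure);
    }
}
```

A probe for this case would try to open a session with a HiveServer2 URL of the form `jdbc:hive2://<host>:<port>/default`, without the `;auth=noSasl` suffix that this commit removes.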
@zgxme

zgxme commented Jun 26, 2026

Contributor Author

run buildall

@Gabriel39

Contributor

run buildall

yiguolei previously approved these changes Jun 26, 2026
@github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Jun 26, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Add P2 demo regression cases for Iceberg and Paimon. The cases write data through Spark SQL first, then query the same external table through both Doris and Spark, normalizing JDBC result values before comparison to avoid false failures caused by different Java number classes returned by the two JDBC drivers.

### Release note

None

### Check List (For Author)

- Test: Regression test
    - ./run-regression-test.sh --run -d external_table_p2/iceberg -s test_iceberg_spark_doris_consistency_demo
    - ./run-regression-test.sh --run -d external_table_p2/paimon -s test_paimon_spark_doris_consistency_demo
- Behavior changed: No
- Does this need documentation: No
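
The normalization step mentioned in this commit can be sketched as below. The PR description does not show how the demo cases normalize values, so this is one assumed approach: `JdbcValueNormalizer` and its methods are hypothetical names, and mapping every finite number to a `BigDecimal` with trailing zeros stripped is an illustrative choice rather than the PR's confirmed method.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical normalizer for illustration; the demo cases may do this differently.
final class JdbcValueNormalizer {
    private JdbcValueNormalizer() {}

    // Maps numeric values to a canonical BigDecimal so that, for example,
    // Integer 1 from one driver equals Long 1 from the other.
    static Object normalize(Object value) {
        if (value instanceof BigDecimal) {
            return ((BigDecimal) value).stripTrailingZeros();
        }
        if (value instanceof Float || value instanceof Double) {
            double d = ((Number) value).doubleValue();
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                return d; // BigDecimal cannot represent NaN or infinity
            }
            // toString gives the short decimal form, so 0.1f and 0.1d match.
            return new BigDecimal(value.toString()).stripTrailingZeros();
        }
        if (value instanceof Number) {
            // Byte, Short, Integer, Long, BigInteger and similar integral types.
            return new BigDecimal(value.toString()).stripTrailingZeros();
        }
        return value; // strings, dates, nulls and others pass through unchanged
    }

    // Normalizes every column of one result row.
    static List<Object> normalizeRow(List<?> row) {
        List<Object> out = new ArrayList<>(row.size());
        for (Object column : row) {
            out.add(normalize(column));
        }
        return out;
    }
}
```

With both result sets passed through `normalizeRow`, rows can be compared with `List.equals` without failing on number class alone.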
@zgxme requested a review from yiguolei June 26, 2026 11:11
@github-actions bot removed the approved label (Indicates a PR has been approved by one committer.) Jun 26, 2026
@Gabriel39

Contributor

run buildall

zgxme added 4 commits June 27, 2026 19:53
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Paimon preinstalled SQL scripts are executed in a shared Spark SQL session. run06.sql changes the session time zone to +08:00 for timestamp partition coverage but does not restore it before subsequent scripts run. Later Paimon bootstrap data can therefore depend on session state, which can change physical file metadata such as partition file size. Restore the session time zone to UTC at the end of run06.sql so later scripts start from the default time zone.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/paimon/run06.sql
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The Iceberg docker bootstrap was changed to sort preinstalled SQL script paths before generating the Spark SQL source files, and run06.sql restored the session time zone after its timestamp partition setup. Revert those changes so the bootstrap ordering and Paimon setup SQL match the previous behavior while investigating Paimon partition file size differences.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/paimon/run06.sql
- Behavior changed: Yes. Iceberg docker preinstalled SQL path handling returns to the prior unsorted find output behavior, and run06.sql no longer restores session time zone.
- Does this need documentation: No
4 participants