
Delta pipeline fix tests #12380

Closed
felipepessoto wants to merge 3 commits into apache:main from felipepessoto:delta_pipeline_fix_tests



@felipepessoto

Contributor

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI
and gate the results against a committed baseline so the many expected Delta-on-
Gluten failures stay manageable and can be fixed incrementally without letting
currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):
- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta
  spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and
  gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (tests that failed
  but are not in the baseline) fail the build; baselined tests that now pass are
  flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches
  DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests
  whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.
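
The gating rule described for compare-test-results.py comes down to two set operations per shard. The following is a minimal sketch of that rule only, not the script from this PR: the function names, the test identifiers, and the assumption that results arrive as plain lists of test names are all hypothetical.

```python
from typing import Iterable, List, Tuple


def gate_shard(
    failed: Iterable[str],
    passed: Iterable[str],
    baseline: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Classify one shard's results against the known-failures baseline.

    Returns (regressions, newly_passing):
    - regressions: tests that failed but are not in the baseline
    - newly_passing: baselined tests that passed in this run
    """
    failed_set, passed_set, baseline_set = set(failed), set(passed), set(baseline)
    regressions = sorted(failed_set - baseline_set)
    newly_passing = sorted(baseline_set & passed_set)
    return regressions, newly_passing


def exit_code(regressions: List[str]) -> int:
    """Only regressions fail the build; newly passing tests are just reported."""
    return 1 if regressions else 0


# Hypothetical test identifiers, for illustration only.
baseline = ["SuiteA.test1", "SuiteA.test2"]
failed = ["SuiteA.test1", "SuiteB.test9"]
passed = ["SuiteA.test2", "SuiteB.test3"]

regressions, newly_passing = gate_shard(failed, passed, baseline)
print("regressions:", regressions)
print("newly passing:", newly_passing)
print("exit code:", exit_code(regressions))
```

A baselined test that a shard never ran lands in neither list, which is presumably why the gate is evaluated per shard rather than against the whole baseline at once.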

The workflow also carries a hang watchdog that thread-dumps and kills a wedged
fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.
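
As an illustration of the watchdog idea, here is a sketch that assumes a fork counts as wedged once it exceeds a wall-clock timeout; the commit message does not say how the real workflow detects a hang, and `run_with_watchdog` with its parameters is a hypothetical helper. It relies on POSIX signals: a HotSpot JVM responds to SIGQUIT by printing a thread dump and continuing to run, so the sketch sends SIGQUIT first and SIGKILL after a short grace period.

```python
import signal
import subprocess
import sys
import time


def run_with_watchdog(cmd, timeout_s, grace_s=1.0):
    """Run cmd; if it outlives timeout_s, ask for a thread dump, then kill it.

    Returns (exit_code, timed_out). POSIX only (uses SIGQUIT).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # For a JVM this triggers a thread dump; other programs usually exit.
        proc.send_signal(signal.SIGQUIT)
        # Give the process a moment to write the dump before killing it.
        time.sleep(grace_s)
        proc.kill()  # no-op if the process has already exited
        return proc.wait(), True


if __name__ == "__main__":
    # Stand-in for a wedged test fork: a child process that sleeps too long.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_watchdog(hung, timeout_s=0.5, grace_s=0.2))
```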

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions


Run Gluten Clickhouse CI on x86

@felipepessoto force-pushed the delta_pipeline_fix_tests branch from 6953c7f to 987abe4 (June 26, 2026 16:08)
@felipepessoto force-pushed the delta_pipeline_fix_tests branch from 62a9d53 to 08c146a (June 27, 2026 05:01)
@github-actions (bot) added the VELOX label (Jun 27, 2026)
@felipepessoto force-pushed the delta_pipeline_fix_tests branch 2 times, most recently from 730c6ef to 8f7f17b (June 27, 2026 08:24)
felipepessoto and others added 2 commits June 27, 2026 08:24
Velox has no Arrow representation for VariantType, so the native columnar write
path -- which converts the incoming rows to Velox batches via
RowToVeloxColumnarExec.toArrowSchema -- throws
`UnsupportedOperationException: Unsupported data type: variant` at runtime. This
broke every Delta write whose schema contains a variant column (INSERT, UPDATE,
MERGE, OPTIMIZE/auto-compact, checkpoint-driven rewrites), since
GlutenOptimisticTransaction.writeFiles always offloaded the write to the native
writer (the now-removed code path built the Velox plan unconditionally).

Guard GlutenOptimisticTransaction.writeFiles: if the input schema contains a
variant at any nesting level, delegate to super.writeFiles (the vanilla Delta
write path) instead of offloading. Non-variant writes are unaffected. The check
matches by type name so it stays source-compatible across Spark versions.

Adds GlutenDeltaVariantWriteSuite covering top-level, struct-nested, and UPDATE
variant writes.
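
The guard described above amounts to a recursive walk over the schema that compares type names. The actual change lives in Scala inside GlutenOptimisticTransaction.writeFiles; the Python below is only a language-neutral sketch of the same idea. The `DataType` class is a hypothetical stand-in for Spark's type tree, and `choose_write_path` is an illustrative name, not a Gluten API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataType:
    """Hypothetical stand-in for a Spark data type: a name plus child types.

    Struct fields, array elements, and map keys/values are all modeled as children.
    """
    type_name: str
    children: List["DataType"] = field(default_factory=list)


def contains_variant(dt: DataType) -> bool:
    """True if dt is a variant or has a variant at any nesting level.

    Matching on the type name (a string) avoids referencing a VariantType
    class, which is what keeps such a check source-compatible across versions.
    """
    if dt.type_name == "variant":
        return True
    return any(contains_variant(child) for child in dt.children)


def choose_write_path(schema: DataType) -> str:
    """Fall back to the vanilla write path when the schema contains a variant."""
    return "vanilla" if contains_variant(schema) else "native"


flat = DataType("struct", [DataType("long"), DataType("string")])
nested = DataType(
    "struct",
    [DataType("long"), DataType("struct", [DataType("array", [DataType("variant")])])],
)

print(choose_write_path(flat))
print(choose_write_path(nested))
```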

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@felipepessoto force-pushed the delta_pipeline_fix_tests branch from 8f7f17b to d9291ba (June 27, 2026 08:24)
@felipepessoto closed this by deleting the head repository (Jun 27, 2026)