Skip to content

[FLINK-39991] Replace Flink ReplaceIntersectWithSemiJoinRule with Cal…#28533

Open
liuyongvs wants to merge 1 commit into
apache:masterfrom
liuyongvs:flink-39991
Open

[FLINK-39991] Replace Flink ReplaceIntersectWithSemiJoinRule with Cal…#28533
liuyongvs wants to merge 1 commit into
apache:masterfrom
liuyongvs:flink-39991

Conversation

@liuyongvs

Copy link
Copy Markdown
Contributor

What is the purpose of the change

This is follow up PR after Calcite upgrade to 1.40.0

There 1 Calcite classes which might be removed from Flink codebase

Brief change log

Replace ReplaceIntersectWithSemiJoinRule with Calcite IntersectToSemiJoinRule

Verifying this change

existing tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): ( no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? ( no)
  • If yes, how is the feature documented? (not applicable)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

@flinkbot

flinkbot commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

<Resource name="optimized exec plan">
<![CDATA[
Join(joinType=[LeftSemiJoin], where=[IS NOT DISTINCT FROM(random, random0)], select=[random], leftInputSpec=[JoinKeyContainsUniqueKey], rightInputSpec=[NoUniqueKey])
Join(joinType=[LeftSemiJoin], where=[((random IS NULL AND random0 IS NULL) OR (random = random0))], select=[random], leftInputSpec=[JoinKeyContainsUniqueKey], rightInputSpec=[NoUniqueKey])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments: IS NOT DISTINCT FROM(random, random0) is equal to ((random IS NULL AND random0 IS NULL) OR (random = random0))

@liuyongvs liuyongvs Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Root Cause Analysis of XML Test Differences
The Difference
After replacing Flink's ReplaceIntersectWithSemiJoinRule with Calcite's IntersectToSemiJoinRule, the join condition in XML test files changed from:

IS NOT DISTINCT FROM($0, $1)
to:
OR(AND(IS NULL($0), IS NULL($1)), =($0, $1))
Both forms are semantically equivalent.

Root Cause: collapseExpandedIsNotDistinctFromExpr Fails Due to Asymmetric removeNullabilityCast
The key difference between the two rules is how they construct the join condition, which determines whether RelBuilder.join() can collapse the expanded form back to IS NOT DISTINCT FROM.

Old Rule (Flink) — Collapse Succeeds
ReplaceIntersectWithSemiJoinRule calls SetOpRewriteUtil.generateEqualsCondition(), which constructs conditions using bare makeInputRef:

// SetOpRewriteUtil.java:59-71
RexNode leftRex  = rexBuilder.makeInputRef(leftTypes.get(key), key);
RexNode rightRex = rexBuilder.makeInputRef(rightTypes.get(key), leftTypes.size() + key);
relBuilder.or(
    rexBuilder.makeCall(EQUALS, leftRex, rightRex),       // =($0, $1)
    relBuilder.and(
        relBuilder.isNull(leftRex),                         // IS NULL($0)
        relBuilder.isNull(rightRex)));                      // IS NULL($1)

When RelBuilder.join() is called, it internally invokes collapseExpandedIsNotDistinctFromExpr (RelBuilder.java:3198). The IS_NULL operands ($0, $1) and the EQUALS operands ($0, $1) are identical — removeNullabilityCast has no effect on bare InputRef nodes. The equality check in doCollapseExpandedIsNotDistinctFrom (RelOptUtil.java:1883) passes, and the condition is collapsed to IS NOT DISTINCT FROM($0, $1).

New Rule (Calcite) — Collapse Fails
IntersectToSemiJoinRule calls builder.isNotDistinctFrom(), which internally calls RelOptUtil.isDistinctFromInternal() (RelOptUtil.java:2417-2425). This generates:

OR(AND(IS_NULL(CAST($0)), IS_NULL(CAST($1))),
IS_TRUE(=(CAST($0), CAST($1))))
The CAST nodes come from the rule's use of makeCast(leastFieldType, ...) (IntersectToSemiJoinRule.java:125-128) to align field types to the Intersect's leastRowType.

When collapseExpandedIsNotDistinctFromExpr processes this condition, it reaches doCollapseExpandedIsNotDistinctFrom (RelOptUtil.java:1873-1890):

// RelOptUtil.java:1875-1881
final RexNode isNullInput0 = ifNull0Call.getOperands().get(0);     // CAST($0) — no transform
final RexNode isNullInput1 = ifNull1Call.getOperands().get(0);     // CAST($1) — no transform
final RexNode equalsInput0 = RexUtil
    .removeNullabilityCast(rexBuilder.getTypeFactory(), equalsCall.getOperands().get(0)); // $0 — CAST removed!
final RexNode equalsInput1 = RexUtil
    .removeNullabilityCast(rexBuilder.getTypeFactory(), equalsCall.getOperands().get(1)); // $1 — CAST removed!

removeNullabilityCast strips CAST nodes that only change nullability (e.g., CAST($0 AS INT NOT NULL) → $0). It is applied only to the EQUALS operands, not to the IS_NULL operands. This creates an asymmetry:

isNullInput0 = CAST($0 AS type) (untouched)
equalsInput0 = $0 (CAST stripped by removeNullabilityCast)
Since CAST($0) ≠ $0, the equality check fails (RelOptUtil.java:1883-1884), and the collapse returns the original expression unchanged.

After the failed collapse, simplifyUnknownAsFalse (RelBuilder.java:3201) takes over:

IS_TRUE(=($0, $1)) → =($0, $1) (simplification removes IS_TRUE wrapper)
Remaining CAST nodes are also simplified away
The final output is: OR(AND(IS NULL($0), IS NULL($1)), =($0, $1))

@github-actions github-actions Bot added the community-reviewed PR has been reviewed by the community. label Jun 25, 2026
@liuyongvs liuyongvs force-pushed the flink-39991 branch 6 times, most recently from f5edddb to b592bc8 Compare June 25, 2026 10:06

// set operators
ReplaceIntersectWithSemiJoinRule.INSTANCE,
CoreRules.INTERSECT_TO_SEMI_JOIN,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My AI says

  • Verify that the expanded join condition doesn't negatively impact query performance in production workloads
  • Confirm that ReplaceMinusWithAntiJoinRule removal (mentioned but not shown) is equally safe
  • Consider adding a comment in the rule sets explaining why the Calcite rules are preferred over custom implementations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants