Skip to content

[FLINK-39998][state] Surface root cause exception when parallel RocksDB state download fails#28552

Open
sameer2800 wants to merge 1 commit into
apache:masterfrom
sameer2800:FLINK-39998/rocksdb-download-root-cause-exception
Open

[FLINK-39998][state] Surface root cause exception when parallel RocksDB state download fails#28552
sameer2800 wants to merge 1 commit into
apache:masterfrom
sameer2800:FLINK-39998/rocksdb-download-root-cause-exception

Conversation

@sameer2800

@sameer2800 sameer2800 commented Jun 26, 2026

Copy link
Copy Markdown

What is the purpose of the change

When parallel RocksDB state downloads fail, only the first exception to arrive is surfaced — all other thread failures are silently lost. This is non-deterministic: whichever thread happens to complete first "wins" the primary exception slot, which is often a cascade ClosedChannelException rather than the actual failure.
Consider a real scenario: a checkpoint file gets accidentally deleted from S3. On job restart, Flink keeps crashing. The only error visible in the logs is ClosedChannelException, which points to a local file write issue and gives no hint that a remote file is missing. The on-call engineer has no indication of which file is gone, which checkpoint is affected, or that S3 is involved at all — making diagnosis extremely painful.

Before:
Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
at org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle

After:
Caused by: java.io.IOException: 2 downloads failed with distinct errors:
[IOException: ClosedChannelException | IOException: FileNotFoundException:
No such file: s3://bucket/checkpoints/.../shared/16040859-c981-4407-bf8d-8fd6bbb66f6f]

The root cause: FutureUtils.completeAll() collects all thread failures as suppressed exceptions on a CompletionException. However, CompletableFuture.get() internally strips that wrapper
before throwing ExecutionException — so by the time the catch block runs, all parallel failures except one are permanently gone.

Brief change log

  • RocksDBStateDownloader — registers a whenComplete callback on the download future to capture the raw CompletionException (with full suppressed list) before CompletableFuture.get() strips it
  • Introduces buildDownloadException which strips the wrapper chain of each collected failure (CompletionException → RuntimeException → IOException), deduplicates by type and message, and returns either the single root cause directly or a merged exception listing all distinct failures with full stack traces preserved as suppressed entries

Verifying this change

This change added tests and can be verified as follows:

  • testSingleDownloadFailureSurfacedDirectly — verifies that a single download failure returns the root cause exception directly without a merged message
  • testRootCauseVisibleAmongCascadeFailures — verifies that the root cause IOException is visible in the error when one handle fails among many parallel ByteStreamStateHandle downloads
  • testMultipleDistinctFailuresMergedInMessage — verifies that distinct failures across N threads (coordinated via CyclicBarrier to prevent registry-closure cascade) all appear in the merged error message

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (improves error reporting on restore failure only)
  • The S3 file system connector: no

@flinkbot

flinkbot commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants