fix(sharding): enqueue delete for deferred updates moving out of migrating chunk #1819
Open
Abuhaithem wants to merge 1 commit into
…ating chunk

If a deferred transaction update moves a document out of a migrating chunk and the document is subsequently deleted before `_processDeferredXferMods` runs, the destination shard misses the deletion, leaving an orphaned document. This commit updates `_processDeferredXferMods` to model a delete if the document is no longer found and its pre-image was in the chunk.
This pull request addresses a severe data inconsistency bug where an orphaned document can be left on a migration destination shard.
When a chunk migration is ongoing, the migration cloner tracks updates to documents inside the chunk's range. For transactions prepared on the donor in a previous term but committed in the current term, the oplog entry lacks a `postImage` document key. The cloner handles these by deferring their processing via `_deferProcessingForXferMod` and resolving them later during `nextModsBatch` via `_processDeferredXferMods`.

However, a serious race condition exists if the deferred update moves a document out of the migrating chunk, and the document is subsequently deleted before `_processDeferredXferMods` can run:

1. `postImageDocKey` is empty, and the update is deferred.
2. `onDeleteOp` evaluates the deletion. Because the document's new shard key is out of bounds for the migrating chunk, `onDeleteOp` returns early and does not enqueue a deletion to `xferMods`.
3. `_processDeferredXferMods` runs. It attempts to fetch the document using `Helpers::findById`.
4. `findById` fails. The code previously assumed: "That delete would have been captured by the xferMods so nothing else to do here," and executed `continue`.

Because the deletion was skipped by `onDeleteOp` (as it occurred outside the chunk bounds), the destination shard is never instructed to delete the document. The migration completes with the document incorrectly cloned to the destination shard, resulting in an orphaned document.

The Fix:
This PR fixes the bug by having `_processDeferredXferMods` extract the shard key from the deferred update's `preImageDocKey`. If the `preImageDocKey` was within the bounds of the migrating chunk, a deletion is unconditionally enqueued for the recipient shard. This operation is fully idempotent and ensures the destination shard correctly deletes the document.

Impact & Severity
High Severity: Silently compromises data integrity for sharded clusters during migrations combined with failovers and prepared transactions. These orphaned documents could reappear in queries or violate unique constraints.
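For reviewers who want to reason about the race in isolation, the following is a minimal standalone model of the behavior described in this PR, not the actual server code. All names (`ChunkRange`, `DeferredMod`, `XferMod`, `onDelete`, `processDeferred`, the `applyFix` flag) are hypothetical, shard keys are reduced to integers, and the handling of the "document still exists" case is an assumption about the general shape of the logic.

```cpp
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins; none of these are the real server types.
struct ChunkRange {
    int min;  // inclusive lower bound
    int max;  // exclusive upper bound
    bool contains(int shardKey) const { return shardKey >= min && shardKey < max; }
};

enum class ModKind { kUpdate, kDelete };

struct XferMod {
    ModKind kind;
    int id;  // document _id
};

struct DeferredMod {
    int id;                // document _id
    int preImageShardKey;  // shard key taken from the pre-image document key
};

// Collection modeled as _id -> current shard key.
using Collection = std::map<int, int>;

// Models the early return described for onDeleteOp: a delete whose shard key
// lies outside the migrating chunk is not recorded.
void onDelete(int id, int shardKey, const ChunkRange& range, std::vector<XferMod>& xferMods) {
    if (!range.contains(shardKey)) {
        return;
    }
    xferMods.push_back({ModKind::kDelete, id});
}

// Models the deferred processing step. With applyFix == false the missing
// document is skipped (old behavior); with applyFix == true a delete is
// enqueued when the pre-image was inside the chunk (behavior proposed here).
void processDeferred(const std::vector<DeferredMod>& deferred,
                     const Collection& coll,
                     const ChunkRange& range,
                     bool applyFix,
                     std::vector<XferMod>& xferMods) {
    for (const auto& mod : deferred) {
        auto it = coll.find(mod.id);
        if (it == coll.end()) {
            // Document was deleted after the deferred update.
            if (applyFix && range.contains(mod.preImageShardKey)) {
                xferMods.push_back({ModKind::kDelete, mod.id});
            }
            continue;
        }
        // Assumed handling when the document still exists: the post-image
        // decides whether the recipient reloads or deletes it.
        if (range.contains(it->second)) {
            xferMods.push_back({ModKind::kUpdate, mod.id});
        } else if (range.contains(mod.preImageShardKey)) {
            xferMods.push_back({ModKind::kDelete, mod.id});
        }
    }
}
```

In this model, a document with pre-image shard key 50 in a chunk covering [0, 100) that is moved to key 150 and then deleted produces no entry from `onDelete`, and `processDeferred` enqueues nothing when `applyFix` is false. With `applyFix` set to true, a single delete for that `_id` is enqueued, which corresponds to the behavior this PR proposes.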
Type of Change
Testing