fix: resolve MTR data inconsistency caused by binlog rotation #1
Open
dnovitski wants to merge 4 commits into
…github#1684)

* Fix resume data loss: route heartbeat coords through applyEventsQueue

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied. If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table.

Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the coords bump. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant.

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix; also asserts queue ordering to guard against future changes that wrap the heartbeat enqueue in a goroutine.

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>

* Replace direct channel write with SendWithContext

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
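The ordering invariant in this commit message can be illustrated with a small model. This is a sketch under assumptions, not gh-ost code: `state`, `drain`, and `run` are hypothetical names, and the queue is a plain channel of functions standing in for applyEventsQueue. The point it shows is that a coordinate bump queued behind the DMLs can only run after they have been applied.

```go
package main

import "fmt"

// state is a hypothetical stand-in for the applier: the DMLs applied so far
// and the coordinates a checkpoint would persist.
type state struct {
	applied []string
	coords  string
}

// drain runs queued write functions strictly in FIFO order, as a single
// apply goroutine would.
func drain(queue chan func()) {
	close(queue)
	for fn := range queue {
		fn()
	}
}

// run enqueues two DMLs followed by a heartbeat and returns how many DMLs had
// been applied at the moment the coordinates were bumped.
func run() (appliedAtBump int, coords string) {
	s := &state{}
	queue := make(chan func(), 8)
	queue <- func() { s.applied = append(s.applied, "dml-1") }
	queue <- func() { s.applied = append(s.applied, "dml-2") }
	// The heartbeat does not write s.coords directly; it queues the bump
	// behind the DMLs that preceded it.
	queue <- func() {
		s.coords = "uuid:1-2" // illustrative value only
		appliedAtBump = len(s.applied)
	}
	drain(queue)
	return appliedAtBump, s.coords
}

func main() {
	n, coords := run()
	fmt.Printf("coords=%s recorded after %d DMLs applied\n", coords, n)
}
```

In this model a checkpoint taken at any point sees coordinates that never run ahead of the applied DMLs, which is the property the commit describes restoring.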
* Add Datadog/statsd with simple client emitting startup

* Add go runtime metrics to statsd reporting

---------

Co-authored-by: meiji163 <meiji163@github.com>
Adds parallel DML event processing via a coordinator that manages worker goroutines using MySQL's LOGICAL_CLOCK dependency tracking.

Key fixes for data inconsistency:
- Reset lowWaterMark on binlog rotation (sequence numbers are per-file)
- Drain all workers before resetting coordinator state
- Retry InnoDB deadlocks with jittered exponential backoff
- Propagate fatal errors via broadcast channel
- Use buffered wait channels to prevent deadlocks on error paths
- Guard all lowWaterMark reads with mutex
- Remove dead commented-out legacy EventsStreamer code
- Add deterministic rotation regression tests

Co-authored-by: meiji163 <meiji163@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit that referenced this pull request Jun 13, 2026
…r, cut per-chunk round-trips

The --chunk-concurrent-size parallel row-copy only ran the INSERTs in parallel; the boundary calculation and the per-chunk transaction overhead serialized work and capped the achievable speedup well below the hardware's parallel-insert ceiling. This addresses three of those caps.

Prefetch range producer (overlap serialized boundary calc with INSERTs):
- A single dedicated producer goroutine is the sole caller of CalculateNextIterationRangeEndValues and streams pre-computed ranges into a buffered channel, so boundary scans now overlap the parallel INSERTs of earlier work instead of stalling between batches.
- Split iterateChunks into iterateChunksSingle (unchanged single-threaded semantics) and iterateChunksConcurrent.
- Size the applier pool for concurrentSize + producer + headroom.

#1 Per-chunk round-trips (applier.go):
- ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single autocommit, multi-statement round-trip on one pinned connection. The applier pool already enables multiStatements + interpolateParams + autocommit; RowsAffected() reports the INSERT (last statement), and the optional SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1.

#2 Persistent worker pool (migrator.go):
- Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on the slowest chunk every N chunks) with continuous dispatch to an errgroup bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay saturated; the only barrier is at the quantum boundary. The time bound keeps executeWriteFuncs returning to apply binlog events and re-check throttling, preserving row-copy/event mutual exclusion. Checkpoints record the last contiguous completed range (not the producer's prefetched cursor), so resume restarts from fully-copied data.

Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s; chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Fixes intermittent data inconsistency in the multithreaded replication (MTR) coordinator introduced in github/gh-ost#1454.
Root Cause
MySQL's logical clock (last_committed, sequence_number) is per-binlog-file. When max_binlog_size triggers a binlog rotation, sequence_number resets to 1. However, the coordinator's lowWaterMark (lwm) was never reset; it retained the old file's high value (e.g., 65553). After rotation, all WaitForTransaction(lastCommitted) checks passed immediately (lwm >= lastCommitted was trivially true), causing transactions from the new binlog file to execute out of order.
Example of the bug
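A minimal model of the failure, offered as a sketch under assumptions rather than as gh-ost's coordinator: the `coordinator` type, `canRun`, and `resetForRotation` are hypothetical names, the model is single-threaded, and treating last_committed=0 as "no dependency" is an assumption of the model.

```go
package main

import "fmt"

// coordinator is a hypothetical, single-threaded model of the dependency
// check; it is not gh-ost's actual type.
type coordinator struct {
	// lowWaterMark is the sequence_number up to which every transaction
	// in the current binlog file has completed.
	lowWaterMark int64
}

// canRun models the check: a transaction may start once everything up to
// its last_committed value has completed. A last_committed of 0 is treated
// as having no dependency (an assumption of this model).
func (c *coordinator) canRun(lastCommitted int64) bool {
	return lastCommitted == 0 || c.lowWaterMark >= lastCommitted
}

// resetForRotation models the fix: forget the previous file's numbering.
func (c *coordinator) resetForRotation() {
	c.lowWaterMark = -1
}

func main() {
	// State left over from the end of the old binlog file.
	c := &coordinator{lowWaterMark: 65553}

	// In the new file, a transaction with last_committed=2 depends on
	// sequence_number 2, which has not been applied yet.
	fmt.Println("stale lwm, may run early:", c.canRun(2))

	c.resetForRotation()
	fmt.Println("after reset, may run early:", c.canRun(2))
}
```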
This caused dependent transactions to execute concurrently, resulting in wrong final values (e.g., k=5046 instead of k=5047).
Bugs Fixed
Bug 1: Binlog rotation state reset (THE ROOT CAUSE)
Problem: lowWaterMark was never reset on binlog rotation, so a stale lwm allowed out-of-order execution.
Fix: On RotateEvent, drain all busy workers, then reset lwm=-1 and clear the completedJobs/waitingJobs maps. The drain creates a barrier only at binlog file boundaries (acceptable overhead).
Bug 2: Silent error swallowing in DML apply
Problem: applyDMLEvents() errors were logged but silently discarded; MarkTransactionCompleted was called regardless, corrupting dependency tracking.
Fix: Retry InnoDB deadlocks with jittered exponential backoff (compare MySQL's slave_transaction_retries). Propagate fatal (non-retryable) errors via a broadcast channel (failedCh).
Bug 3: Wait channel deadlock on error paths
Problem: WaitForTransaction used unbuffered channels. If a waiter exited early via failedCh, the subsequent MarkTransactionCompleted send would block forever.
Fix: Use buffered wait channels so the send cannot block on error paths.
Bug 4: Data race on lwm read in RotateEvent handler
Problem: if c.lowWaterMark >= 0 was read without holding c.mu, racing with concurrent MarkTransactionCompleted calls.
Fix: Guard the read with c.mu.Lock()/c.mu.Unlock().
Verification
go build ./... ✅, go vet ./... ✅
Performance: MTR vs Baseline
Benchmarked with 200K rows, 1000 trx/s sysbench write load for 90 seconds:
Key finding: MTR provides ~19% improvement in total migration time. The fundamental bottleneck is executeWriteFuncs, which calls ProcessEventsUntilDrained() before each row-copy chunk; under high write load, the event queue fills continuously and row-copy gets starved regardless of worker count. MTR helps by draining the queue faster with parallel workers.
Files Changed
go/logic/coordinator.go: 189 insertions, 42 deletions
Known Limitation
buildDMLEventQuery in applier.go mutates dmlEvent.DML for unique-key UPDATE operations (sets it to DeleteDML then InsertDML, never restores it). This is a pre-existing bug that does not affect sysbench workloads (PK-only) but could cause issues with unique-key modifications. Not addressed in this PR.
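The kind of hazard described here can be pictured with a small model. Everything in it is hypothetical (`event`, `buildMutating`, `buildCopying`, `describe`), and the scenario of building queries twice from the same event is an assumption chosen for illustration, not something the PR states about gh-ost's call paths.

```go
package main

import "fmt"

type dmlType int

const (
	updateDML dmlType = iota
	insertDML
	deleteDML
)

// event is a hypothetical stand-in for a binlog DML event.
type event struct {
	dml              dmlType
	uniqueKeyChanged bool
}

// describe names the statement that would be built for the event.
func describe(e *event) string {
	return [...]string{"UPDATE", "INSERT", "DELETE"}[e.dml]
}

// buildMutating models the hazard: it rewrites e.dml while expanding a
// unique-key UPDATE into DELETE + INSERT and never restores it.
func buildMutating(e *event) []string {
	if e.dml == updateDML && e.uniqueKeyChanged {
		e.dml = deleteDML
		del := describe(e)
		e.dml = insertDML
		ins := describe(e)
		return []string{del, ins}
	}
	return []string{describe(e)}
}

// buildCopying performs the same expansion on local copies, leaving the
// caller's event untouched.
func buildCopying(e *event) []string {
	if e.dml == updateDML && e.uniqueKeyChanged {
		del, ins := *e, *e
		del.dml, ins.dml = deleteDML, insertDML
		return []string{describe(&del), describe(&ins)}
	}
	return []string{describe(e)}
}

func main() {
	e := &event{dml: updateDML, uniqueKeyChanged: true}
	fmt.Println(buildMutating(e), buildMutating(e))

	e = &event{dml: updateDML, uniqueKeyChanged: true}
	fmt.Println(buildCopying(e), buildCopying(e))
}
```

In the model, the second call to `buildMutating` sees an event already rewritten to an insert and yields a single statement, while `buildCopying` returns the same DELETE and INSERT pair on every call.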