Skip to content

fix: resolve MTR data inconsistency caused by binlog rotation#1

Open
dnovitski wants to merge 4 commits into
masterfrom
meiji163/parallel-repl
Open

fix: resolve MTR data inconsistency caused by binlog rotation#1
dnovitski wants to merge 4 commits into
masterfrom
meiji163/parallel-repl

Conversation

@dnovitski

Copy link
Copy Markdown
Owner

Summary

Fixes intermittent data inconsistency in the multithreaded replication (MTR) coordinator introduced in github/gh-ost#1454.

Root Cause

MySQL's logical clock (last_committed, sequence_number) is per-binlog-file. When max_binlog_size triggers a binlog rotation, sequence_number resets to 1. However, the coordinator's lowWaterMark (lwm) was never reset — it retained the old file's high value (e.g., 65553). After rotation, all WaitForTransaction(lastCommitted) checks passed immediately (lwm >= lastCommitted trivially true), causing transactions from the new binlog file to execute out of order.

Example of the bug

Before rotation: lwm = 65553
After rotation:  sequence numbers restart at 1, 2, 3, ...
Transaction with lastCommitted=5 → check: 65553 >= 5 → TRUE (should wait!)

This caused dependent transactions to execute concurrently, resulting in wrong final values (e.g., k=5046 instead of k=5047).

Bugs Fixed

Bug 1: Binlog rotation state reset (THE ROOT CAUSE)

  • Problem: lowWaterMark never reset on binlog rotation → stale lwm allows out-of-order execution
  • Fix: Initialize lwm to -1 (sentinel for "uninitialized"). On RotateEvent, drain all busy workers, then reset lwm=-1 and clear completedJobs/waitingJobs maps. The drain creates a barrier only at binlog file boundaries (acceptable overhead).

Bug 2: Silent error swallowing in DML apply

  • Problem: applyDMLEvents() errors were logged but silently discarded; MarkTransactionCompleted was called regardless, corrupting dependency tracking
  • Fix: Retry InnoDB deadlocks (error 1213) and lock wait timeouts (error 1205) with jittered exponential backoff (up to 100 retries, matching MySQL's slave_transaction_retries). Propagate fatal (non-retryable) errors via a broadcast channel (failedCh).

Bug 3: Wait channel deadlock on error paths

  • Problem: WaitForTransaction used unbuffered channels. If a waiter exited early via failedCh, the subsequent MarkTransactionCompleted send would block forever.
  • Fix: Use buffered channels (capacity 1) so the send never blocks.

Bug 4: Data race on lwm read in RotateEvent handler

  • Problem: if c.lowWaterMark >= 0 was read without holding c.mu, racing with concurrent MarkTransactionCompleted calls.
  • Fix: Guard the read with c.mu.Lock()/c.mu.Unlock().

Verification

  • 20+ consistency test iterations passed at rate=1200 trx/s, 4 workers, 90s sysbench load (was ~60% failure rate before fix)
  • Independent verification by second agent: 48-minute stress test, 10 binlog rotations, 0 data mismatches
  • go build ./... ✅, go vet ./...

Performance: MTR vs Baseline

Benchmarked with 200K rows, 1000 trx/s sysbench write load for 90 seconds:

Configuration Total Time DML Events/s Row-copy starvation
Baseline (no MTR) 166s 1,463/s ~150s (0 rows copied during load!)
MTR, 4 workers 135s 2,049/s ~120s
MTR, 4 workers, batch=50 150s 2,156/s ~135s

Key finding: MTR provides ~19% improvement in total migration time. The fundamental bottleneck is executeWriteFuncs which calls ProcessEventsUntilDrained() before each row-copy chunk — under high write load, the event queue fills continuously and row-copy gets starved regardless of worker count. MTR helps by draining the queue faster with parallel workers.

Files Changed

  • go/logic/coordinator.go — 189 insertions, 42 deletions

Known Limitation

buildDMLEventQuery in applier.go mutates dmlEvent.DML for unique-key UPDATE operations (sets to DeleteDML then InsertDML, never restores). This is a pre-existing bug that does not affect sysbench workloads (PK-only) but could cause issues with unique-key modifications. Not addressed in this PR.

grodowski and others added 3 commits May 26, 2026 10:48
…github#1684)

* Fix resume data loss: route heartbeat coords through applyEventsQueue

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly
from the streamer goroutine, before any DML that preceded the heartbeat was
applied to the ghost table. The checkpoint loop reads CurrentCoordinates as
"applied through this GTID" and could persist a checkpoint whose
LastTrxCoords was ahead of what was actually applied.

If gh-ost crashed before applyEventsQueue drained, --resume read that
checkpoint and called StartSyncGTID with the persisted set; MySQL treated
the un-applied GTIDs as already-seen and never re-streamed them. The ghost
table silently lost those DMLs and cut-over produced a stale table.

Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the
coords bump. The apply goroutine executes it in order, after the DMLs the
streamer enqueued before the heartbeat, restoring the invariant.

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at
the previous HEAD and passes after the fix; also asserts queue ordering to
guard against future changes that wrap the heartbeat enqueue in a goroutine.

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>

* Replace direct channel write with SendWithContext

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Add Datadog/statsd with simple client emitting startup

* Add go runtime metrics to statsd reporting

---------

Co-authored-by: meiji163 <meiji163@github.com>
@dnovitski dnovitski force-pushed the meiji163/parallel-repl branch from 4b099b0 to 5705044 Compare May 28, 2026 10:02
Adds parallel DML event processing via a coordinator that manages
worker goroutines using MySQL's LOGICAL_CLOCK dependency tracking.

Key fixes for data inconsistency:
- Reset lowWaterMark on binlog rotation (sequence numbers are per-file)
- Drain all workers before resetting coordinator state
- Retry InnoDB deadlocks with jittered exponential backoff
- Propagate fatal errors via broadcast channel
- Use buffered wait channels to prevent deadlocks on error paths
- Guard all lowWaterMark reads with mutex
- Remove dead commented-out legacy EventsStreamer code
- Add deterministic rotation regression tests

Co-authored-by: meiji163 <meiji163@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dnovitski dnovitski force-pushed the meiji163/parallel-repl branch from 5705044 to 8db5fbe Compare May 28, 2026 10:07
dnovitski added a commit that referenced this pull request Jun 13, 2026
…r, cut per-chunk round-trips

The --chunk-concurrent-size parallel row-copy only ran the INSERTs in
parallel; the boundary calculation and the per-chunk transaction overhead
serialized work and capped the achievable speedup well below the hardware's
parallel-insert ceiling. This addresses three of those caps.

Prefetch range producer (overlap serialized boundary calc with INSERTs):
- A single dedicated producer goroutine is the sole caller of
  CalculateNextIterationRangeEndValues and streams pre-computed ranges into a
  buffered channel, so boundary scans now overlap the parallel INSERTs of
  earlier work instead of stalling between batches.
- Split iterateChunks into iterateChunksSingle (unchanged single-threaded
  semantics) and iterateChunksConcurrent.
- Size the applier pool for concurrentSize + producer + headroom.

#1 Per-chunk round-trips (applier.go):
- ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four
  round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single
  autocommit, multi-statement round-trip on one pinned connection. The applier
  pool already enables multiStatements + interpolateParams + autocommit;
  RowsAffected() reports the INSERT (last statement), and the optional
  SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1.

#2 Persistent worker pool (migrator.go):
- Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on
  the slowest chunk every N chunks) with continuous dispatch to an errgroup
  bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay
  saturated; the only barrier is at the quantum boundary. The time bound keeps
  executeWriteFuncs returning to apply binlog events and re-check throttling,
  preserving row-copy/event mutual exclusion.

Checkpoints record the last contiguous completed range (not the producer's
prefetched cursor), so resume restarts from fully-copied data.

Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time
vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s;
chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dnovitski added a commit that referenced this pull request Jun 13, 2026
…r, cut per-chunk round-trips

The --chunk-concurrent-size parallel row-copy only ran the INSERTs in
parallel; the boundary calculation and the per-chunk transaction overhead
serialized work and capped the achievable speedup well below the hardware's
parallel-insert ceiling. This addresses three of those caps.

Prefetch range producer (overlap serialized boundary calc with INSERTs):
- A single dedicated producer goroutine is the sole caller of
  CalculateNextIterationRangeEndValues and streams pre-computed ranges into a
  buffered channel, so boundary scans now overlap the parallel INSERTs of
  earlier work instead of stalling between batches.
- Split iterateChunks into iterateChunksSingle (unchanged single-threaded
  semantics) and iterateChunksConcurrent.
- Size the applier pool for concurrentSize + producer + headroom.

#1 Per-chunk round-trips (applier.go):
- ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four
  round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single
  autocommit, multi-statement round-trip on one pinned connection. The applier
  pool already enables multiStatements + interpolateParams + autocommit;
  RowsAffected() reports the INSERT (last statement), and the optional
  SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1.

#2 Persistent worker pool (migrator.go):
- Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on
  the slowest chunk every N chunks) with continuous dispatch to an errgroup
  bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay
  saturated; the only barrier is at the quantum boundary. The time bound keeps
  executeWriteFuncs returning to apply binlog events and re-check throttling,
  preserving row-copy/event mutual exclusion.

Checkpoints record the last contiguous completed range (not the producer's
prefetched cursor), so resume restarts from fully-copied data.

Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time
vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s;
chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants