fix: resolve MTR data inconsistency caused by binlog rotation by dnovitski · Pull Request #1 · dnovitski/gh-ost

dnovitski · 2026-04-29T00:43:48Z

Summary

Fixes intermittent data inconsistency in the multithreaded replication (MTR) coordinator introduced in github/gh-ost#1454.

Root Cause

MySQL's logical clock (last_committed, sequence_number) is per-binlog-file. When max_binlog_size triggers a binlog rotation, sequence_number resets to 1. However, the coordinator's lowWaterMark (lwm) was never reset — it retained the old file's high value (e.g., 65553). After rotation, all WaitForTransaction(lastCommitted) checks passed immediately (lwm >= lastCommitted trivially true), causing transactions from the new binlog file to execute out of order.

Example of the bug

Before rotation: lwm = 65553
After rotation:  sequence numbers restart at 1, 2, 3, ...
Transaction with lastCommitted=5 → check: 65553 >= 5 → TRUE (should wait!)

This caused dependent transactions to execute concurrently, resulting in wrong final values (e.g., k=5046 instead of k=5047).

Bugs Fixed

Bug 1: Binlog rotation state reset (THE ROOT CAUSE)

Problem: lowWaterMark never reset on binlog rotation → stale lwm allows out-of-order execution
Fix: Initialize lwm to -1 (sentinel for "uninitialized"). On RotateEvent, drain all busy workers, then reset lwm=-1 and clear completedJobs/waitingJobs maps. The drain creates a barrier only at binlog file boundaries (acceptable overhead).

Bug 2: Silent error swallowing in DML apply

Problem: applyDMLEvents() errors were logged but silently discarded; MarkTransactionCompleted was called regardless, corrupting dependency tracking
Fix: Retry InnoDB deadlocks (error 1213) and lock wait timeouts (error 1205) with jittered exponential backoff (up to 100 retries, matching MySQL's slave_transaction_retries). Propagate fatal (non-retryable) errors via a broadcast channel (failedCh).

Bug 3: Wait channel deadlock on error paths

Problem: WaitForTransaction used unbuffered channels. If a waiter exited early via failedCh, the subsequent MarkTransactionCompleted send would block forever.
Fix: Use buffered channels (capacity 1) so the send never blocks.

Bug 4: Data race on lwm read in RotateEvent handler

Problem: if c.lowWaterMark >= 0 was read without holding c.mu, racing with concurrent MarkTransactionCompleted calls.
Fix: Guard the read with c.mu.Lock()/c.mu.Unlock().

Verification

20+ consistency test iterations passed at rate=1200 trx/s, 4 workers, 90s sysbench load (was ~60% failure rate before fix)
Independent verification by second agent: 48-minute stress test, 10 binlog rotations, 0 data mismatches
go build ./... ✅, go vet ./... ✅

Performance: MTR vs Baseline

Benchmarked with 200K rows, 1000 trx/s sysbench write load for 90 seconds:

Configuration	Total Time	DML Events/s	Row-copy starvation
Baseline (no MTR)	166s	1,463/s	~150s (0 rows copied during load!)
MTR, 4 workers	135s	2,049/s	~120s
MTR, 4 workers, batch=50	150s	2,156/s	~135s

Key finding: MTR provides ~19% improvement in total migration time. The fundamental bottleneck is executeWriteFuncs which calls ProcessEventsUntilDrained() before each row-copy chunk — under high write load, the event queue fills continuously and row-copy gets starved regardless of worker count. MTR helps by draining the queue faster with parallel workers.

Files Changed

go/logic/coordinator.go — 189 insertions, 42 deletions

Known Limitation

buildDMLEventQuery in applier.go mutates dmlEvent.DML for unique-key UPDATE operations (sets to DeleteDML then InsertDML, never restores). This is a pre-existing bug that does not affect sysbench workloads (PK-only) but could cause issues with unique-key modifications. Not addressed in this PR.

…github#1684) * Fix resume data loss: route heartbeat coords through applyEventsQueue onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied. If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table. Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the coords bump. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant. Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix; also asserts queue ordering to guard against future changes that wrap the heartbeat enqueue in a goroutine. Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com> * Replace direct channel write with SendWithContext Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Add Datadog/statsd with simple client emitting startup * Add go runtime metrics to statsd reporting --------- Co-authored-by: meiji163 <meiji163@github.com>

Adds parallel DML event processing via a coordinator that manages worker goroutines using MySQL's LOGICAL_CLOCK dependency tracking. Key fixes for data inconsistency: - Reset lowWaterMark on binlog rotation (sequence numbers are per-file) - Drain all workers before resetting coordinator state - Retry InnoDB deadlocks with jittered exponential backoff - Propagate fatal errors via broadcast channel - Use buffered wait channels to prevent deadlocks on error paths - Guard all lowWaterMark reads with mutex - Remove dead commented-out legacy EventsStreamer code - Add deterministic rotation regression tests Co-authored-by: meiji163 <meiji163@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…r, cut per-chunk round-trips The --chunk-concurrent-size parallel row-copy only ran the INSERTs in parallel; the boundary calculation and the per-chunk transaction overhead serialized work and capped the achievable speedup well below the hardware's parallel-insert ceiling. This addresses three of those caps. Prefetch range producer (overlap serialized boundary calc with INSERTs): - A single dedicated producer goroutine is the sole caller of CalculateNextIterationRangeEndValues and streams pre-computed ranges into a buffered channel, so boundary scans now overlap the parallel INSERTs of earlier work instead of stalling between batches. - Split iterateChunks into iterateChunksSingle (unchanged single-threaded semantics) and iterateChunksConcurrent. - Size the applier pool for concurrentSize + producer + headroom. #1 Per-chunk round-trips (applier.go): - ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single autocommit, multi-statement round-trip on one pinned connection. The applier pool already enables multiStatements + interpolateParams + autocommit; RowsAffected() reports the INSERT (last statement), and the optional SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1. #2 Persistent worker pool (migrator.go): - Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on the slowest chunk every N chunks) with continuous dispatch to an errgroup bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay saturated; the only barrier is at the quantum boundary. The time bound keeps executeWriteFuncs returning to apply binlog events and re-check throttling, preserving row-copy/event mutual exclusion. Checkpoints record the last contiguous completed range (not the producer's prefetched cursor), so resume restarts from fully-copied data. Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s; chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dnovitski mentioned this pull request Apr 29, 2026

Multithreaded replication WIP github/gh-ost#1454

Draft

2 tasks

dnovitski force-pushed the meiji163/parallel-repl branch 6 times, most recently from ad143d3 to 4b099b0 Compare April 29, 2026 01:37

This was referenced Apr 29, 2026

Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle #2

Open

Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle github/gh-ost#1665

Open

grodowski and others added 3 commits May 26, 2026 10:48

Add Datadog/statsd with simple client emitting startup (github#1689)

e59af3a

Add Go runtime metrics to statsd reporting (github#1690)

59f400f

* Add Datadog/statsd with simple client emitting startup * Add go runtime metrics to statsd reporting --------- Co-authored-by: meiji163 <meiji163@github.com>

dnovitski force-pushed the meiji163/parallel-repl branch from 4b099b0 to 5705044 Compare May 28, 2026 10:02

dnovitski force-pushed the meiji163/parallel-repl branch from 5705044 to 8db5fbe Compare May 28, 2026 10:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve MTR data inconsistency caused by binlog rotation#1

fix: resolve MTR data inconsistency caused by binlog rotation#1
dnovitski wants to merge 4 commits into
masterfrom
meiji163/parallel-repl

dnovitski commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dnovitski commented Apr 29, 2026

Summary

Root Cause

Example of the bug

Bugs Fixed

Bug 1: Binlog rotation state reset (THE ROOT CAUSE)

Bug 2: Silent error swallowing in DML apply

Bug 3: Wait channel deadlock on error paths

Bug 4: Data race on lwm read in RotateEvent handler

Verification

Performance: MTR vs Baseline

Files Changed

Known Limitation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants