
Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle #2

Open
dnovitski wants to merge 1 commit into master from perf/parallel-rowcopy

dnovitski (Owner) commented Apr 29, 2026

Performance Optimizations for gh-ost

Note: This PR incorporates and supersedes the changes from #1 (multithreaded replication data inconsistency fix).

This PR adds several performance optimizations to gh-ost that significantly speed up row-copy under high write load while keeping binlog lag bounded.

Features

1. Parallel Row-Copy (inspired by feat concurrent chunk data #1398)

  • --copy-concurrency=N — parallel row-copy workers (default 1)
  • Bounded drain budget gives row-copy more execution turns instead of blocking indefinitely on DML drain
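
The bounded drain can be pictured with a minimal sketch. It assumes a queue of pending DML events and the 50ms budget mentioned in the HeartbeatLag analysis of this PR; `drainWithBudget` and its signature are hypothetical names for illustration, not gh-ost's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// drainWithBudget applies queued events until the queue is empty or the
// time budget is used up, then returns so that row-copy gets a turn.
// Hypothetical helper: the name and signature are illustrative only.
func drainWithBudget(events <-chan string, apply func(string), budget time.Duration) int {
	deadline := time.Now().Add(budget)
	applied := 0
	for time.Now().Before(deadline) {
		select {
		case ev := <-events:
			apply(ev)
			applied++
		default:
			// Queue is empty: stop early instead of waiting for more events.
			return applied
		}
	}
	// Budget exhausted: leave the remaining events for the next drain turn.
	return applied
}

func main() {
	events := make(chan string, 4)
	events <- "INSERT id=1"
	events <- "UPDATE id=1"

	applied := drainWithBudget(events, func(ev string) {
		fmt.Println("apply:", ev)
	}, 50*time.Millisecond)
	fmt.Println("applied", applied, "events; row-copy gets the next turn")
}
```

The trade-off is the one described later in this PR: a short budget favors row-copy, so lag can grow unless something else (here, the heartbeat lag throttle) bounds it.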

2. DML Event Merging (inspired by feat binlog apply optimization #1378)

  • Merges redundant DML events for the same row before applying
  • Under high write load, reduces applied statements by ~36%
  • Example: INSERT + UPDATE + UPDATE → single INSERT with final values
  • Disable with --skip-dml-merge
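
To make the merge rule concrete, here is a minimal sketch that folds a later UPDATE into the pending INSERT or UPDATE for the same row. The `Event` type, the `mergeEvents` function, and the integer key are assumptions made for illustration; the PR's actual rules (for example INSERT/DELETE cancellation) are not reproduced here, and DELETE events are conservatively left unmerged.

```go
package main

import "fmt"

// Op is the kind of a DML event (hypothetical type for this sketch).
type Op int

const (
	Insert Op = iota
	Update
	Delete
)

// Event is a simplified DML event keyed by a single integer primary key.
type Event struct {
	Op     Op
	Key    int
	Values map[string]string // column values after the change
}

// mergeEvents folds each UPDATE into the latest pending INSERT or UPDATE
// for the same key. It assumes UPDATEs never change the key itself.
func mergeEvents(batch []Event) []Event {
	out := make([]Event, 0, len(batch))
	latest := make(map[int]int) // key -> index in out of the latest event
	for _, ev := range batch {
		if i, ok := latest[ev.Key]; ok && ev.Op == Update && out[i].Op != Delete {
			// Copy before overlaying so the caller's events are not mutated.
			merged := make(map[string]string, len(out[i].Values)+len(ev.Values))
			for col, val := range out[i].Values {
				merged[col] = val
			}
			for col, val := range ev.Values {
				merged[col] = val
			}
			out[i].Values = merged
			continue
		}
		latest[ev.Key] = len(out)
		out = append(out, ev)
	}
	return out
}

func main() {
	batch := []Event{
		{Op: Insert, Key: 1, Values: map[string]string{"name": "a"}},
		{Op: Update, Key: 1, Values: map[string]string{"name": "b"}},
		{Op: Update, Key: 1, Values: map[string]string{"name": "c"}},
	}
	for _, ev := range mergeEvents(batch) {
		fmt.Println(ev.Op, ev.Key, ev.Values)
	}
}
```

Copying the value map before folding keeps the input events intact, which is the same concern as the shared `DMLEvent` mutation listed under Bug Fixes.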

3. Frontier Filter (inspired by feat binlog apply optimization #1378)

  • Skips DML events for rows not yet copied (row-copy will capture latest value)
  • Reduces redundant work during the copy phase
  • Only active when --copy-concurrency=1 (single-copy). With parallel copy, multiple chunks are in flight at once and may not have committed yet, so the frontier position is not a reliable boundary and skipping events beyond it is unsafe
  • Automatically disabled in replica modes (TestOnReplica/MigrateOnReplica). In replica mode, binlog events are read from the replica's relay log and may be ahead of the SQL thread's apply position, so row-copy SELECT queries may not yet see the data from skipped events, which would cause silent data loss
  • Disable with --skip-dml-frontier-filter
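
The conditions above can be summarized in a small decision function. This is a sketch under simplifying assumptions: a single integer key, and a frontier defined as the highest key value row-copy has already committed. `FilterConfig` and `shouldApply` are hypothetical names, not identifiers from the PR.

```go
package main

import "fmt"

// FilterConfig holds the settings that decide whether the frontier filter
// may run at all (hypothetical struct for this sketch).
type FilterConfig struct {
	CopyConcurrency int  // value of --copy-concurrency
	ReplicaMode     bool // TestOnReplica or MigrateOnReplica
	Skip            bool // --skip-dml-frontier-filter
}

// active reports whether skipping events beyond the frontier is allowed.
func (c FilterConfig) active() bool {
	return !c.Skip && !c.ReplicaMode && c.CopyConcurrency == 1
}

// shouldApply reports whether a DML event for the given key must be applied.
// Events beyond the frontier are skipped only while the filter is active,
// because row-copy will read the latest value of those rows later.
func shouldApply(c FilterConfig, key, frontier int64) bool {
	if !c.active() {
		return true
	}
	return key <= frontier
}

func main() {
	single := FilterConfig{CopyConcurrency: 1}
	parallel := FilterConfig{CopyConcurrency: 4}

	fmt.Println(shouldApply(single, 500, 1000))    // already copied: apply
	fmt.Println(shouldApply(single, 1500, 1000))   // not yet copied: skip
	fmt.Println(shouldApply(parallel, 1500, 1000)) // filter inactive: apply
}
```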

4. Heartbeat Lag Throttle

  • --copy-max-lag-millis (default 60000) prevents unbounded binlog lag growth during parallel row-copy
  • When HeartbeatLag exceeds threshold, pauses row-copy and drains exclusively
  • Resumes at threshold/2 (hysteresis prevents oscillation)
  • Set to 0 to disable (maximum copy speed, unbounded lag)
  • See documentation for detailed comparison with --max-lag-millis
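
The pause and resume behavior is a standard hysteresis loop, sketched below. `LagThrottle` and `Update` are hypothetical names; only the threshold, the threshold/2 resume point, and the "0 disables" rule come from the description above.

```go
package main

import "fmt"

// LagThrottle pauses row-copy when heartbeat lag exceeds the threshold and
// resumes it once lag has fallen to half the threshold (hypothetical type).
type LagThrottle struct {
	ThresholdMillis int64 // value of --copy-max-lag-millis; 0 disables
	paused          bool
}

// Update records the current lag and reports whether row-copy should pause.
func (t *LagThrottle) Update(lagMillis int64) bool {
	if t.ThresholdMillis == 0 {
		t.paused = false
		return false
	}
	if !t.paused && lagMillis > t.ThresholdMillis {
		t.paused = true
	} else if t.paused && lagMillis <= t.ThresholdMillis/2 {
		t.paused = false
	}
	return t.paused
}

func main() {
	throttle := &LagThrottle{ThresholdMillis: 60000}
	for _, lag := range []int64{59000, 61000, 45000, 30000, 45000} {
		fmt.Printf("lag=%dms paused=%v\n", lag, throttle.Update(lag))
	}
}
```

Between threshold/2 and the threshold the previous state is kept, which is what prevents the throttle from flapping when lag hovers near a single cutoff.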

Runtime-Changeable Flags

  • copy-concurrency=<N> — change parallel copy workers at runtime (range 1-32)
  • copy-max-lag-millis=<N> — change heartbeat lag threshold at runtime (0 = disable)
  • See interactive commands documentation for usage
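
As an illustration of the range check on the runtime command, here is a minimal parser sketch. `parseCopyConcurrency` is a hypothetical helper; the command text and the 1-32 range come from the list above, while the error messages and whitespace handling are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCopyConcurrency parses an interactive command of the form
// "copy-concurrency=<N>" and enforces the 1-32 range.
// Hypothetical helper, not the PR's actual command handler.
func parseCopyConcurrency(cmd string) (int, error) {
	name, value, found := strings.Cut(cmd, "=")
	if !found || strings.TrimSpace(name) != "copy-concurrency" {
		return 0, fmt.Errorf("unrecognized command: %q", cmd)
	}
	n, err := strconv.Atoi(strings.TrimSpace(value))
	if err != nil {
		return 0, fmt.Errorf("copy-concurrency: %q is not an integer", value)
	}
	if n < 1 || n > 32 {
		return 0, fmt.Errorf("copy-concurrency: %d is outside the range 1-32", n)
	}
	return n, nil
}

func main() {
	for _, cmd := range []string{"copy-concurrency=8", "copy-concurrency=64"} {
		n, err := parseCopyConcurrency(cmd)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", n)
	}
}
```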

Bug Fixes

  • Fixed buildDMLEventQuery DML mutation: UPDATE operations on unique-key tables no longer corrupt the shared DMLEvent object
  • Fixed frontier filter race in replica mode: Events read from binlog could be ahead of replica SQL thread position, causing missed changes
  • Fixed copy starvation with parallel row-copy: Unbuffered copyRowsQueue channel combined with HeartbeatLag sentinel value (before first heartbeat) caused copy to never get execution turns. Fixed with buffered channel and sentinel filtering

Benchmark Results (4-thread sysbench, 100K rows, 15-min runs)

| Configuration | Copy Time | Max HeartbeatLag | DML Events/sec | Result |
| --- | --- | --- | --- | --- |
| All features (no throttle) | 23s | 207s ⚠️ | 983 | PASS |
| All features + lag throttle (60s) | 41s | ~55s | ~950 | PASS |
| No DML merge | 71s | 262s | 722 | PASS |
| No frontier filter | 28s | 200s | 970 | PASS |
| Single-copy baseline | 10m47s | 6.6s | 905 | PASS |

Key takeaways:

  • Parallel copy with throttle: about 16x faster than the single-copy baseline (41s vs 10m47s)
  • HeartbeatLag stays bounded at ~55s (vs 207s without throttle)
  • DML merge provides ~36% more events/sec throughput (983 vs 722 events/sec)
  • All configurations pass data consistency checks (row counts, NULL PKs, duplicate PKs, checksums)
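
The ratios in the takeaways can be rechecked from the table with a few lines of arithmetic. The `speedup` helper is a hypothetical name; the inputs are the copy times and event rates reported above.

```go
package main

import (
	"fmt"
	"time"
)

// speedup returns how many times faster the candidate copy time is than
// the baseline, given durations in Go's duration syntax (e.g. "10m47s").
func speedup(baseline, candidate string) (float64, error) {
	b, err := time.ParseDuration(baseline)
	if err != nil {
		return 0, err
	}
	c, err := time.ParseDuration(candidate)
	if err != nil {
		return 0, err
	}
	return b.Seconds() / c.Seconds(), nil
}

func main() {
	throttled, err := speedup("10m47s", "41s")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled parallel copy vs baseline: %.1fx\n", throttled)

	// Events/sec with all features vs with DML merge disabled.
	gain := (983.0/722.0 - 1) * 100
	fmt.Printf("DML merge throughput gain: %.0f%%\n", gain)
}
```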

HeartbeatLag Analysis

Without the throttle, binlog lag grows without bound because the bounded drain (50ms budget) gives row-copy more turns at the expense of DML processing. The lag throttle resolves this:

  • During copy phase: lag may briefly reach threshold (~55-60s), then copy pauses
  • During throttle pause: exclusive DML drain brings lag back to ~30s (threshold/2)
  • After copy completes: DML catch-up drains remaining lag to 0 within minutes
  • At cutover: lag is always near 0 (normal gh-ost cutover behavior)

New CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| --copy-max-lag-millis | 60000 | Max heartbeat lag before throttling row-copy (0 = disabled) |
| --skip-dml-merge | false | Disable DML event merging (for benchmarking) |
| --skip-dml-frontier-filter | false | Disable frontier filter optimization (for benchmarking) |

Testing

  • All existing integration tests pass (MySQL 5.7, 8.0, 8.4, Percona 8.0)
  • New integration test for DML event merging (merge-dml-events)
  • New integration test for parallel row-copy with lag throttle (parallel-rowcopy-lag-throttle)
  • Unit tests for runtime-changeable flag commands (12 test cases)
  • 15-minute sysbench consistency tests under 4-thread concurrent write load
  • Data consistency validated: row counts, NULL PKs, unique PKs, checksums

dnovitski changed the title from "perf: parallel row-copy with dedicated connection pool and time-bounded drain" to "perf: parallel row-copy, DML event merging, and adaptive drain" Apr 29, 2026
dnovitski force-pushed the perf/parallel-rowcopy branch from 8b0acb3 to 9ab008d April 29, 2026 09:25
dnovitski changed the title from "perf: parallel row-copy, DML event merging, and adaptive drain" to "perf: Parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" Apr 29, 2026
dnovitski force-pushed the perf/parallel-rowcopy branch 3 times, most recently from dd5dfd9 to 8a5b648 April 29, 2026 20:12
dnovitski closed this Apr 29, 2026
dnovitski reopened this Apr 29, 2026
dnovitski force-pushed the perf/parallel-rowcopy branch from 8a5b648 to 3110d30 April 29, 2026 20:22
dnovitski changed the base branch from mtr-squashed to master April 29, 2026 20:23
dnovitski added a commit that referenced this pull request Apr 29, 2026
…ttle (#2)

Performance optimizations for gh-ost that significantly speed up row-copy
under high write load while keeping binlog lag bounded:

- Parallel row-copy with dedicated connection pool and time-bounded drain
- DML event merging within batches (INSERT/DELETE cancellation, UPDATE folding)
- Frontier filter to skip DML events beyond copy frontier
- Heartbeat lag throttle (--copy-max-lag-millis) for row-copy pacing
- Adaptive drain budget and auto-tuning chunk size
- Runtime-changeable --copy-concurrency and --copy-max-lag-millis
- Fix multithreaded replication data inconsistency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski force-pushed the perf/parallel-rowcopy branch from 3110d30 to ca7577a April 29, 2026 20:35
dnovitski changed the title from "perf: Parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" to "Multithreaded replication, parallel row-copy with DML merge, frontier filter, and heartbeat lag throttle" Apr 29, 2026
dnovitski added a commit that referenced this pull request May 23, 2026
dnovitski force-pushed the perf/parallel-rowcopy branch from ca7577a to a9ac404 May 23, 2026 21:17
dnovitski force-pushed the perf/parallel-rowcopy branch from a9ac404 to 79e3ff8 May 23, 2026 21:21
dnovitski force-pushed the perf/parallel-rowcopy branch from 79e3ff8 to 908d561 May 23, 2026 21:25
dnovitski added a commit that referenced this pull request Jun 13, 2026
…r, cut per-chunk round-trips

The --chunk-concurrent-size parallel row-copy only ran the INSERTs in
parallel; the boundary calculation and the per-chunk transaction overhead
serialized work and capped the achievable speedup well below the hardware's
parallel-insert ceiling. This addresses three of those caps.

Prefetch range producer (overlap serialized boundary calc with INSERTs):
- A single dedicated producer goroutine is the sole caller of
  CalculateNextIterationRangeEndValues and streams pre-computed ranges into a
  buffered channel, so boundary scans now overlap the parallel INSERTs of
  earlier work instead of stalling between batches.
- Split iterateChunks into iterateChunksSingle (unchanged single-threaded
  semantics) and iterateChunksConcurrent.
- Size the applier pool for concurrentSize + producer + headroom.
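
A minimal sketch of the prefetch pattern, assuming integer keys and evenly sized ranges: one producer goroutine computes ranges and streams them into a buffered channel while workers consume them. `Range` and `produceRanges` are hypothetical names, and the simple arithmetic stands in for the boundary query done by CalculateNextIterationRangeEndValues.

```go
package main

import (
	"fmt"
	"sync"
)

// Range is an inclusive key range for one chunk (hypothetical type).
type Range struct{ Min, Max int }

// produceRanges starts a single producer goroutine that streams chunk
// ranges into a buffered channel, so range computation overlaps with the
// workers' inserts. Closing done stops the producer early.
func produceRanges(done <-chan struct{}, maxKey, chunkSize, buffer int) <-chan Range {
	if chunkSize < 1 {
		chunkSize = 1
	}
	out := make(chan Range, buffer)
	go func() {
		defer close(out)
		for lo := 1; lo <= maxKey; lo += chunkSize {
			hi := lo + chunkSize - 1
			if hi > maxKey {
				hi = maxKey
			}
			select {
			case out <- Range{Min: lo, Max: hi}:
			case <-done:
				return
			}
		}
	}()
	return out
}

func main() {
	done := make(chan struct{})
	defer close(done)

	ranges := produceRanges(done, 10, 4, 2)
	var wg sync.WaitGroup
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for r := range ranges {
				fmt.Printf("worker %d copies [%d, %d]\n", id, r.Min, r.Max)
			}
		}(w)
	}
	wg.Wait()
}
```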

#1 Per-chunk round-trips (applier.go):
- ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four
  round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single
  autocommit, multi-statement round-trip on one pinned connection. The applier
  pool already enables multiStatements + interpolateParams + autocommit;
  RowsAffected() reports the INSERT (last statement), and the optional
  SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1.

#2 Persistent worker pool (migrator.go):
- Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on
  the slowest chunk every N chunks) with continuous dispatch to an errgroup
  bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay
  saturated; the only barrier is at the quantum boundary. The time bound keeps
  executeWriteFuncs returning to apply binlog events and re-check throttling,
  preserving row-copy/event mutual exclusion.
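
The continuous-dispatch idea can be sketched with the standard library alone. This example swaps in a buffered channel as a semaphore plus a sync.WaitGroup where the commit uses an errgroup with SetLimit, and it ignores error propagation; `runQuantum` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runQuantum dispatches chunks to at most limit concurrent workers until the
// quantum elapses or the chunk channel is closed, then waits for in-flight
// workers. The only barrier is at the quantum boundary. It returns the
// number of chunks completed in this quantum.
func runQuantum(chunks <-chan int, limit int, quantum time.Duration, copyChunk func(int)) int {
	sem := make(chan struct{}, limit) // stand-in for errgroup's SetLimit
	var wg sync.WaitGroup
	timer := time.NewTimer(quantum)
	defer timer.Stop()

	dispatched := 0
loop:
	for {
		select {
		case <-timer.C:
			break loop // quantum over: stop dispatching new chunks
		case chunk, ok := <-chunks:
			if !ok {
				break loop // no more chunks to copy
			}
			sem <- struct{}{} // blocks while limit workers are busy
			wg.Add(1)
			dispatched++
			go func(c int) {
				defer wg.Done()
				defer func() { <-sem }()
				copyChunk(c)
			}(chunk)
		}
	}
	wg.Wait() // barrier at the quantum boundary
	return dispatched
}

func main() {
	chunks := make(chan int, 8)
	for i := 0; i < 8; i++ {
		chunks <- i
	}
	close(chunks)

	n := runQuantum(chunks, 4, 200*time.Millisecond, func(int) {
		time.Sleep(10 * time.Millisecond) // simulated INSERT
	})
	fmt.Println("chunks completed this quantum:", n)
}
```

After the function returns, the caller is free to apply binlog events and re-check throttling before starting the next quantum.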

Checkpoints record the last contiguous completed range (not the producer's
prefetched cursor), so resume restarts from fully-copied data.
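
One way to track the last contiguous completed range when chunks finish out of order is sketched below. It assumes each chunk carries a sequence number starting at 0; `Checkpointer` and its methods are hypothetical names, not the commit's code.

```go
package main

import (
	"fmt"
	"sync"
)

// Checkpointer tracks which chunk sequence numbers have completed and
// exposes the last one that closes a gap-free prefix (hypothetical type).
type Checkpointer struct {
	mu   sync.Mutex
	next int          // lowest sequence number not yet completed
	done map[int]bool // completed sequence numbers at or above next
}

// NewCheckpointer returns a tracker with no completed chunks.
func NewCheckpointer() *Checkpointer {
	return &Checkpointer{done: make(map[int]bool)}
}

// Complete marks seq as finished and returns the last contiguous completed
// sequence number, or -1 if chunk 0 has not finished yet.
func (c *Checkpointer) Complete(seq int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.done[seq] = true
	for c.done[c.next] {
		delete(c.done, c.next)
		c.next++
	}
	return c.next - 1
}

func main() {
	cp := NewCheckpointer()
	for _, seq := range []int{2, 0, 1, 4, 3} {
		fmt.Printf("chunk %d done, safe checkpoint: %d\n", seq, cp.Complete(seq))
	}
}
```

Resuming from the returned value never skips an unfinished chunk, at the cost of possibly recopying chunks that finished beyond a gap.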

Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time
vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s;
chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>