-
Notifications
You must be signed in to change notification settings - Fork 15.3k
KAFKA-20514: kraft observers should use previous fetch response to decide where to send the next fetch #22111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Changes from 5 commits
8d99a03
a4ee904
48bb14b
325e11b
9d0e220
0bce334
d20007b
c5319ef
2bc3f54
7ab818a
493e2e0
4c72c98
4e91f4a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -106,6 +106,11 @@ public long remainingFetchTimeMs(long currentTimeMs) { | |
| return fetchTimer.remainingMs(); | ||
| } | ||
|
|
||
| public long remainingUpdateVoterSetTimeMs(long currentTimeMs) { | ||
| updateVoterSetPeriodTimer.update(currentTimeMs); | ||
| return updateVoterSetPeriodTimer.remainingMs(); | ||
| } | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See my other comment but ideally we should not have this method if is not used by the kraft implementation.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good, I can remove this. |
||
|
|
||
| public int leaderId() { | ||
| return leaderId; | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1725,6 +1725,13 @@ private boolean handleFetchResponse( | |
| leaderEndpoints = Endpoints.empty(); | ||
| } | ||
|
|
||
| maybeSwitchObserverFetchToLeader( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this transition not already handled by maybeHandleCommonResponse?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we store the leader information when becoming |
||
| responseEpoch, | ||
| responseLeaderId, | ||
| leaderEndpoints, | ||
| currentTimeMs | ||
| ); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you document the exact RPC trace for this issue? Can you document trace in trunk vs the trace in this branch? I am interested in the RPCs and kraft state transitions.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Sure, the trace is below. It is a very esoteric edge case, so let me know if you have any questions. Assume that the observer is unable to establish a TCP connection with the leader endpoint X. This means that The observer starts up either in the On trunk:
On this branch:
The motivation for the proposed behavior is that on
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do you need this special handler for FETCH? How about the other RPCs that follower send like FETCH_SNAPSHOT and UPDATE_VOTER?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am checking: in this method. This handler and logic only applies to nodes in the |
||
|
|
||
| Optional<Boolean> handled = maybeHandleCommonResponse( | ||
| error, | ||
| responseLeaderId, | ||
|
|
@@ -2608,6 +2615,32 @@ private boolean hasConsistentLeader(int epoch, OptionalInt leaderId) { | |
| } | ||
| } | ||
|
|
||
| /** | ||
| * If the local replica is an observer currently routing fetches to bootstrap servers, | ||
| * and a non-leader source's fetch response advertises the leader's endpoints, switch | ||
| * the observer back to fetching from the leader. | ||
| */ | ||
| private void maybeSwitchObserverFetchToLeader( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: slightly more descriptive name might be a tad better - for example |
||
| int responseEpoch, | ||
| OptionalInt responseLeaderId, | ||
| Endpoints leaderEndpoints, | ||
| long currentTimeMs | ||
| ) { | ||
| if (!hasConsistentLeader(responseEpoch, responseLeaderId)) { | ||
| throw new IllegalStateException("Received request or response with leader " + responseLeaderId + | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: we might also wish to include
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I followed the other usage of |
||
| " and epoch " + responseEpoch + " which is inconsistent with current leader " + | ||
| quorum.leaderId() + " and epoch " + quorum.epoch()); | ||
| } else if (responseEpoch == quorum.epoch() && quorum.isUnattached() && | ||
| responseLeaderId.isPresent() && !leaderEndpoints.isEmpty()) { | ||
| transitionToFollower( | ||
| responseEpoch, | ||
| responseLeaderId.getAsInt(), | ||
| leaderEndpoints, | ||
| currentTimeMs | ||
| ); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Handle response errors that are common across request types. | ||
| * | ||
|
|
@@ -3390,6 +3423,9 @@ private long pollFollowerAsObserver(FollowerState state, long currentTimeMs) { | |
| state.resetUpdateVoterSetPeriod(currentTimeMs); | ||
| } | ||
| return sendResult.timeToWaitMs(); | ||
| } else if (state.hasFetchTimeoutExpired(currentTimeMs)) { | ||
| transitionToUnattached(state.epoch(), OptionalInt.of(state.leaderId())); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should add a test which explicitly checks that the observer can transition to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added this to the existing unit test. |
||
| return 0L; | ||
| } else { | ||
| return maybeSendFetchToBestNode(state, currentTimeMs); | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,6 +38,7 @@ | |
|
|
||
| import java.nio.ByteBuffer; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Optional; | ||
| import java.util.OptionalInt; | ||
| import java.util.OptionalLong; | ||
|
|
@@ -765,4 +766,75 @@ void testUpdatedHighWatermarkCompleted() throws Exception { | |
| assertEquals(localLogEndOffset, partitionResponse.highWatermark()); | ||
| } | ||
| } | ||
|
|
||
| @Test | ||
| void testObserverFetchesBetweenLeaderAndBootstrapServers() throws Exception { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this fail without your change? If so, can you tell me exactly what fails?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, Looking at this again after a while, the for loop actually made this harder for me to read. I'm going to remove it.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I ran this test against this PR and I got this trace: Can we get a TRACE of the actual issue to make sure we are solving the correct problem? I am having a hard time understanding the actual problem so I am sure that this change solves that problem.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here is an updated trace from my most recent local changes: The scenario is: after the local node becomes |
||
| final var epoch = 2; | ||
| final var local = KafkaRaftClientTest.replicaKey( | ||
| KafkaRaftClientTest.randomReplicaId(), | ||
| true | ||
| ); | ||
| final var leader = KafkaRaftClientTest.replicaKey(local.id() + 1, true); | ||
| final var otherVoter = KafkaRaftClientTest.replicaKey(local.id() + 2, true); | ||
|
|
||
| final var voters = VoterSet.fromMap( | ||
| Map.of( | ||
| leader.id(), VoterSetTest.voterNode(leader), | ||
| otherVoter.id(), VoterSetTest.voterNode(otherVoter) | ||
| ) | ||
| ); | ||
|
|
||
| final var context = new RaftClientTestContext.Builder( | ||
| local.id(), | ||
| local.directoryId().get() | ||
| ) | ||
| .withStaticVoters(voters) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why use static voters? Why not enable and use all of the latest features? |
||
| .withBootstrapServers(Optional.of(List.of(RaftClientTestContext.mockAddress(otherVoter.id())))) | ||
| .withRaftProtocol(RaftClientTestContext.RaftProtocol.KIP_1166_PROTOCOL) | ||
| .build(); | ||
|
|
||
| for (int i = 0; i < 10; ++i) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a specific reason why this loop must run 10 times? It would be nice to add a comment clarifying why there must be 10 iterations.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not really, 10 is just a magic number. I just want to make sure we alternate between bootstrap endpoint + leader endpoint continuously. |
||
| // The observer initially fetches from the bootstrap servers, where it will discover the leader's endpoints. | ||
| context.pollUntilRequest(); | ||
| final var bootstrapFetch = context.assertSentFetchRequest(); | ||
| assertEquals(-2, bootstrapFetch.destination().id()); | ||
| assertEquals(RaftClientTestContext.mockAddress(otherVoter.id()).getHostName(), bootstrapFetch.destination().host()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: test would be clearer if we changed the variable name from
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed. |
||
| assertEquals(RaftClientTestContext.mockAddress(otherVoter.id()).getPort(), bootstrapFetch.destination().port()); | ||
|
|
||
| context.deliverResponse( | ||
| bootstrapFetch.correlationId(), | ||
| bootstrapFetch.destination(), | ||
| context.fetchResponse( | ||
| epoch, | ||
| leader.id(), | ||
| MemoryRecords.EMPTY, | ||
| 0L, | ||
| Errors.NOT_LEADER_OR_FOLLOWER | ||
| ) | ||
| ); | ||
|
|
||
| // Subsequent fetch from the observer is sent to the leader | ||
| context.pollUntilRequest(); | ||
| final var leaderFetch = context.assertSentFetchRequest(); | ||
| assertEquals(leader.id(), leaderFetch.destination().id()); | ||
| assertEquals(RaftClientTestContext.mockAddress(leader.id()).getHostName(), leaderFetch.destination().host()); | ||
| assertEquals(RaftClientTestContext.mockAddress(leader.id()).getPort(), leaderFetch.destination().port()); | ||
|
|
||
| // Return a BROKER_NOT_AVAILABLE error, and then advance time past the fetch timeout, | ||
| // which should cause the observer to fetch from the bootstrap servers on the next fetch. | ||
|
|
||
| // The fetch timeout is much greater than the request manager's configured backoff, so the | ||
| // current unreachable connection will no longer be backing off when the next fetch is sent. | ||
| context.deliverResponse( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC in the actual failure mode which causes this bug, we never get a response from the leader at all. It's completely disconnected. is that the same as receiving a
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, it is the same. The specific case is when the local node cannot establish a TCP connection with the leader endpoint. This causes the destination node to be considered "disconnected" and the RPC is never sent over the wire. If you trace through the In my opinion, there is less of a clear motivation to change the above behavior compared to allowing the observer to behave more like a voter. What do you think? |
||
| leaderFetch.correlationId(), | ||
| leaderFetch.destination(), | ||
| RaftUtil.errorResponse( | ||
| ApiKeys.FETCH, | ||
| Errors.BROKER_NOT_AVAILABLE | ||
| ) | ||
| ); | ||
| context.client.poll(); | ||
| context.time.sleep(context.fetchTimeoutMs + 1); | ||
| } | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2011,7 +2011,7 @@ public void testFollowerAsObserverDoesNotBecomeProspectiveAfterFetchTimeout(bool | |
|
|
||
| context.time.sleep(context.fetchTimeoutMs); | ||
| context.pollUntilRequest(); | ||
| assertTrue(context.client.quorum().isFollower()); | ||
| assertFalse(context.client.quorum().isProspective()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: do we with to add an assert for whether it becomes
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added. |
||
|
|
||
| // transitions to unattached | ||
| context.deliverRequest(context.voteRequest(epoch + 1, replicaKey(otherNodeId, withKip853Rpc), epoch, 1)); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this change for? it's only used in the testContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to fix some logic in
RaftClientTestContext#advanceTimeAndCompleteFetch, so that both timers won't be expired on a subsequent invocation of the helper method. I was originally using this method to write my test case, but ended up changing the test to not use this helper. I can remove it from the PR if it is confusing things.