KAFKA-19883: [DRAFT] KIP-1289 — Transactional acknowledgments for share groups (state machine + client API + wire schema) by Shekharrajak · Pull Request #22357 · apache/kafka

Shekharrajak · 2026-05-23T18:19:16Z

Share groups (KIP-932) have no equivalent of sendOffsetsToTransaction. There is no way to atomically bind a share-group acknowledgment to a producer transaction, blocking EOS for share-group-based read-process-write pipelines.

Description

This is the foundational layer for KIP-1289. It does not include the broker-side RPC handler (follow-up PR).

• RecordState: new TX_PENDING state (server module)
• InFlightState: transactional staging (server module)
• ShareGroupMetadata (clients module): ShareConsumer.shareGroupMetadata() fires a background event to read live membership state from ShareMembershipManager
• Wire schema (clients module): TxnShareAcknowledgeRequest / TxnShareAcknowledgeResponse (apiKey 93, v0)
• Producer EOS flow (clients module): Producer.sendShareAcknowledgementsToTransaction(

Tests

• RecordStateTest — full 6-state transition matrix including all TX_PENDING paths
• InFlightStateTxnTest

Follow-up

• Broker-side handleTransactionnShareAcknowledge handler in KafkaApis.scala
• ShareCoordinatorShard.replayEndTransactionMarker() real body
• SharePartition TX_PENDING acquisition exclusion
• IT tests (require broker handler)


Shekharrajak · 2026-05-24T04:09:56Z

    ARCHIVING((byte) 3),    // Per KIP-1191
-    ARCHIVED((byte) 4);
+    ARCHIVED((byte) 4),
+    TX_PENDING((byte) 5);   // Per KIP-1289: staged into an open producer transaction


TX_PENDING is the bridge state that holds the record between ACQUIRED and ACKNOWLEDGED (or ARCHIVED).

Shekharrajak · 2026-05-24T04:13:34Z


-        // Either the transition is from Available -> Acquired or from Acquired -> Available/
-        // Acknowledged/Archived.
+        if (newState == TX_PENDING && this != ACQUIRED) {


Validation for the TX_PENDING state: it can only be entered after ACQUIRED, before the record is acknowledged or archived.
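To make the rule concrete, here is a minimal sketch of a transition check with the new state. `SketchRecordState` is a hypothetical stand-in for `RecordState`, and the set of targets allowed out of TX_PENDING is an assumption based on the commit/abort behaviour discussed in this PR, not a copy of the actual change.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for RecordState; the transition table is an assumption.
enum SketchRecordState {
    AVAILABLE((byte) 0),
    ACQUIRED((byte) 1),
    ACKNOWLEDGED((byte) 2),
    ARCHIVING((byte) 3),
    ARCHIVED((byte) 4),
    TX_PENDING((byte) 5);

    final byte id;

    SketchRecordState(byte id) {
        this.id = id;
    }

    // Targets reachable from this state (assumed for illustration).
    Set<SketchRecordState> allowedTargets() {
        switch (this) {
            case AVAILABLE:
                return EnumSet.of(ACQUIRED);
            case ACQUIRED:
                return EnumSet.of(AVAILABLE, ACKNOWLEDGED, ARCHIVING, ARCHIVED, TX_PENDING);
            case TX_PENDING:
                // Commit materializes the staged ack; abort releases the record.
                return EnumSet.of(AVAILABLE, ACKNOWLEDGED, ARCHIVING, ARCHIVED);
            case ARCHIVING:
                return EnumSet.of(ARCHIVED);
            default:
                // ACKNOWLEDGED and ARCHIVED are terminal.
                return EnumSet.noneOf(SketchRecordState.class);
        }
    }

    SketchRecordState validateTransition(SketchRecordState newState) {
        // Mirrors the guard in the diff: TX_PENDING is only reachable from ACQUIRED.
        if (newState == TX_PENDING && this != ACQUIRED) {
            throw new IllegalStateException("TX_PENDING requires ACQUIRED, was " + this);
        }
        if (!allowedTargets().contains(newState)) {
            throw new IllegalStateException("Invalid transition " + this + " -> " + newState);
        }
        return newState;
    }
}
```

The explicit TX_PENDING guard is redundant with the table in this sketch; it is kept only to show where the check from the diff would sit.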

Shekharrajak · 2026-05-24T04:18:14Z

+        if (state != RecordState.TX_PENDING) {
+            return null;
+        }
+        if (this.stagedProducerId != producerId || this.stagedProducerEpoch != producerEpoch) {


We need fencing to ensure that only the right producer can confirm.
The confirmation can arrive in a different RPC call, from a different broker/server.
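A minimal sketch of that fencing check on a single record, assuming the staged producer id and epoch are stored when the record enters TX_PENDING. `StagedRecordSketch` and its methods are hypothetical names for illustration; only the two `if` conditions follow the diff hunk.

```java
// Hypothetical stand-in for the staging fields on InFlightState.
final class StagedRecordSketch {
    enum State { ACQUIRED, TX_PENDING, ACKNOWLEDGED, AVAILABLE }

    private State state = State.ACQUIRED;
    private long stagedProducerId = -1L;
    private short stagedProducerEpoch = -1;

    // Stage the record into an open transaction; only ACQUIRED records qualify.
    boolean stage(long producerId, short producerEpoch) {
        if (state != State.ACQUIRED) {
            return false;
        }
        state = State.TX_PENDING;
        stagedProducerId = producerId;
        stagedProducerEpoch = producerEpoch;
        return true;
    }

    // Returns the new state, or null when the marker does not apply to this record.
    State applyMarker(long producerId, short producerEpoch, boolean commit) {
        if (state != State.TX_PENDING) {
            return null; // nothing staged, e.g. a duplicate marker
        }
        if (stagedProducerId != producerId || stagedProducerEpoch != producerEpoch) {
            return null; // fenced: staged by a different producer id or epoch
        }
        state = commit ? State.ACKNOWLEDGED : State.AVAILABLE;
        stagedProducerId = -1L;
        stagedProducerEpoch = -1;
        return state;
    }

    State state() {
        return state;
    }
}
```

Returning null for both the "not staged" and "wrong producer" cases keeps a repeated marker harmless, at the cost of not telling the two cases apart.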

Shekharrajak · 2026-05-24T04:22:27Z

+            topic.partitions().add(partition);
+        }
+
+        TxnShareAcknowledgeRequestData data = new TxnShareAcknowledgeRequestData()


TxnShareAcknowledgeRequest is sent by the producer on the client side and carries all the producer details.

Shekharrajak · 2026-05-24T04:49:27Z

+            for (AcknowledgementBatch b : entry.getValue()) {
+                for (byte ackType : b.acknowledgeTypes()) {
+                    batches.add(new TxnShareAcknowledgeBatch()
+                        .setFirstOffset(b.firstOffset())


(firstOffset, lastOffset, ackType) triples: these point to records that are already in the broker's log.
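One way to picture these triples is as a run-length grouping of per-offset acknowledge types. The sketch below is an illustration only: `AckRange` and `toRanges` are hypothetical names, the `byte[]` input is a simplification, and the PR's actual loop may build its batches differently.

```java
import java.util.ArrayList;
import java.util.List;

final class AckTripleSketch {
    // Hypothetical value type for one (firstOffset, lastOffset, ackType) triple.
    record AckRange(long firstOffset, long lastOffset, byte ackType) {}

    // Collapse consecutive offsets that share an ack type into one range.
    // ackTypes[i] is the ack type for offset firstOffset + i.
    static List<AckRange> toRanges(long firstOffset, byte[] ackTypes) {
        List<AckRange> out = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= ackTypes.length; i++) {
            boolean runEnds = i == ackTypes.length || ackTypes[i] != ackTypes[runStart];
            if (runEnds) {
                out.add(new AckRange(firstOffset + runStart, firstOffset + i - 1, ackTypes[runStart]));
                runStart = i;
            }
        }
        return out;
    }
}
```

Because the triples only reference offsets, the request stays small regardless of record size; the record payloads never travel with the acknowledgement.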

Shekharrajak · 2026-05-24T05:52:32Z

+                "(currentState= " + currentState + ")");
+        }
+
+        TxnRequestHandler handler;


Same implementation pattern as sendOffsetsToTransaction(): transactionManager.sendOffsetsToTransaction(offsets, groupMetadata);

Shekharrajak · 2026-05-24T06:19:59Z

Drafting the transaction coordinator side of the changes.


Shekharrajak · 2026-05-24T07:19:11Z

+        Map<TopicIdPartition, CompletableFuture<Throwable>> futures = new HashMap<>();
+        acknowledgeTopics.forEach((topicIdPartition, acknowledgePartitionBatches) -> {
+            SharePartitionKey sharePartitionKey = sharePartitionKey(groupId, topicIdPartition);
+            SharePartition sharePartition = partitionCache.get(sharePartitionKey);


Per-partition dispatch: each partition's batches are handed to

sharePartition.stageTxnAcknowledge(memberId, producerId, producerEpoch, acknowledgePartitionBatches)

Shekharrajak · 2026-05-24T07:23:59Z

+
+    public void applyTxnMarker(long producerId, short producerEpoch, TransactionResult result) {
+        log.debug("Broadcasting txn marker producerId={} epoch={} result={}", producerId, producerEpoch, result);
+        partitionCache.values().forEach(sp -> sp.applyTxnMarker(producerId, producerEpoch, result));


This iterates every SharePartition on this broker. Each partition iterates its InFlightStates and checks (state == TX_PENDING && stagedProducerId == producerId && stagedProducerEpoch == epoch), so the marker is applied to every matching record across all share partitions.

The scan can be improved, but InFlightState.applyTxnMarker itself is an O(1) operation.

Shekharrajak · 2026-05-24T07:26:49Z

    var skippedMarkers = 0
    for (marker <- markers.asScala) {
      val producerId = marker.producerId
+      sharePartitionManager.applyTxnMarker(producerId, marker.producerEpoch, marker.transactionResult)


After the auth check, each marker is passed to the share partition manager, since it is responsible for finding the affected share partitions and applying the marker.

Shekharrajak · 2026-05-24T07:35:27Z

+    ) {
+        for (Map.Entry<Long, InFlightBatch> entry : subMap.entrySet()) {
+            InFlightBatch inFlightBatch = entry.getValue();
+            if (inFlightBatch.lastOffset() < startOffset) continue;


Batches below startOffset are already acknowledged, so they are skipped.

Shekharrajak · 2026-05-24T07:41:07Z

+        return future;
+    }
+
+    private Throwable stageBatchTxnRecords(


Returns a Throwable as a signalling pattern: null means the batch was staged successfully.

Shekharrajak · 2026-05-24T07:44:57Z

+                }
+
+                throwable = stageBatchTxnRecords(memberId, producerId, producerEpoch, batch, ackTypeMap, subMap);
+                if (throwable != null) break;


Break as soon as staging one batch fails with any exception. Batches staged before the failure have already moved to TX_PENDING; the failing batch has not.

The transaction will then abort, and applyTxnMarker(ABORT) reverts the records to AVAILABLE. This covers both batches: batch 1, which was staged successfully, and batch 2, which failed at some step and is also aborted after the transaction timeout.
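The stop-at-first-failure behaviour can be sketched as follows. Everything here is hypothetical scaffolding: `Batch`, the `stageable` flag used to force a failure, and the simplified states exist only to show that earlier batches stay in TX_PENDING until an ABORT marker reverts them. In this sketch the failed batch simply stays ACQUIRED; how it is eventually released is out of scope.

```java
import java.util.List;

final class StageLoopSketch {
    enum State { ACQUIRED, TX_PENDING, AVAILABLE }

    // Hypothetical batch; 'stageable' is a test hook that forces a staging failure.
    static final class Batch {
        final long firstOffset;
        final boolean stageable;
        State state = State.ACQUIRED;

        Batch(long firstOffset, boolean stageable) {
            this.firstOffset = firstOffset;
            this.stageable = stageable;
        }
    }

    // Stage batches in order and stop at the first failure.
    // Returns the failure, or null when every batch was staged.
    static Throwable stageAll(List<Batch> batches) {
        for (Batch b : batches) {
            if (!b.stageable) {
                return new IllegalStateException("cannot stage batch at offset " + b.firstOffset);
            }
            b.state = State.TX_PENDING;
        }
        return null;
    }

    // ABORT marker: every batch that was staged goes back to AVAILABLE.
    static void applyAbort(List<Batch> batches) {
        for (Batch b : batches) {
            if (b.state == State.TX_PENDING) {
                b.state = State.AVAILABLE;
            }
        }
    }
}
```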

User code:

    try {
        producer.sendShareAcknowledgementsToTransaction(acks, shareGroupMeta); // throws
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction(); // this is what reverts the record state back to AVAILABLE
    }

We must have the same semantics as KafkaProducer.sendOffsetsToTransaction:

• Retry contract (catch KafkaException → abortTransaction() → retry loop)
• Distinction between fatal vs abortable vs retriable errors
• Producer reuse semantics after abort

Transactional REJECT received
-> validate ownership/session/producer fencing
-> persist pending transactional reject
-> no DLQ write yet

Transaction COMMIT marker
-> materialize REJECT
-> if DLQ disabled: ARCHIVED
-> if DLQ enabled: ARCHIVING, then DLQ enqueue, then ARCHIVED

Transaction ABORT marker
-> discard pending REJECT
-> no DLQ write
-> record becomes retryable according to share lock/member rules
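The three flows above can be sketched as a small state holder. This is an illustration under assumptions: `TxnRejectSketch` is a hypothetical class, the in-memory list stands in for the DLQ topic, and the abort path is simplified to AVAILABLE rather than modelling share lock and member rules.

```java
import java.util.ArrayList;
import java.util.List;

final class TxnRejectSketch {
    enum State { ACQUIRED, TX_PENDING, ARCHIVING, ARCHIVED, AVAILABLE }

    private final long offset;
    private final boolean dlqEnabled;
    private final List<Long> dlq = new ArrayList<>(); // stand-in for the DLQ topic
    private State state = State.ACQUIRED;

    TxnRejectSketch(long offset, boolean dlqEnabled) {
        this.offset = offset;
        this.dlqEnabled = dlqEnabled;
    }

    // Transactional REJECT received: persist as pending, no DLQ write yet.
    void stageReject() {
        if (state != State.ACQUIRED) {
            throw new IllegalStateException("REJECT requires ACQUIRED, was " + state);
        }
        state = State.TX_PENDING;
    }

    // COMMIT marker: materialize the REJECT.
    void commit() {
        if (state != State.TX_PENDING) {
            return; // duplicate or unrelated marker
        }
        if (dlqEnabled) {
            state = State.ARCHIVING; // DLQ phase: enqueue before archiving
            dlq.add(offset);
        }
        state = State.ARCHIVED;
    }

    // ABORT marker: discard the pending REJECT, never touch the DLQ.
    void abort() {
        if (state != State.TX_PENDING) {
            return;
        }
        state = State.AVAILABLE; // simplification of "retryable"
    }

    State state() {
        return state;
    }

    int dlqSize() {
        return dlq.size();
    }
}
```

Keeping the DLQ write strictly after the COMMIT marker is what prevents an aborted transaction from leaving a stray dead-letter record behind.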

AndrewJSchofield · 2026-05-24T14:49:42Z

Thanks for the PR.

Personally, I would prefer if there was no PR for such a complicated KIP until it has successfully passed a vote. Without serious review by committers knowledgeable in the areas of share groups and transactions, the KIP would not yet be able to pass a vote. It will need +3 binding votes.

There are many details which need to be agreed by the committers before we are there. For example, we will need to have a new transaction.version and new share.version to ensure that we correctly police these schema changes to the information on the internal metadata topics, and only enable the feature once all brokers in the cluster are ready. We would typically break down the code into many PRs so that the chunks are reviewable with reasonable effort. For example, I would do the feature changes first, then the JSON RPC schemas, and then gradually introduce the actual logic. I suggest KIP-1191 and https://issues.apache.org/jira/browse/KAFKA-19469 as a good example of breaking the code down and introducing it progressively (that's one I'm actively reviewing for AK 4.4).

Finally, we would not accept a KIP like this directly into Apache Kafka as a GA feature without some kind of Early Access or Preview release, so the feature bump also allows us to pace its enablement with the successful completion of an extended period of system testing.

My suggestion would be to close the PR, concentrate on fleshing out the KIP and building alignment with committers.

Shekharrajak · 2026-06-18T19:23:11Z

            return DELIVERY_STATE_AVAILABLE;
        }
        if (batch.stagedDeliveryState() != PersisterStateBatch.NO_STAGED_DELIVERY_STATE) {
-            if (batch.stagedDeliveryState() == DELIVERY_STATE_ARCHIVING) {


For transactional REJECT with DLQ enabled, the coordinator must preserve staged ARCHIVING on commit.
The source SharePartition owns DLQ phase 2; after reload it resumes ARCHIVING, writes
the DLQ record, and then archives the share-state batch.

Shekharrajak · 2026-06-18T19:27:26Z

+                    topicIdPartition
+                );
+            }
+            if (topicIdPartitionByShareStatePartition.size() == sgsTopicPartitions.size()) {


One producer transaction can include share acks whose source partitions map to multiple __share_group_state partitions, so this helper method is useful in the ITs.

We should try to find a better way of doing this.

Shekharrajak · 2026-06-18T19:29:57Z

+            assertEquals(0, shareConsumer.poll(Duration.ofMillis(1000)).count());
+            verifySharePartitionLag(admin, groupId, tp, 0L);
+            waitForDlqRecords(dlqTopic, 1);
+            verifyLatestShareStateDeliveryState(groupId, acknowledgedPartition, 0L, RecordState.ARCHIVED);


Proves that commit + REJECT with DLQ enabled writes exactly one DLQ record and ends in ARCHIVED.

Shekharrajak · 2026-06-18T19:30:22Z

            return DELIVERY_STATE_AVAILABLE;
        }
        if (batch.stagedDeliveryState() != PersisterStateBatch.NO_STAGED_DELIVERY_STATE) {
-            if (batch.stagedDeliveryState() == DELIVERY_STATE_ARCHIVING) {


preserves staged delivery state instead of converting ARCHIVING to ARCHIVED.

Shekharrajak · 2026-06-18T19:34:36Z

+        sharePartition.applyTxnMarker(100L, (short) 1, TransactionResult.COMMIT).join();
+        sharePartition.applyTxnMarker(100L, (short) 1, TransactionResult.COMMIT).join();
+
+        assertTrue(sharePartition.cachedState().isEmpty());


covers duplicate marker idempotency.

Shekharrajak · 2026-06-19T14:11:25Z

     * Prepare a transaction for a two-phase commit.
     * This transitions the transaction to the PREPARED_TRANSACTION state.
-     * The preparedTxnState is set with the current producer ID and epoch.
+     * The preparedTxnState is set with the current transaction owner fence.


Share ack transaction staging/finalization internals now use txnOwnerId / txnOwnerEpoch across SharePartition, SharePartitionManager, InFlightState, InFlightBatch, and the share coordinator completion paths.

This lets an external processing engine use it as a transaction: the engine's coordinator runs the commit during a checkpoint or once all jobs have completed.


Shekharrajak · 2026-06-24T19:00:16Z

            try {
                assertThrows(CommitFailedException.class,
-                    () -> transactionalProducer.sendShareAcknowledgementsToTransaction(acknowledgements, staleGroupMetadata));
+                    () -> transactionalProducer.sendShareAcknowledgementsToTransaction(acknowledgements, fencedGroupMetadata));


the request is rejected because its epoch does not match coordinator-owned membership state.

old epoch lower than current -> stale request

epoch equal to current -> valid request

previous epoch -> tolerated in Kafka share group logic

epoch higher than current -> impossible client claim, reject/fence it
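Read together, these four rules suggest a small classification function. The sketch below encodes one possible reading, which is an assumption: "previous epoch" is taken to mean exactly current minus one, and anything older than that is stale. `MemberEpochSketch` and `Verdict` are hypothetical names.

```java
final class MemberEpochSketch {
    enum Verdict { VALID, TOLERATED, STALE, FENCED }

    // Classify a request's member epoch against the coordinator-owned epoch.
    static Verdict classify(int requestEpoch, int currentEpoch) {
        if (requestEpoch == currentEpoch) {
            return Verdict.VALID;
        }
        if (requestEpoch > currentEpoch) {
            return Verdict.FENCED; // impossible client claim
        }
        if (requestEpoch == currentEpoch - 1) {
            return Verdict.TOLERATED; // assumed meaning of "previous epoch"
        }
        return Verdict.STALE; // older than the previous epoch
    }
}
```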

Shekharrajak · 2026-06-26T14:49:33Z

+            List<ConsumerRecord<byte[], byte[]>> outputRecords = readCommittedRecords(outputTopicPartition, 1);
+            ConsumerRecord<byte[], byte[]> outputRecord = outputRecords.get(0);
+            assertEquals(0L, outputRecord.offset());
+            assertEquals(outputValue, new String(outputRecord.value(), StandardCharsets.UTF_8));


Validation: one Kafka transaction can include both a normal output record and share acknowledgements.

Shekharrajak · 2026-06-26T14:49:56Z

+            verifySharePartitionLag(admin, groupId, inputTopicPartition, 1L);
+            ConsumerRecords<byte[], byte[]> redeliveredRecords = waitedPoll(shareConsumer, 2500L, 1);
+            ConsumerRecord<byte[], byte[]> redeliveredRecord = redeliveredRecords.iterator().next();
+            assertEquals(0L, redeliveredRecord.offset());


Validation (abort case): one Kafka transaction includes both a normal output record and share acknowledgements, and is then aborted.

The output stays hidden from read_committed consumers, share lag remains 1, and the input record is redelivered.

Shekharrajak · 2026-06-26T16:50:39Z

+    val recoveredProducer = transactionalProducer(transactionalId)
+    try {
+      recoveredProducer.initTransactions(true)
+      recoveredProducer.completeTransaction(new PreparedTxnState(preparedState.toString))


Recovered commit.

Client recovery now sets the active transaction owner and marks the transaction as started, so completeTransaction(...) sends EndTxn.

Shekharrajak · 2026-06-26T16:53:00Z

+    val recoveredProducer = transactionalProducer(transactionalId)
+    try {
+      recoveredProducer.initTransactions(true)
+      recoveredProducer.completeTransaction(new PreparedTxnState(s"${preparedState.txnOwnerId + 1}:${preparedState.txnOwnerEpoch}"))


Abort flow.

The txnOwnerId does not match, so it calls abortTransaction().

Shekharrajak · 2026-06-26T16:56:50Z

      // 2PC functionality is disabled, clients that attempt to use this functionality
      // would receive an authorization failed error.
      responseCallback(initTransactionError(Errors.TRANSACTIONAL_ID_AUTHORIZATION_FAILED))
-    } else if (keepPreparedTxn) {


Broker no longer rejects keepPreparedTxn unconditionally

Shekharrajak · 2026-06-26T16:57:44Z


        public Builder(InitProducerIdRequestData data) {
-            super(ApiKeys.INIT_PRODUCER_ID);
+            super(


Client requires InitProducerId v6 when enable2Pc or keepPreparedTxn is set

Shekharrajak · 2026-06-26T16:58:41Z

-              expectedProducerIdAndEpoch)
-          )
+          txnMetadata.inLock(() => {
+            if (keepPreparedTxn && txnMetadata.state == TransactionState.ONGOING) {


Coordinator preserves only existing ONGOING 2PC transactions and returns the ongoing producer id/epoch for recovery.

Shekharrajak added 4 commits May 23, 2026 23:17

KAFKA-19883: Add TX_PENDING state to RecordState for KIP-1289 transac…

782f6df

…tional acks

KAFKA-19883: Add txn staging fields and applyTxnMarker to InFlightSta…

f1fbb13

…te for KIP-1289

KAFKA-19883: Add ShareGroupMetadata and TxnShareAcknowledge wire sche…

5c03d8d

…mas for KIP-1289

KAFKA-19883: Add sendShareAcknowledgementsToTransaction to producer a…

3b5c547

…nd wire shareGroupMetadata() for KIP-1289

github-actions Bot added triage PRs from the community core Kafka Broker producer consumer clients labels May 23, 2026

Shekharrajak commented May 24, 2026


github-actions Bot removed the triage PRs from the community label May 24, 2026

Shekharrajak commented May 24, 2026


Shekharrajak added 3 commits May 24, 2026 12:29

KAFKA-19883: Add SharePartition.stageTxnAcknowledge and applyTxnMarke…

555b065

…r for KIP-1289

KAFKA-19883: Add SharePartitionManager.acknowledgeTransactional and a…

02b73ef

…pplyTxnMarker for KIP-1289

KAFKA-19883: Wire TxnShareAcknowledge handler and WriteTxnMarkers hoo…

f0ce656

…k in KafkaApis for KIP-1289

github-actions Bot added the KIP-932 Queues for Kafka label May 24, 2026

Shekharrajak commented May 24, 2026


Shekharrajak changed the title ~~KAFKA-19883: KIP-1289 — Transactional acknowledgments for share groups (state machine + client API + wire schema)~~ KAFKA-19883: [DRAFT] KIP-1289 — Transactional acknowledgments for share groups (state machine + client API + wire schema) May 24, 2026

Shekharrajak marked this pull request as draft May 24, 2026 15:12

Shekharrajak added 6 commits June 18, 2026 22:30

Add multi share-state txn ack IT

2fea1dc

Cover share ack drain semantics

49136c3

Test share txn coordinator failover

a8b5fe9

Assert remote share marker refresh

f737fe5

Fix transactional reject DLQ flow

04961cc

Cover share marker retries

7c8dbd9

Shekharrajak commented Jun 18, 2026


Shekharrajak added 2 commits June 19, 2026 19:23

Expose prepared txn owner state

6fb4175

Rename share ack txn owner

95c654e

Shekharrajak commented Jun 19, 2026


Shekharrajak mentioned this pull request Jun 21, 2026

FLIP-573: [DRAFT] Queues for Kafka apache/flink-connector-kafka#271

Draft

Shekharrajak added 3 commits June 24, 2026 23:13

Merge remote-tracking branch 'upstream/trunk' into kip-1289-txn-ack-s…

7e7e1b9

…hare-groups

Fence stale transactional share acknowledgements

43f07ea

Clarify fenced share member epoch test

8e7687d

Shekharrajak commented Jun 24, 2026


Add same-transaction share ack tests

936bf1c

Shekharrajak commented Jun 26, 2026


Shekharrajak added 2 commits June 26, 2026 22:16

Enable prepared transaction recovery

425f1de

Add prepared transaction recovery test

7f01976

github-actions Bot added the transactions Transactions and EOS label Jun 26, 2026

Shekharrajak commented Jun 26, 2026

