I am running Kafka transactions at a large scale, and below is the code:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>(producerTopic, element));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // fatal errors: the producer must be closed
    producer.close();
    canSendNext = false;
} catch (KafkaException e) {
    // non-fatal errors: abort the transaction and try again
    producer.abortTransaction();
}
Properties used:
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, STRING_SERIALIZER
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, BYTE_ARRAY_SERIALIZER
ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString()
bootstrap.servers=localhost:9092
acks=all
retries=1
partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner
When commitTransaction times out, the KafkaException catch block runs and tries to abort the transaction.
This fails with the error:
Cannot attempt operation abortTransaction because the previous call to commitTransaction timed out and must be retried
How should I handle the commitTransaction timeout scenario? I expect the code to keep working.
As per the documentation:
Note that this method will raise TimeoutException if the transaction
cannot be committed before expiration of max.block.ms. Additionally,
it will raise InterruptException if interrupted. It is safe to retry
in either case, but it is not possible to attempt a different
operation (such as abortTransaction) since the commit may already be
in the progress of completing. If not retrying, the only option is to
close the producer.
The problem here is that the Kafka producer has timed out and it does not know whether the Kafka broker will complete the transaction or not. So it cannot offer you the abortTransaction operation, because there is a chance that the transaction was committed by the broker even after the producer timed out.
You should configure your Kafka producer with a sufficiently large max.block.ms and a reasonable number of retries to handle such scenarios (it is not clear why you have configured retries to 1). Ideally you should be timing out very rarely, such as when there is a network issue or an actual problem on the Kafka brokers.
In such scenarios, it won't be possible for you to know whether your last transaction was actually successful. You cannot do anything but close your Kafka producer and create a new one.
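A minimal sketch of that handling, assuming a helper wrapped around the producer (the method name, retry limit, and the rebuild step left to the caller are illustrative, not part of the Kafka API):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.errors.InterruptException;
import org.apache.kafka.common.errors.TimeoutException;

// Sketch: retry the commit while it is still safe to do so; once retries are
// exhausted, close the producer so the caller can build a fresh one.
void commitWithRetry(KafkaProducer<String, byte[]> producer, int maxRetries) {
    for (int attempt = 1; ; attempt++) {
        try {
            // Per the javadoc, commitTransaction() is safe to retry after a
            // TimeoutException or InterruptException.
            producer.commitTransaction();
            return;
        } catch (TimeoutException | InterruptException e) {
            if (attempt >= maxRetries) {
                // Aborting is not allowed after a timed-out commit; closing the
                // producer (and creating a new one) is the only remaining option.
                producer.close();
                throw e;
            }
        }
    }
}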
Related
A streaming application was rolled out to production, and after about 10 days we are observing errors/warnings in the CustomProductionExceptionHandler for expired transactions that belong to an older day window.
FLOW:
INPUT TOPIC --> STREAMING APPLICATION(Produces stats and emits after day window closed) --> OUTPUT TOPIC
The producer continuously tries to publish records to the OUTPUT topic for a window that has already expired, and logs an error through the CustomProductionExceptionHandler.
I have reduced the batch size and kept the default, but this change has not yet been promoted to production.
CustomProductionExceptionHandler implementation: to avoid the stream dying due to NetworkException or TimeoutException.
With this implementation the producer does not retry; in case of any exception it returns CONTINUE. On the other hand, upon returning FAIL the stream thread dies and does not auto-restart. Need suggestions.
public class CustomProductionExceptionHandler implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        String recordKey = new String(record.key());
        String recordVal = new String(record.value());
        String recordTopic = record.topic();
        logger.error("Kafka message marked as processed although it failed. Message: [{}:{}], destination topic: [{}]",
                recordKey, recordVal, recordTopic, exception);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}
Exception:
2019-12-20 16:31:37.576 ERROR com.jpmc.gpg.exception.CustomProductionExceptionHandler.handle(CustomProductionExceptionHandler.java:19) kafka-producer-network-thread | profile-day-summary-generator-291e69b1-5a3d-4d49-8797-252c2ae05607-StreamThread-19-producerid - Kafka message marked as processed although it failed. Message: [{"statistics":{}], destination topic: [OUTPUT-TOPIC]
org.apache.kafka.common.errors.TimeoutException: Expiring * record(s) for TOPIC:1086149 ms has passed since batch creation
I am trying to get answers to the questions below.
1) Why is the producer trying to publish older transactions to the OUTPUT topic for which the day window has already closed?
Example: the producer is trying to send a 12/09 day-window transaction, but the currently open window is 12/20.
2) The stream threads would have died without CustomProductionExceptionHandler returning ProductionExceptionHandlerResponse.CONTINUE. Is there any way the producer can retry on a NetworkException or TimeoutException and then continue, instead of the stream thread dying?
The problem with specifying ProductionExceptionHandlerResponse.CONTINUE in the CustomProductionExceptionHandler is that, in case of any exception, it skips publishing that record to the output topic and proceeds with the next records. No resiliency.
1) It's not really possible to answer this question without knowing what your program does. Note that, in general, Kafka Streams works on event-time and handles out-of-order data.
2) You can configure all internally used clients of a Kafka Streams application (i.e., consumer, producer, admin client, and restore consumer) by specifying the corresponding client configuration in the Properties you pass into KafkaStreams. If you want different configs for different clients, you can prefix them accordingly, i.e., producer.retries instead of retries. Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#ak-consumers-producer-and-admin-client-configuration-parameters
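For example, a sketch of producer-prefixed configs (the topology variable and the concrete values are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-day-summary-generator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Prefixed settings are forwarded only to the internal producer.
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 60_000);
KafkaStreams streams = new KafkaStreams(topology, props);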
I have my Cassandra sink configured as shown below:
ClusterBuilder secureCassandraSinkClusterBuilder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoints(props.getCassandraClusterUrlAll().split(","))
                .withPort(props.getCassandraPort())
                .withAuthProvider(new DseGSSAPIAuthProvider("HTTP"))
                .build();
    }
};
CassandraSink
        .addSink(cassandraObjectStream)
        .setClusterBuilder(secureCassandraSinkClusterBuilder)
        .build()
        .name("Cassandra-Sink");
Now, when the connection to Cassandra is not configured properly I get a NoHostAvailableException; when the connection unexpectedly drops I get a ConnectionTimeOutException; and sometimes I get a WriteTimeoutException. This ultimately triggers a JobExecutionException and the whole Flink job terminates.
Where do I catch these Cassandra exceptions? Where are these thrown? I tried putting a try-catch block around the CassandraSink but that doesn't do it. I want to catch these exceptions and retry connecting to Cassandra in case of a connection time-out or retry writing to Cassandra in case of a write time-out.
AFAIK, you cannot catch these exceptions through CassandraSink.
One way to catch exceptions like TimeoutException is to implement your own sink for Cassandra, but that may take a lot of time...
Another way: if you run a streaming job, you can set the task retries to more than 1 through StreamExecutionEnvironment.setRestartStrategy, and enable checkpointing so that the streaming job can continue working from the last checkpoint. CassandraSink supports WAL, so EXACTLY_ONCE can be achieved with checkpointing enabled.
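A minimal sketch of that setup (the checkpoint interval, retry count, and delay are illustrative):

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60s so a restarted job resumes from the last checkpoint.
env.enableCheckpointing(60_000);
// Restart up to 3 times with a 10s delay between attempts, instead of failing
// the whole job on a NoHostAvailableException / WriteTimeoutException.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));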
The new Kafka version (0.11) supports exactly-once semantics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I've got a producer set up with Kafka transactional code in Java, like this.
producer.initTransactions();
try {
    producer.beginTransaction();
    for (ProducerRecord<String, String> record : payload) {
        producer.send(record);
    }
    Map<TopicPartition, OffsetAndMetadata> groupCommit = new HashMap<TopicPartition, OffsetAndMetadata>() {
        {
            put(new TopicPartition(TOPIC, 0), new OffsetAndMetadata(42L, null));
        }
    };
    producer.sendOffsetsToTransaction(groupCommit, "groupId");
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction();
}
I'm not quite sure how to use sendOffsetsToTransaction, or what its intended use case is. AFAIK, consumer groups are a multithreaded read feature on the consumer end.
The javadoc says:
" Sends a list of consumed offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered consumed only if the transaction is committed successfully. This method should be used when you need to batch consumed and produced messages together, typically in a consume-transform-produce pattern."
How would a producer maintain a list of consumed offsets? What's the point of it?
This is only relevant to workflows in which you are consuming and then producing messages based on what you consumed. This function allows you to commit offsets you consumed only if the downstream producing succeeds. If you consume data, process it somehow, and then produce the result, this enables transactional guarantees across the consumption/production.
Without transactions, you normally use Consumer#commitSync() or Consumer#commitAsync() to commit consumer offsets. But if you use these methods before you've produced with your producer, you will have committed offsets before knowing whether the producer succeeded sending.
So, instead of committing your offsets with the consumer, you can use Producer#sendOffsetsToTransaction() to commit the offsets. This sends the offsets to the transaction manager handling the transaction, which commits them only if the entire transaction, consuming and producing, succeeds.
(Note: when you send the offsets to commit, you should add 1 to the offset last read, so that future reads resume from the offset you haven't read. This is true regardless of whether you commit with the consumer or the producer. See: KafkaProducer sendOffsetsToTransaction need offset+1 to successfully commit current offset).
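A sketch of the consume-transform-produce pattern this enables. The consumer (with enable.auto.commit=false) and the transactional producer are assumed to be configured elsewhere; transform() and the topic/group names are hypothetical:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("input-topic"));
producer.initTransactions();
while (true) { // loop until shutdown
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) {
        continue;
    }
    producer.beginTransaction();
    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    for (ConsumerRecord<String, String> rec : records) {
        producer.send(new ProducerRecord<>("output-topic", rec.key(), transform(rec.value())));
        // offset + 1: the position of the next record to read on this partition
        offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                new OffsetAndMetadata(rec.offset() + 1));
    }
    // The offsets are committed atomically with the produced records.
    producer.sendOffsetsToTransaction(offsets, "my-consumer-group");
    producer.commitTransaction();
}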
I am trying to handle a flow-control situation on the producer end.
I have a queue on a qpid broker with a max queue-size set. I also have flow_stop_count and flow_resume_count set on the queue.
The producer keeps continuously producing messages until flow_stop_count is reached. Upon breach of this count, an exception is thrown, which is handled by an ExceptionListener.
Some time later the consumer on the queue will catch up and flow_resume_count will be reached. The question is: how does the producer learn of this event?
Here's some sample code for the producer:
Connection connection = connectionFactory.createConnection();
connection.setExceptionListener(new MyExceptionListener());
connection.start();
Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
Queue queue = (Queue) context.lookup("Test");
MessageProducer producer = session.createProducer(queue);
while (notStopped) {
    while (suspend) { // --------------------------- how to resume this flag???
        Thread.sleep(1000);
    }
    TextMessage message = session.createTextMessage();
    message.setText("TestMessage");
    producer.send(message);
}
session.close();
connection.close();
And for the exception listener:
private class MyExceptionListener implements ExceptionListener {
    public void onException(JMSException e) {
        System.out.println("got exception:" + e.getMessage());
        suspend = true;
    }
}
Now, the exception listener is a generic listener for exceptions, so it is probably not a good idea to suspend the producer flow through it.
What I need is perhaps some method at the producer level, something like producer.isFlowStopped(), which I can use to check before sending a message. Does such functionality exist in the Qpid API?
There is some documentation on the Qpid website which suggests this can be done, but I couldn't find any examples of it being done anywhere.
Is there some standard way of handling this kind of scenario?
From what I have read of the Apache Qpid documentation, it seems that flow_stop_count and flow_resume_count will cause producers to start getting blocked.
Therefore the only option, software-wise, would be to poll at regular intervals until messages start flowing again.
Extract from here.
If a producer sends to a queue which is overfull, the broker will respond by instructing the client not to send any more messages. The impact of this is that any future attempts to send will block until the broker rescinds the flow control order.
While blocking the client will periodically log the fact that it is blocked waiting on flow control.
WARN AMQSession - Broker enforced flow control has been enforced
WARN AMQSession - Message send delayed by 5s due to broker enforced flow control
WARN AMQSession - Message send delayed by 10s due to broker enforced flow control
After a set period the send will timeout and throw a JMSException to the calling code.
ERROR AMQSession - Message send failed due to timeout waiting on broker enforced flow control.
This documentation implies that the software managing the producer has to self-manage: when you receive an exception that the queue is over-full, you need to back off and most likely poll and reattempt sending your messages.
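A rough sketch of such a back-off loop (maxAttempts and baseDelayMs are illustrative tuning knobs, not part of the Qpid API):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;

void sendWithBackoff(MessageProducer producer, Message message,
                     int maxAttempts, long baseDelayMs) throws JMSException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
        try {
            producer.send(message); // blocks while broker-enforced flow control is active
            return;
        } catch (JMSException e) {
            if (attempt >= maxAttempts) {
                throw e; // still flow-controlled: give up and surface the error
            }
            Thread.sleep(baseDelayMs * attempt); // linear back-off before retrying
        }
    }
}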
You can try setting the capacity (the size in bytes at which the queue is considered full) and flowResumeCapacity (the queue size at which producers are unflowed) properties for a queue.
send() will then block if the size exceeds the capacity value.
You can have a look at this test case file in the repo to get an idea.
Producer flow control is not yet implemented on the JMS client.
See https://issues.apache.org/jira/browse/QPID-3388
I use ActiveMQ as a broker to deliver messages. These messages are intended to be written to a database. Sometimes the database is unreachable or down. In that case, I want to roll back my message so it is retried later, and I want to continue reading other messages.
This code works fine, except for one point: the rolled-back message blocks me from reading the others:
private Connection getConnection() throws JMSException {
    RedeliveryPolicy redeliveryPolicy = new RedeliveryPolicy();
    redeliveryPolicy.setMaximumRedeliveries(3); // will retry 3 times to dequeue rolled-back messages
    redeliveryPolicy.setInitialRedeliveryDelay(5 * 1000); // will wait 5s before redelivering that message
    ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory(user, password, url);
    Connection connection = connectionFactory.createConnection();
    ((ActiveMQConnection) connection).setUseAsyncSend(true);
    ((ActiveMQConnection) connection).setDispatchAsync(true);
    ((ActiveMQConnection) connection).setRedeliveryPolicy(redeliveryPolicy);
    ((ActiveMQConnection) connection).setStatsEnabled(true);
    connection.setClientID("myClientID");
    return connection;
}
I create my session this way:
session = connection.createSession(true, Session.SESSION_TRANSACTED);
Asking for a rollback is easy:
session.rollback();
Let's imagine I have 4 messages in my queue:
1: ok
2: KO (will need to be treated again: the message I want to roll back)
3: ok
4: ok
My consumer will do (linear sequence):
commit 1
rollback 2
wait 5s
rollback 2
wait 5s
rollback 2
put 2 in dead letter queue (ActiveMQ.DLQ)
commit 3
commit 4
But I want:
commit 1
rollback 2
commit 3
commit 4
wait 5s
rollback 2
wait 5s
rollback 2
wait 5s
put 2 in dead letter queue (ActiveMQ.DLQ)
So, how can I configure my consumer to delay rolled-back messages until later?
This is actually expected behavior, because message retries are handled by the client, not the broker. Since you have one session bound and your retry policy is set up for 3 retries before the DLQ, the entire retry process blocks that particular thread.
So, my first question is that if the database insert fails, wouldn't you expect all of the rest of your DB inserts to fail for a similar reason?
If not, the way to get around that is to set the retry policy for that queue to 0 retries, with a specific DLQ, so that messages fail immediately and go into the DLQ. Then have another process that pulls from the DLQ every 5 seconds and reprocesses the message and/or puts it back on the main queue for processing.
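A sketch of the fail-fast part on the client side; the DLQ reprocessor is only described in the comments, and user, password, and url are the same variables as in the question:

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.RedeliveryPolicy;

// Fail fast to the DLQ instead of blocking the session with client-side retries.
RedeliveryPolicy policy = new RedeliveryPolicy();
policy.setMaximumRedeliveries(0); // a rollback sends the message straight to the DLQ
ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory(user, password, url);
connectionFactory.setRedeliveryPolicy(policy);
// A separate consumer can then read the DLQ on its own schedule and re-send
// failed messages to the main queue for another attempt.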
Are you using the <strictOrderDispatchPolicy /> in the ActiveMQ XML config file? I'm not sure if this will affect the order of messages for redelivery or not. If you are using strict order dispatch, try commenting out that policy to see if that changes the behavior.
Bruce
I had the same problem and hadn't found a solution here, so after I found one I decided to post it for people struggling with the same issue.
This was fixed around version 5.6: set the nonBlockingRedelivery property to true on the connection factory:
<property name="nonBlockingRedelivery" value="true" />
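If you build the connection factory in plain Java rather than Spring, the equivalent setting (assuming the same factory variables as in the question) would presumably be:

import org.apache.activemq.ActiveMQConnectionFactory;

ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory(user, password, url);
connectionFactory.setNonBlockingRedelivery(true); // redeliveries no longer block delivery of other messages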