Kafka Transactions are not rolling back after interrupting current thread on timeout - java

I'm trying to interrupt the current thread if an HTTP request times out. I have set up a PlatformTransactionManager for Kafka transactions as a bean, and I'm using the @Transactional annotation at method level. We publish a message to 3 topics. After publishing to the first topic I call Thread.sleep(5000), and a filter interrupts the current thread if execution takes more than 6 seconds. The call is interrupted, but the message is still published to Kafka. We only produce messages and do not consume any, yet we can see the message in our internal Kafka inspection tool. We use KafkaTemplate.send() to send the messages.

Producer records are always written to the log, even if rolled back. There is a special record in the slot following the published record(s) to indicate whether the transaction was committed or rolled back.
Consumers' isolation.level is read_uncommitted by default; you need to set it to read_committed to avoid seeing rolled-back records.
https://kafka.apache.org/documentation/#consumerconfigs_isolation.level
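As a minimal sketch of the consumer-side setting described above, the properties below would be passed to whatever consumer reads the topic (for example the inspection tool, if it is configurable). The class name, the broker address and the group id are assumptions made for illustration; only the isolation.level key and its read_committed value come from the Kafka documentation linked above.

```java
import java.util.Properties;

final class ReadCommittedConsumerConfig {

    /** Builds consumer properties that hide records from aborted transactions. */
    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // The default is read_uncommitted, which also returns rolled-back records.
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address and group id, for illustration only.
        Properties props = consumerProps("localhost:9092", "inspection-tool");
        System.out.println(props.getProperty("isolation.level"));
    }
}
```

These properties can be handed to a KafkaConsumer constructor; a tool that does not expose this setting will keep showing rolled-back records.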

Related

TooManyMessagesWithoutAckException while processing kafka message in quarkus

In a Quarkus process we perform the steps below once a message is polled from Kafka:
1. Thread.sleep(30000) - due to business logic
2. Call a 3rd party API
3. Call another 3rd party API
4. Insert data into the DB
About once a day the process hangs after throwing TooManyMessagesWithoutAckException.
2022-12-02 20:02:50 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Sleep over!
Below is an example on how we are consuming the messages
@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    String msgKey = (String) person
            .getMetadata(IncomingKafkaRecordMetadata.class).get()
            .getKey();
    // ...
    return person.ack();
}
As per the logs, only 30 seconds have passed since the event was polled, yet the exception says that the Kafka acknowledgement was not sent for 60 seconds.
I checked whole day's log when the error was thrown to see if the REST api calls took more than 30 seconds to fetch the data, but I wasn't able to find any.
We haven't done any specific kafka configuration other than topic name, channel name, serializer, deserializer, group id and managed kafka connection details.
There are 4 partitions in this topic with replication factor of 3. There are 3 pods running for this process.
We're unable to reproduce this issue in the Dev and UAT environments.
I checked the configuration options but couldn't find any that might help: Quarkus Kafka Reference
mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer
Is it possible that quarkus is acknowledging the messages in batches and by that time the waiting time has already reached the threshold?
Please comment if there are any other possibilities for this issue.
I have similar issues in our production environment, running different Quarkus services with a simple 3-node Kafka cluster, and I researched the problem a lot with no clear answer. At the moment, I have two approaches to this problem:
Make sure you really ack or nack the Kafka message in your code. Is every exception really caught and answered with a "person.nack(exception);" (or a "person.ack()", depending on your failure strategy)? Make sure it is. The throttled exception is thrown if neither ack() nor nack() is performed. The problem occurs mostly when nothing happens at all.
When this does not help, I switch the commit strategy to "latest":
mp.messaging.incoming.my-channel.commit-strategy=latest
This is a little slower, because the batch commit is disabled, but it runs stably in my case. If you don't know about commit strategies and the default, catch up with the good article by Escoffier:
I am aware that this does not solve the root cause, but it helped in desperate times. The problem has to be that one or more queued messages are not acknowledged in time, but I can't tell you why. Maybe the application logic is too slow, but I have a hard time, like you, reproducing this locally. You can also try to increase the 60-second threshold with throttled.unprocessed-record-max-age.ms and see for yourself whether this helps. In my case, it did not. Maybe someone else can share their insights into this problem and provide you with a real solution.
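The first approach (always answer with ack or nack) can be sketched as below. This is not the Quarkus API itself: Acknowledgeable is a hypothetical stand-in for the getPayload/ack/nack part of a reactive-messaging Message, and Recording exists only to show the control flow, so that the example runs without any messaging dependency.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

final class AckOrNackSketch {

    /** Hypothetical stand-in for the payload/ack/nack part of a Message. */
    interface Acknowledgeable<T> {
        T getPayload();
        CompletionStage<Void> ack();
        CompletionStage<Void> nack(Throwable reason);
    }

    /** Runs the business logic and answers with nack on failure, ack otherwise. */
    static <T> CompletionStage<Void> process(Acknowledgeable<T> msg, Consumer<T> logic) {
        try {
            logic.accept(msg.getPayload());
        } catch (RuntimeException e) {
            // Without this branch a failing record would never be acknowledged.
            return msg.nack(e);
        }
        return msg.ack();
    }

    /** Recording implementation used only to demonstrate the control flow. */
    static final class Recording implements Acknowledgeable<String> {
        boolean acked;
        Throwable nackedWith;

        public String getPayload() { return "person-1"; }

        public CompletionStage<Void> ack() {
            acked = true;
            return CompletableFuture.completedFuture(null);
        }

        public CompletionStage<Void> nack(Throwable reason) {
            nackedWith = reason;
            return CompletableFuture.completedFuture(null);
        }
    }

    public static void main(String[] args) {
        Recording ok = new Recording();
        process(ok, payload -> { });
        Recording failing = new Recording();
        process(failing, payload -> { throw new IllegalStateException("3rd party API failed"); });
        System.out.println(ok.acked + " " + (failing.nackedWith != null));
    }
}
```

In a real consumer the same try/catch shape would wrap the sleep, the API calls and the DB insert, so that no code path returns without acknowledging the record one way or the other.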

Kafka streaming - TimeoutException: Expiring * record(s) for TOPIC:* ms has passed since batch creation

The streaming application was rolled out to production, and after 10 days we are observing errors/warnings in the CustomProductionExceptionHandler for expired transactions that belong to an older day window.
FLOW:
INPUT TOPIC --> STREAMING APPLICATION (produces stats and emits them after the day window has closed) --> OUTPUT TOPIC
The producer continuously tries to publish records to the OUTPUT topic that belong to an older, already expired window, and logs an error in the CustomProductionExceptionHandler.
I have reduced batch size and kept default but this change is not yet promoted to production.
CustomProductionExceptionHandler implementation: added to keep the stream from dying due to NetworkException or TimeoutException.
With this implementation the producer does not retry; on any exception it returns CONTINUE. On the other hand, when FAIL is returned, the stream thread dies and does not restart automatically. Suggestions are welcome.
public class CustomProductionExceptionHandler implements ProductionExceptionHandler {
    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        // Key and value can be null, so guard before decoding them for the log message.
        String recordKey = record.key() == null ? null : new String(record.key());
        String recordVal = record.value() == null ? null : new String(record.value());
        String recordTopic = record.topic();
        logger.error("Kafka message marked as processed although it failed. Message: [{}:{}], destination topic: [{}]", recordKey, recordVal, recordTopic, exception);
        // CONTINUE skips the failed record and keeps the stream thread alive.
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}
Exception:
2019-12-20 16:31:37.576 ERROR com.jpmc.gpg.exception.CustomProductionExceptionHandler.handle(CustomProductionExceptionHandler.java:19) kafka-producer-network-thread | profile-day-summary-generator-291e69b1-5a3d-4d49-8797-252c2ae05607-StreamThread-19-producerid - Kafka message marked as processed although it failed. Message: [{"statistics":{}], destination topic: [OUTPUT-TOPIC]
org.apache.kafka.common.errors.TimeoutException: Expiring * record(s) for TOPIC:1086149 ms has passed since batch creation
I am trying to get answers to the questions below.
1) Why is the producer trying to publish older transactions to the OUTPUT topic when their day window is already closed?
Example: the producer is trying to send a transaction from the 12/09 day window, but the currently open window is 12/20.
2) The streaming threads would have died without CustomProductionExceptionHandler returning ProductionExceptionHandlerResponse.CONTINUE. Is there any way for the producer to retry in case of NetworkException or TimeoutException and then continue, instead of the stream thread dying?
The problem with returning ProductionExceptionHandlerResponse.CONTINUE from the CustomProductionExceptionHandler is that, on any exception, it skips publishing that record to the output topic and proceeds with the next records. There is no resiliency.
1) It's not really possible to answer this question without knowing what your program does. Note that, in general, Kafka Streams works on event time and handles out-of-order data.
2) You can configure all internally used clients of a Kafka Streams application (i.e., consumer, producer, admin client, and restore consumer) by specifying the corresponding client configuration in the Properties you pass into KafkaStreams. If you want different configs for different clients, you can prefix them accordingly, e.g., producer.retries instead of retries. Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#ak-consumers-producer-and-admin-client-configuration-parameters
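A minimal sketch of the prefixed configuration mentioned in point 2 follows. The application id, broker address and all numeric values are assumptions chosen for illustration, not recommendations. retry.backoff.ms and delivery.timeout.ms are standard producer settings that are not named in the answer; delivery.timeout.ms only exists in newer client versions.

```java
import java.util.Properties;

final class StreamsProducerRetrySketch {

    /** Builds Streams properties where retry settings apply to the internal producer only. */
    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("application.id", "my-streams-app");     // hypothetical
        props.setProperty("bootstrap.servers", "localhost:9092");  // hypothetical
        // The "producer." prefix routes a setting to the internal producer only.
        props.setProperty("producer.retries", "10");
        props.setProperty("producer.retry.backoff.ms", "1000");
        // Upper bound on how long a send may take before it is reported as failed
        // (assumed to be supported by the client version in use).
        props.setProperty("producer.delivery.timeout.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        streamsProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object is what would be passed to the KafkaStreams constructor together with the topology.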

How to save message into database and send response into topic eventually consistent?

I have the following RabbitMQ consumer:
Consumer consumer = new DefaultConsumer(channel) {
    @Override
    public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
        String message = new String(body, "UTF-8");
        sendNotificationIntoTopic(message);
        saveIntoDatabase(message);
    }
};
The following situation can occur:
The message was sent to the topic successfully.
The connection to the database was lost, so the database insert failed.
As a result we have a data inconsistency.
The expected result is that either both actions are executed successfully or neither is executed at all.
Any suggestions on how I can achieve this?
P.S.
Currently I have the following idea (please comment on it):
We can suppose that the broker doesn't lose any messages.
We have to be subscribed to the topic we want to send to.
1. Save the entry into the database and set the status field to 'pending'.
2. Attempt to send the data to the topic. If the send was successful, update the status field to 'success'.
3. We need a scheduled job that checks rows with pending status. At that moment 2 cases are possible:
3.1 The notification wasn't sent at all.
3.2 The notification was sent but the save into the database failed (the probability is very low, but it is possible).
So we have to distinguish these 2 cases somehow: we may store messages from the topic in a collection, and the job can check whether the message was accepted or not. If the job finds a message that corresponds to the database row, we update the status to "success". Otherwise we remove the entry from the database.
I think my idea has some weaknesses (for example, in a multi-node application we have to store messages in Hazelcast or an analog, but that is an additional hypothetical point of failure).
Here is an example of the Try Cancel Confirm pattern, https://servicecomb.apache.org/docs/distributed_saga_3/, that should be capable of dealing with your problem. You should tolerate some chance of double submission of the data via the queue. Here is an example:
1. Define an abstraction Operation and assign an ID plus a timestamp to the operation.
2. Write status Pending to the database (you can do this in the same step as 1).
3. Write a listener that polls the database for all operations with status pending that are older than "timeout".
4. For each pending operation, send the data via the queue with the assigned ID.
5. The recipient side should be aware of the ID; if the ID has already been processed, nothing should happen.
6A. If you need to be 100% sure that the operation has completed, you need a second queue where the recipient side posts a message "ID - DONE". If such consistency is not necessary, skip this step. Alternatively it can post "ID - FAILED" with the reason for the failure.
6B. The submitting side either waits for a message from 6A or completes the operation by writing status DONE to the database.
7. Once a certain timeout has passed or a certain retry limit has been reached, you write status FAIL to the operation.
8. You can potentially send a message to the recipient side to roll back the operation with that ID.
Notice that none of these steps involves a technical transaction. You can do this with a non-transactional database.
What I have written is a variation of the Try Cancel Confirm pattern, where each recipient of a message should be aware of how to manage its own data.
1. In the listener, save a database row with the field status='pending'.
2. Another job (a separate thread) obtains all pending rows from the DB and does the following for each row:
2.1 send data to topic
2.2 save into database
If we fail at step 1, everything is OK: the data is in a consistent state because the job won't know anything about that data.
If we fail at step 2.1, no problem: the next job invocation will attempt to handle it.
If we fail at step 2.2, the next job invocation will handle the same data again. At first glance you might think that this is a problem. But your consumer has to be idempotent: it has to recognize that the message was already processed and skip the processing. This requirement follows from the fact that message brokers guarantee that a message will be delivered AT LEAST ONCE. So our consumers have to be ready for duplicated messages anyway. No problem again.
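The pending-row job and the idempotent consumer described above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the map plays the role of the database table, the list plays the role of the topic, and the crashBeforeSave flag simulates a failure between steps 2.1 and 2.2.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PendingRowsSketch {
    enum Status { PENDING, DONE }

    static final class Row {
        final String payload;
        Status status = Status.PENDING;
        Row(String payload) { this.payload = payload; }
    }

    final Map<String, Row> table = new LinkedHashMap<>();  // stand-in for the DB table
    final List<String[]> topic = new ArrayList<>();        // stand-in for the topic

    /** Step 1: the listener only stores the row as pending. */
    void onIncoming(String id, String payload) {
        table.put(id, new Row(payload));
    }

    /** Step 2: send every pending row (2.1), then mark it as done (2.2). */
    void runJob(boolean crashBeforeSave) {
        for (Map.Entry<String, Row> entry : table.entrySet()) {
            Row row = entry.getValue();
            if (row.status != Status.PENDING) continue;
            topic.add(new String[] {entry.getKey(), row.payload});  // 2.1
            if (crashBeforeSave) return;  // simulated failure: the row stays pending
            row.status = Status.DONE;     // 2.2
        }
    }

    long pendingCount() {
        return table.values().stream().filter(r -> r.status == Status.PENDING).count();
    }

    /** Consumer that skips IDs it has already processed. */
    static final class IdempotentConsumer {
        private final Set<String> seenIds = new HashSet<>();
        final List<String> processed = new ArrayList<>();

        void handle(String id, String payload) {
            if (!seenIds.add(id)) return;  // duplicate delivery, ignore
            processed.add(payload);
        }
    }

    void deliverAll(IdempotentConsumer consumer) {
        for (String[] message : topic) consumer.handle(message[0], message[1]);
        topic.clear();
    }

    public static void main(String[] args) {
        PendingRowsSketch sketch = new PendingRowsSketch();
        sketch.onIncoming("op-1", "hello");
        sketch.runJob(true);   // sent, but the status update is lost
        sketch.runJob(false);  // sent again, status updated
        IdempotentConsumer consumer = new IdempotentConsumer();
        sketch.deliverAll(consumer);
        System.out.println(consumer.processed + " pending=" + sketch.pendingCount());
    }
}
```

The duplicate produced by the simulated failure reaches the consumer twice but is processed once, which is the property the answer relies on.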
Here's the pseudocode for how I'd do it (assuming the DAO layer has transactional capability and your messaging layer doesn't):
// Start a transaction
try {
    String message = new String(body, "UTF-8");
    // Ordering is important here: I'm assuming the database has commit and rollback capabilities, but the messaging system doesn't.
    saveIntoDatabase(message);
    sendNotificationIntoTopic(message);
} catch (MessageDeliveryException e) {
    // Roll back the transaction
    // Throw a domain-specific exception
}
// Commit the transaction
Scenarios:
1. If the database fails, the message won't be sent, as the exception will break the code flow.
2. If the database call succeeds and the messaging system fails to deliver, catch the exception and roll back the database changes.
All the actions necessary for logging and replaying the failures can live outside this method.
If there is enough time to modify the design, it is recommended to use JTA-like APIs to manage a two-phase commit. Even WebLogic and WebSphere support XA resources for two-phase commit.
If time is short, I suggest proceeding as below to reduce the failure gap.
1. Send data to the topic (no commit). In case the topic is down, retry at an interval.
2. Write data into the DB
3. Commit DB
4. Commit Topic
Here a failure will happen only when step 4 fails. It will result in the same message being sent again, so the receiving system will receive a duplicate message. Each message has a unique messageID and CorrelationID in the JMS 2.0 structure, so finding duplicates is fairly straightforward (but this has to be handled in the receiving system).
Both approaches will work in a clustered environment as well.
Strictly for your case, I thought the steps below might help to overcome your issue.
Subscribe a listener, listener-1, to your topic.
Process-1
1. Add a DB entry with status 'to be sent' for message msg-1.
2. Send message msg-1 to the topic. Retry sending in case of any topic failure.
3. If step 2 fails after a certain number of retries, process-1 has to resend msg-1 before sending any new messages, OR step 1 has to be rolled back.
Listener-1
Using the subscribed listener, read the reference (messageID/correlationID) from the topic, update the DB status to SENT, and read/remove the message from the topic. If the reference read succeeds and the DB update fails, the topic still has the message, so the next read will update the DB. If the DB update succeeds and the message removal fails, the listener will read the message again and try an update that is already done, so it can be ignored after validation.
If the listener itself is down, the topic will hold the messages until the listener reads them. Until then, SENT messages will remain in status 'to be sent'.

Failure retry in EJB 3

We have recently migrated our EJB 2 application to EJB 3. In EJB 2, if a failure occurred in onMessage, the container was able to retry the message a configured number of times; in EJB 3 there is no such option. Could someone help with this?
Can we explicitly sleep the thread and explicitly retry in onMessage?
Thanks in advance.
If you are using @TransactionManagement(value=TransactionManagementType.CONTAINER), that is, container-managed transactions, then on an exception the message will be retried 10 times before it is sent to the DLQ.
If you are not using the ActiveMQ RA, the following two documents can be useful if you have container-managed transactions: "Redelivery and Exception Handling" and "Managing Rolled Back, Recovered, Redelivered, or Expired Messages".
If you are using the ActiveMQ resource adapter, you can use the MaximumRedeliveries resource adapter property.
Otherwise, if you want to retry only on a specific exception, you can catch the exception and send the message back to the same queue with the additional property described in "Activemq consume message after delay interval". Also, set the retry count in a message header so that you can keep track of the retries.

ActiveMQ : dead letter queue keeps my messages order

I use ActiveMQ as a broker to deliver messages. These messages are intended to be written to a database. Sometimes the database is unreachable or down. In that case, I want to roll back my message so it can be retried later, and I want to continue reading other messages.
This code works fine, except for one point: the rolled-back message blocks me from reading the others:
private Connection getConnection() throws JMSException {
    RedeliveryPolicy redeliveryPolicy = new RedeliveryPolicy();
    redeliveryPolicy.setMaximumRedeliveries(3); // will retry 3 times to dequeue rollbacked messages
    redeliveryPolicy.setInitialRedeliveryDelay(5 * 1000); // will wait 5s to read that message
    ActiveMQConnectionFactory connectionFactory = new ActiveMQConnectionFactory(user, password, url);
    Connection connection = connectionFactory.createConnection();
    ((ActiveMQConnection) connection).setUseAsyncSend(true);
    ((ActiveMQConnection) connection).setDispatchAsync(true);
    ((ActiveMQConnection) connection).setRedeliveryPolicy(redeliveryPolicy);
    ((ActiveMQConnection) connection).setStatsEnabled(true);
    connection.setClientID("myClientID");
    return connection;
}
I create my session this way :
session = connection.createSession(true, Session.SESSION_TRANSACTED);
Rollback is easy to ask :
session.rollback();
Let's imagine I have 4 messages in my queue :
1: ok
2: KO (will need to be treated again : the message I want to rollback)
3: ok
4: ok
My consumer will do (linear sequence) :
commit 1
rollback 2
wait 5s
rollback 2
wait 5s
rollback 2
put 2 in dead letter queue (ActiveMQ.DLQ)
commit 3
commit 4
But I want :
commit 1
rollback 2
commit 3
commit 4
wait 5s
rollback 2
wait 5s
rollback 2
wait 5s
put 2 in dead letter queue (ActiveMQ.DLQ)
So, how can I configure my consumer to retry my rolled-back messages later, without blocking the others?
This is actually expected behavior, because message retries are handled by the client, not the broker. Since you have 1 session bound and your retry policy is set up for 3 retries before the DLQ, the entire retry process blocks that particular thread.
So, my first question is: if the database insert fails, wouldn't you expect all of the rest of your DB inserts to fail for a similar reason?
If not, the way to get around that is to set the retry policy for that queue to 0 retries, with a specific DLQ, so that messages fail immediately and go into the DLQ. Then have another process that pulls from the DLQ every 5 seconds and reprocesses the message and/or puts it back in the main queue for processing.
Are you using the <strictOrderDispatchPolicy /> in the ActiveMQ XML config file? I'm not sure if this will affect the order of messages for redelivery or not. If you are using strict order dispatch, try commenting out that policy to see if that changes the behavior.
Bruce
I had the same problem. I didn't find a solution here, so I decided to post the one I found for people struggling with the same issue.
This is fixed since version 5.6 when you set the property nonBlockingRedelivery to true on the connection factory:
<property name="nonBlockingRedelivery" value="true" />
