In a Quarkus process we perform the steps below once a message is polled from Kafka:
Thread.sleep(30000) - due to business logic
Call a 3rd-party API
Call another 3rd-party API
Insert data into the DB
About once a day the process hangs after throwing a TooManyMessagesWithoutAckException.
2022-12-02 20:02:50 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Sleep over!
Below is an example of how we are consuming the messages:
@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    String msgKey = (String) person
            .getMetadata(IncomingKafkaRecordMetadata.class).get()
            .getKey();
    // ...
    return person.ack();
}
As per the logs, only 30 seconds have passed since the event was polled, yet the exception says the Kafka acknowledgement has been pending for 60 seconds.
I checked the whole day's logs around the time the error was thrown to see if the REST API calls took more than 30 seconds to fetch the data, but I wasn't able to find any.
We haven't done any specific Kafka configuration other than topic name, channel name, serializer, deserializer, group id and managed Kafka connection details.
There are 4 partitions in this topic with replication factor of 3. There are 3 pods running for this process.
We're unable to reproduce this issue in the Dev and UAT environments.
I checked the configuration options but couldn't find anything that might help: Quarkus Kafka Reference
mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer
Is it possible that Quarkus is acknowledging the messages in batches, and by that time the waiting time has already reached the threshold?
Please comment if there are any other possibilities for this issue.
I have similar issues in our production environment, running different Quarkus services against a simple 3-node Kafka cluster, and I have researched the problem a lot - with no clear answer. At the moment, I have two approaches to this problem:
Make sure you really ack or nack the Kafka message in your code. Is every exception caught and answered with a person.nack(exception) (or a person.ack(), depending on your failure strategy)? Make sure it is. The throttled-commit exception is thrown if neither ack() nor nack() is performed; the problem mostly occurs when nothing happens at all.
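To make that concrete, here is a rough sketch of what that looks like in a consumer like yours (process() is just a placeholder for your business logic and 3rd-party calls):
@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    try {
        process(person.getPayload()); // placeholder for the sleep, API calls and DB insert
        return person.ack();          // success path acknowledges the record
    } catch (Exception e) {
        return person.nack(e);        // failure path must answer too, or the record is never acked
    }
}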
When this does not help, I switch the commit-strategy to "latest":
mp.messaging.incoming.my-channel.commit-strategy=latest
This is a little slower, because batched commits are disabled, but it runs stably in my case. If you don't know about the commit strategies and the default, catch up with the article on the topic by Escoffier.
I am aware that this does not solve the root cause, but it has helped in desperate times. The problem has to be that one or more queued messages are not acknowledged in time, but I can't tell you why. Maybe the application logic is too slow, but - like you - I have a hard time reproducing this locally. You can also try to increase the 60-second threshold with throttled.unprocessed-record-max-age.ms and see for yourself whether this helps. In my case, it did not. Maybe someone else can share their insights on this problem and provide you with a real solution.
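For reference, raising that threshold is just another channel attribute, set the same way as the commit strategy above (120000 is only an example value):
mp.messaging.incoming.my-channel.throttled.unprocessed-record-max-age.ms=120000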
Restart strategy working fine, but losing messages when the entire job manager restarts after the max retry attempts.
For example, I sent 2 messages in a row. The first message hits an exception, so it is retried up to the max attempts I configured. After that the entire job manager restarts, and at that point I lose the second message.
streamExecutionEnvironment.setRestartStrategy(
        RestartStrategies.fixedDelayRestart(4, // number of restart attempts
                Time.of(4, TimeUnit.SECONDS) // delay
        ));
Once the job manager comes up again, I expected it to consume the second message, but it doesn't; it seems we are losing it. Could anyone help me out with this situation?
Without more information, it's not clear what's happening, or why. But I'll throw out some guesses, and perhaps one will be correct.
You could be stuck in a fail -> restart -> fail again loop. If you don't skip over poison pills (see the sketch after this list), Flink will:
throw an exception caused by a poison pill (a record that can't be processed for some reason)
restart
try again to consume the poison pill, and fail again
restart again
...
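One common way to skip poison pills is to do the fragile parsing in a flatMap and simply drop the records that fail. A sketch, where MyEvent and its parse method stand in for your own record type and deserialization logic:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

public class SkipPoisonPills implements FlatMapFunction<String, MyEvent> {
    @Override
    public void flatMap(String raw, Collector<MyEvent> out) {
        try {
            out.collect(MyEvent.parse(raw)); // hypothetical parser
        } catch (Exception e) {
            // log or side-output the bad record instead of failing the whole job
        }
    }
}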
Or you could be using a source that doesn't participate in checkpointing.
Or perhaps your source isn't rewindable. Flink's approach to fault tolerance requires that the sources can be rewound, and then any input records consumed since the last checkpoint will be re-ingested after a restart. But some sources can't support this (e.g., sockets, or http endpoints).
Or you could be relying on the Job Manager for JobManagerCheckpointStorage, in which case the checkpoints are lost when the JM restarts.
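If that is the case, pointing checkpoints at durable storage is a small change. A sketch, assuming Flink 1.13+ (where CheckpointConfig#setCheckpointStorage is available) and an example S3 path:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // checkpoint every 60 seconds
// keep checkpoints outside the JobManager so they survive a JM restart
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");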
A job failure shouldn't cause a Job Manager restart, and it sounds like your cluster is probably not set up to handle recovery after a Job Manager failure. You could read the docs on high availability for your specific resource provider -- the entry point is here.
In our infrastructure we are running Kafka with 3 nodes and have several Spring Boot services running in OpenShift. Some of the communication between the services happens via Kafka. For the consumers/listeners we are using the @KafkaListener Spring annotation with a unique group ID, so that each instance (pod) consumes all the partitions of a topic:
@KafkaListener(topics = "myTopic", groupId = "group#{T(java.util.UUID).randomUUID().toString()}")
public void handleMessage(String message) {
    doStuffWithMessage(message);
}
For the configuration we are using pretty much the default values. For the consumers, all we have is:
spring.kafka.consumer:
  auto-offset-reset: latest
  value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
  key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
Sometimes we face the unfortunate situation where all of our Kafka nodes are briefly down, which results in the consumers unregistering, as logged by org.apache.kafka.common.utils.AppInfoParser:
App info kafka.consumer for consumer-group5c327050-5b05-46fb-a7be-c8d8a20d293a-1 unregistered
Once the nodes are up again, we would expect the consumers to register again; however, that is not the case. So far we have no idea why they fail to do so. For now we are forced to restart the affected pods when this issue occurs. Has anybody had a similar issue before, or does anyone have an idea what we might be doing wrong?
Edit: We are using the following versions
spring-boot 2.6.1
spring-kafka 2.8.0
apache kafka 2.8.0
In the Kafka config you can use the reconnect.backoff.max.ms parameter to set the maximum backoff in milliseconds between connection retries.
Additionally, set the reconnect.backoff.ms parameter to the base number of milliseconds to wait before retrying to connect.
If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum.
Kafka documentation https://kafka.apache.org/31/documentation/#streamsconfigs
If you set the maximum backoff to something fairly high, like a day, the connection will keep being reattempted with exponentially increasing intervals (50, 500, 5000, 50000 ms and so on), capped at that maximum.
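With Spring Boot, these client properties can be passed through to the consumer, for example in application.properties (the values below are only illustrative):
spring.kafka.consumer.properties.reconnect.backoff.ms=1000
spring.kafka.consumer.properties.reconnect.backoff.max.ms=60000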
We did some more digging in our logs and found the underlying issue that causes the consumer(s) to be stopped.
Authentication/Authorization Exception and no authExceptionRetryInterval set
So apparently the consumer is getting an Authentication/Authorization exception when trying to reconnect to the currently unavailable Kafka nodes, and since we did not set authExceptionRetryInterval there won't be any retries and the consumer (listener container) is stopped. https://docs.spring.io/spring-kafka/api/org/springframework/kafka/listener/ConsumerProperties.html#setAuthExceptionRetryInterval(java.time.Duration)
Set the interval between retries after an AuthenticationException or org.apache.kafka.common.errors.AuthorizationException is thrown by KafkaConsumer. By default the field is null and retries are disabled. In such case the container will be stopped. The interval must be less than the max.poll.interval.ms consumer property.
We are quite confident that setting authExceptionRetryInterval will solve our problem.
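For the record, this is roughly how the interval can be set when defining your own listener container factory (a sketch; the consumer factory is assumed to come from your existing configuration and 10 seconds is an arbitrary value):
import java.time.Duration;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // retry instead of stopping the container after an authentication/authorization exception
    factory.getContainerProperties().setAuthExceptionRetryInterval(Duration.ofSeconds(10));
    return factory;
}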
I'm trying to interrupt the current thread if an HTTP request times out. I have set up a PlatformTransactionManager for Kafka transactions as a bean, and I'm using the @Transactional annotation at the method level. We are publishing messages to 3 topics. After publishing the message to the first topic, I call Thread.sleep(5000), and the current thread is interrupted from a filter if execution takes more than 6 seconds. So the call is getting interrupted, but the message is still getting published to Kafka. We are only producing the message; we are not consuming anything, yet we are able to see the message in our internal Kafka inspection tool. We are using KafkaTemplate.send() to send the message.
Producer records are always written to the log, even if rolled back. There is a special record in the slot following the published record(s) to indicate whether the transaction was committed or rolled back.
Consumers' isolation.level is read_uncommitted by default; you need to set it to read_committed to avoid seeing rolled-back records.
https://kafka.apache.org/documentation/#consumerconfigs_isolation.level
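Concretely, any consumer (or inspection tool) that should hide rolled-back records needs something like the following in its configuration; the broker and group values here are placeholders:
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
// only return records from committed transactions
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);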
A streaming application was rolled out in production, and right after 10 days we are observing errors/warnings in the CustomProductionExceptionHandler for expired transactions that belong to an older day window.
FLOW :
INPUT TOPIC --> STREAMING APPLICATION(Produces stats and emits after day window closed) --> OUTPUT TOPIC
The producer keeps trying to publish records to the OUTPUT topic that belong to an already-expired older window, and it logs an error via the CustomProductionExceptionHandler.
I have reduced the batch size and kept the default, but this change is not yet promoted to production.
CustomProductionExceptionHandler implementation: to avoid the stream dying due to NetworkException or TimeoutException.
With this implementation the producer does not retry, and in case of any exception it just CONTINUEs. On the other hand, upon returning FAIL the stream thread dies and does not auto-restart. Need suggestions.
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CustomProductionExceptionHandler implements ProductionExceptionHandler {
    private static final Logger logger = LoggerFactory.getLogger(CustomProductionExceptionHandler.class);
    @Override
    public void configure(final Map<String, ?> configs) { }
    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        String recordKey = new String(record.key());
        String recordVal = new String(record.value());
        String recordTopic = record.topic();
        logger.error("Kafka message marked as processed although it failed. Message: [{}:{}], destination topic: [{}]",
                recordKey, recordVal, recordTopic, exception);
        // CONTINUE drops the failed record and keeps the stream thread alive
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}
Exception:
2019-12-20 16:31:37.576 ERROR com.jpmc.gpg.exception.CustomProductionExceptionHandler.handle(CustomProductionExceptionHandler.java:19) kafka-producer-network-thread | profile-day-summary-generator-291e69b1-5a3d-4d49-8797-252c2ae05607-StreamThread-19-producerid - Kafka message marked as processed although it failed. Message: [{"statistics":{}], destination topic: [OUTPUT-TOPIC]
org.apache.kafka.common.errors.TimeoutException: Expiring * record(s) for TOPIC:1086149 ms has passed since batch creation
Trying to get answers to the questions below.
1) Why is the producer trying to publish older transactions to the OUTPUT topic for which the day window is already closed?
Example - the producer is trying to send a 12/09 day-window transaction, but the currently open window is 12/20.
2) Without the CustomProductionExceptionHandler returning ProductionExceptionHandlerResponse.CONTINUE, the stream threads could have died.
Is there any way the producer can retry in case of a NetworkException or TimeoutException and then continue, instead of the stream thread dying?
The problem with specifying ProductionExceptionHandlerResponse.CONTINUE in the CustomProductionExceptionHandler is that in case of any exception it skips publishing that record to the output topic and proceeds with the next records. No resiliency.
1) It's not really possible to answer this question without knowing what your program does. Note that, in general, Kafka Streams works on event time and handles out-of-order data.
2) You can configure all internally used clients of a Kafka Streams application (i.e., consumer, producer, admin client, and restore consumer) by specifying the corresponding client configuration in the Properties you pass into KafkaStreams. If you want different configs for different clients, you can prefix them accordingly, i.e., producer.retries instead of retries. Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#ak-consumers-producer-and-admin-client-configuration-parameters
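For example, bumping the producer's retries and delivery timeout from the Streams configuration could look roughly like this (a sketch; the values are placeholders, not recommendations):
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-day-summary-generator"); // from the log above
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");             // placeholder
// producer.-prefixed settings are forwarded only to the internal producer
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 300_000);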
I would like to fine-tune the KafkaTemplate options for the producer, to handle the various failover and recovery scenarios as optimally as possible.
We have our KafkaProducerMessageHandler running in sync mode (i.e. waiting for the send operation results - see: acks below). Note: this is necessary in the current version of Kafka to enable ErrorChannel reporting.
Here are the options I have chosen:
acks = 1 (we are performing basic acknowledgement from the Kafka broker leader)
retries = 10
max.in.flight.requests.per.connection = 1 (this will keep the messages in order if an error state is reached)
linger.ms = 1 (not sure about this one or whether it is relevant?)
request.timeout.ms = 5000 (five seconds for the timeout; this works with the retries, so a total of 50 seconds before the message is deemed to have failed and then appears on the error channel)
enable.idempotence = false (again, not sure about this option?)
retry.backoff.ms = 100 (this is the default - again, is it worth playing with?)
How do these values sound?
Is there anything I am missing?
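For reference, this is roughly how those choices look as producer properties (a sketch; the broker address is a placeholder and the values are simply the ones listed above):
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Map<String, Object> config = new HashMap<>();
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
config.put(ProducerConfig.ACKS_CONFIG, "1");
config.put(ProducerConfig.RETRIES_CONFIG, 10);
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
config.put(ProducerConfig.LINGER_MS_CONFIG, 1);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 5000);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, false);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
This map would then back the DefaultKafkaProducerFactory behind the KafkaTemplate.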
This is an old post about Kafka producer tuning: http://ingest.tips/2015/07/19/tips-for-improving-performance-of-kafka-producer/
TLDR version:
Pay attention to the batch.size and linger.ms parameters.
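In ProducerConfig terms that boils down to something like this (a sketch; the values are illustrative throughput-oriented settings, not recommendations):
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties tuning = new Properties();
// a larger batch plus a small linger lets the producer pack more records per request
tuning.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // bytes; the default is 16384
tuning.put(ProducerConfig.LINGER_MS_CONFIG, 5);      // ms to wait for a batch to fill before sending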