In Spring Batch remote chunking, consider a situation where the worker consumers are not available. In that case the manager publishes a number of items equal to the throttle limit and waits for replies from the workers. I have the receiveTimeout for the MessagingTemplate set to 5000. The loop in ChunkMessageChannelItemWriter's write() runs infinitely because every 5 seconds the MessagingTemplate's receive() times out and returns null. It doesn't do anything when the result is null and thus falls into an infinite loop.
In this scenario, the Job remains in an UNKNOWN state. Is there any way to time out on the manager's end if it doesn't get any reply from the workers for some amount of time, and mark the job as failed?
You can set the maxWaitTimeouts property on the ChunkMessageChannelItemWriter for that.
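For illustration, a minimal sketch of a manager-side writer configuration; the item type String and the bean wiring are placeholders, while maxWaitTimeouts and the MessagingTemplate receiveTimeout are the actual Spring Batch/Spring Integration properties:

```java
import org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.core.MessagingTemplate;
import org.springframework.messaging.PollableChannel;

@Configuration
public class ManagerWriterConfig {

    @Bean
    public ChunkMessageChannelItemWriter<String> itemWriter(MessagingTemplate messagingTemplate,
                                                            PollableChannel replies) {
        ChunkMessageChannelItemWriter<String> writer = new ChunkMessageChannelItemWriter<>();
        writer.setMessagingOperations(messagingTemplate); // template with receiveTimeout = 5000 ms
        writer.setReplyChannel(replies);
        // Give up after 12 consecutive receive timeouts (~60 s with a 5 s timeout)
        // instead of waiting indefinitely for worker replies.
        writer.setMaxWaitTimeouts(12);
        return writer;
    }
}
```

Once the limit is hit, the writer stops waiting and raises an exception, so the step fails rather than leaving the job hanging in the UNKNOWN state.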
Imagine you have the following task structure:
Task1
Task2: 1 million separate independent Subtask[i] that can run concurrently
Task3: must run once after ALL Task2 subtasks have completed
And all of Task1, Subtask[i] and Task3 are represented by MQ messages.
How can this be solved on ActiveMQ, especially the triggering of a Task3 message once all subtasks are complete?
I know it's not a queueing problem but a fork-join problem. Let's say the environment dictates you must use ActiveMQ for it.
Using ActiveMQ features, dynamic queues and consumers, stuff like that, is allowed. Using external counters, like a database row representing Task2's progress, is not allowed.
Hidden in this fork-join problem is a state management and observability challenge. Since the database is ruled out, you have to rely on something in-memory or on-queue.
Create a unique id for the task run -- something short, but with enough space to not collide, like an airline locator code -- e.g. 34FDSX
Send all messages for the task to a queue://TASK.34FDSX.DATA
Send a control message to queue://TASK.34FDSX.CONTROL that contains the task id and the expected total number of messages (including each messageId would be helpful too)
When consumers from queue://TASK.34FDSX.DATA complete their work, they should send a 'done' message to queue://TASK.34FDSX.DONE with their messageId or some other identifier.
The consumers for the .CONTROL queue and the .DONE queue should be the same process, which can track the expected and total completed tasks. Once everything is completed, it can fire the event to trigger Task #3.
This approach keeps everything 'online', and you can also time out the .CONTROL and .DONE reader if too much time passes before the task completes.
Queue deletion can be done using ActiveMQ destination GC, or as a clean-up step in the .CONTROL/.DONE reader in the cases where everything completes successfully.
Advantages:
No infinite blocking consumers
No infinite open transactions
State of the TASK is online and observable via the presence of queues and queue metrics-- queue size, enqueue count, dequeue count
The entire solution can be multi-threaded; the only requirement is that, for a given task, the .CONTROL/.DONE listener is the same consumer, but multiple tasks can have individual .CONTROL/.DONE listeners to scale. A sketch of such a listener follows.
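A minimal sketch of the combined .CONTROL/.DONE tracker, assuming plain JMS with the ActiveMQ 5.x client; the broker URL, the expectedCount message property, the timeouts, and the TASK3 trigger queue are all illustrative assumptions:

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TaskTracker {

    public static void main(String[] args) throws Exception {
        String taskId = "34FDSX"; // the unique run id from step 1
        Connection conn = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
        try {
            conn.start();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Read the control message to learn how many DONE messages to expect.
            MessageConsumer control = session.createConsumer(
                    session.createQueue("TASK." + taskId + ".CONTROL"));
            Message ctrl = control.receive(30_000); // guard against a lost control message
            if (ctrl == null) throw new IllegalStateException("no control message for " + taskId);
            int expected = ctrl.getIntProperty("expectedCount"); // property name is an assumption

            // Count 'done' messages; time out if the task stalls.
            MessageConsumer done = session.createConsumer(
                    session.createQueue("TASK." + taskId + ".DONE"));
            int completed = 0;
            while (completed < expected) {
                if (done.receive(60_000) == null) {
                    throw new IllegalStateException("task " + taskId + " stalled");
                }
                completed++;
            }

            // All subtasks accounted for: fire the Task 3 trigger.
            session.createProducer(session.createQueue("TASK3"))
                   .send(session.createTextMessage("start:" + taskId));
        } finally {
            conn.close();
        }
    }
}
```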
The question here is a bit vague so my answer will have to be a bit vague as well.
Each of the million independent subtasks for "Task 2" can be represented by a single message. All these messages can be in the same queue. You can spin up as many consumers as you want and process all these messages (i.e. perform all the subtasks). Just ensure that these consumers either use client-acknowledge mode or a transacted session so that the message is not removed from the queue until they are done processing the message. Once there are no more messages in the queue then you know "Task 2" is done.
To detect when the queue is empty you can have a "special" consumer on the queue that periodically opens a transacted session and tries to consume a message from the queue. If the consumer receives a message then you can rollback the transacted session to put the message back on the queue and you know that the queue is not empty (i.e. "Task 2" is not done). If the consumer doesn't receive a message then you know the queue is empty and you can send another message indicating this. You could launch this special consumer as part of "Task 2" after all the messages for the subtasks have been sent to avoid detecting an empty queue prematurely.
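A minimal sketch of that "special" consumer, assuming plain JMS with the ActiveMQ 5.x client; the broker URL, queue name, timeouts, and polling interval are placeholders:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class EmptyQueueDetector {

    // Returns true if a short transacted receive() finds nothing on the queue.
    static boolean isEmpty(Connection conn, String queueName) throws JMSException {
        Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
        try {
            MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
            Message m = consumer.receive(1_000); // short timeout; tune as needed
            if (m != null) {
                session.rollback(); // put the message back; "Task 2" is not done yet
                return false;
            }
            return true;
        } finally {
            session.close();
        }
    }

    public static void main(String[] args) throws Exception {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection conn = cf.createConnection();
        try {
            conn.start();
            while (!isEmpty(conn, "TASK2.SUBTASKS")) { // queue name is an assumption
                Thread.sleep(10_000); // poll every 10 seconds
            }
            // Queue drained: publish the "Task 2 complete" message here.
        } finally {
            conn.close();
        }
    }
}
```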
To be clear, this is a simple solution. You could certainly add more complexity depending on your requirements, but your question just outlined the basic problem so it's unclear what other requirements you have (if any).
Scenario/Use Case:
I have a Spring Boot application using Spring for Kafka to send messages to Kafka topics. Upon completion of a specific event (triggered by an http request), a new thread is created (via Spring @Async) which calls kafkaTemplate.send() and registers a callback on the ListenableFuture it returns. The original thread which handled the http request returns a response to the calling client and is freed.
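For reference, a minimal sketch of that publishing pattern; the topic name, payload type, and class name are placeholders, and it assumes the pre-3.0 spring-kafka API where send() returns a ListenableFuture (plus @EnableAsync configured elsewhere):

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.util.concurrent.ListenableFuture;
import org.springframework.util.concurrent.ListenableFutureCallback;

@Service
public class EventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Runs on a task-executor thread, so the HTTP thread returns immediately.
    @Async
    public void publish(String payload) {
        ListenableFuture<SendResult<String, String>> future =
                kafkaTemplate.send("events-topic", payload);
        future.addCallback(new ListenableFutureCallback<SendResult<String, String>>() {
            @Override
            public void onSuccess(SendResult<String, String> result) {
                // log success
            }

            @Override
            public void onFailure(Throwable ex) {
                // log failure / route to dead-letter handling
            }
        });
    }
}
```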
Normal Behavior:
Under normal application load I've verified that the individual messages are all published to the topic as desired (application log entries on callback success or failure, as well as viewing the messages in the topic on the kafka cluster). If I bring down all kafka brokers for 3-5 minutes and then bring the cluster back online, the application's publisher immediately re-establishes its connection to kafka and proceeds with publishing messages.
Problem Behavior:
However, when performing load testing, if I bring down all kafka brokers for 3-5 minutes and then bring the cluster back online, the Spring application's publisher continues to show failures for all publish attempts. This continues for approximately 7 hours, at which point the publisher is able to successfully re-establish communication with kafka again (usually this is preceded by a broken pipe exception, but not always).
Current Findings:
While performing the load test, for approx. 10 minutes, I connected to the application using JConsole and monitored the producer metrics exposed via kafka.producer. Within the first approx. 30 seconds of heavy load, buffer-available-bytes decreases until it reaches 0 and stays at 0. waiting-threads remains between 6 and 10 (it alternates every time I hit refresh), and buffer-available-bytes remains at 0 for approx. 6.5 hours. After that, buffer-available-bytes shows all of the originally allocated memory restored, but kafka publish attempts continue failing for approx. another 30 minutes before the desired behavior finally returns.
Current Producer Configuration
request.timeout.ms=3000
retries=2
max.in.flight.requests.per.connection=1
max.block.ms=10000
retry.backoff.ms=3000
All other properties are using their default values
Questions:
Given my use case, would altering batch.size or linger.ms have any positive impact in terms of eliminating the issue encountered under heavy load?
Given that I have separate threads all calling kafkaTemplate.send() with separate messages and callbacks, and I have max.in.flight.requests.per.connection set to 1, are batch.size and linger.ms ignored beyond limiting the size of each message? My understanding is that no batching is actually occurring in this scenario and that each message is sent as a separate request.
Given that I have max.block.ms set to 10 seconds, why does buffer memory remain utilized for so long, and why do all messages continue to fail to be published for so many hours? My understanding is that after 10 seconds each new publish attempt should fail and invoke the failure callback, which in turn frees up the associated thread.
Update:
To try and clarify thread usage: I'm using a single producer instance, as recommended in the JavaDocs. There are threads such as https-jsse-nio-22443-exec-* which handle incoming https requests. When a request comes in, some processing occurs, and once all non-kafka related logic completes, a call is made to a method in another class decorated with @Async. This method makes the call to kafkaTemplate.send(). The response back to the client is shown in the logs before the publish to kafka is performed (this is how I'm verifying it's being performed via a separate thread, as the service doesn't wait to publish before returning a response).
There are task-scheduler-* threads which appear to be handling the callbacks from kafkatemplate.send(). My guess is that the single kafka-producer-network-thread handles all of the publishing.
My application was making an http request and sending each message to a dead-letter table on a database platform upon failure of each kafka publish. The same threads being spun up to perform the publish to kafka were being re-used for this call to the database. I moved the database call logic into another class and decorated it with its own @Async and a custom TaskExecutor. After doing this, I've monitored JConsole and can see that the calls to kafka appear to be re-using the same 10 threads (TaskExecutor: core pool size 10, queue capacity 0, max pool size 80) and the calls to the database service are now using a separate thread pool (TaskExecutor: core pool size 10, queue capacity 0, max pool size 80) which is consistently closing and opening new threads but staying at a relatively constant number of threads. With this new behavior, buffer-available-bytes remains at a healthy constant level and the application's kafka publisher successfully re-establishes connection once brokers are brought back online.
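A hedged sketch of that separation, mirroring the pool sizes described above (the bean names and thread-name prefixes are invented):

```java
import java.util.concurrent.Executor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Pool dedicated to Kafka publishing.
    @Bean("kafkaPublishExecutor")
    public Executor kafkaPublishExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(80);
        executor.setQueueCapacity(0);
        executor.setThreadNamePrefix("kafka-publish-");
        executor.initialize();
        return executor;
    }

    // Separate pool for the dead-letter database calls.
    @Bean("deadLetterDbExecutor")
    public Executor deadLetterDbExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(80);
        executor.setQueueCapacity(0);
        executor.setThreadNamePrefix("dead-letter-db-");
        executor.initialize();
        return executor;
    }
}
```

The Kafka publish method would then be annotated with @Async("kafkaPublishExecutor") and the database dead-letter method with @Async("deadLetterDbExecutor"), so a backlog in one pool can no longer starve the other.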
I'm having a hard time understanding how this could be solved, so I'm asking here in the hope that someone else has already faced the same problems.
We are running our @KafkaListener with manual ack mode and a dead letter recoverer with a retry limit of 3.
Manual ack mode is needed due to the business logic: we don't ack a message and pause consuming for 5 minutes when certain circumstances arise (external dependencies).
Also we do need the dead letter queue for messages we cannot process for some reason.
Now the problem in manual ack mode is that our listener/consumer does not acknowledge the message when it hits the retry limit and moves it to the DL queue.
If the consumer service is restarted, it tries to consume the messages again and again moves them to the DL queue.
Any ideas how we could solve this issue?
Thanks and greetings from Hamburg!
I would try to avoid using manual acks if possible; perhaps by increasing max.poll.interval.ms instead.
If you use AckMode.MANUAL_IMMEDIATE, it will be safe to perform the commit directly on the Consumer in the error handler.
Subclass SeekToCurrentErrorHandler and override handle(). If super.handle() doesn't throw an exception, it means the retries are exceeded and you can commit the offset on the Consumer.
Alternatively, commitRecovered can be set to true on the SeekToCurrentErrorHandler instance provided to the listener container factory.
Refer to the documentation:
public void setCommitRecovered(boolean commitRecovered)
Set to true to commit the offset for a recovered record. The container must be configured with ContainerProperties.AckMode.MANUAL_IMMEDIATE. Whether or not the commit is sync or async depends on the container's syncCommits property.
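A minimal sketch of the commitRecovered route, assuming the spring-kafka 2.3+ API; the consumer factory wiring and generic types are placeholders:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class ListenerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTemplate<String, String> kafkaTemplate) {

        // Publish to the dead-letter topic after 3 attempts (1 initial + 2 retries)
        // and, because commitRecovered is true, commit the recovered record's
        // offset so a restart does not replay it.
        SeekToCurrentErrorHandler errorHandler = new SeekToCurrentErrorHandler(
                new DeadLetterPublishingRecoverer(kafkaTemplate),
                new FixedBackOff(0L, 2L));
        errorHandler.setCommitRecovered(true);

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // commitRecovered requires MANUAL_IMMEDIATE ack mode.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        factory.setErrorHandler(errorHandler);
        return factory;
    }
}
```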
Sometimes, due to some external problems, I need to requeue a message via basic.reject with requeue = true.
But I don't need to consume it immediately, because it will probably fail again within a short time. If I continuously requeue it, this may result in an infinite requeue loop.
So I need to consume it later, say one minute later.
And I need to know how many times the message has been requeued, so that I can stop requeuing it and instead reject it to mark it as failed to consume.
PS: I am using the Java client.
There are multiple solutions to point 1.
The first one is the one chosen by Celery (a Python producer/consumer library that can use RabbitMQ as a broker). Inside your message, add a timestamp at which the task should be executed. When your consumer gets the message, don't ack it yet; check its timestamp, and as soon as the timestamp is reached, the worker can execute the task. (Note that the worker can continue working on other tasks instead of waiting.)
This technique has some drawbacks: you have to increase the QoS per channel to an arbitrary value, and if your worker is already working on a long-running task, the delayed task won't be executed until the first task has finished.
A second technique is RabbitMQ-only and much more elegant. It takes advantage of dead-letter exchanges and message TTLs. You create a new queue which isn't consumed by anybody. This queue has a dead-letter exchange that forwards messages to the consumer queue. When you want to defer a message, ack it (or reject it without requeue) on the consumer queue and copy the message into the dead-lettered queue with a TTL equal to the delay you want (say, one minute). At (roughly) the end of the TTL, the deferred message will magically land in the consumer queue again, ready to be consumed. The RabbitMQ team has also made the Delayed Message Plugin; it is marked as experimental, yet fairly stable and potentially suitable for production use as long as the user is aware of its limitations, which are serious in terms of scalability and reliability in case of failover. So you might decide whether you really want to use it in production, or whether you prefer to stick to the manual way, which is limited to one TTL per queue.
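A minimal sketch of the manual TTL + dead-letter setup with the RabbitMQ Java client; all queue and exchange names, the 60-second TTL, and the payload are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class DelayedRequeueSetup {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        try (Connection conn = factory.newConnection();
             Channel ch = conn.createChannel()) {

            // The queue your consumers actually read from.
            ch.exchangeDeclare("work.exchange", "direct", true);
            ch.queueDeclare("work.queue", true, false, false, null);
            ch.queueBind("work.queue", "work.exchange", "work");

            // A "parking" queue nobody consumes. Messages expire after 60 s and
            // are dead-lettered back to the work exchange, i.e. requeued later.
            Map<String, Object> delayArgs = new HashMap<>();
            delayArgs.put("x-message-ttl", 60_000);
            delayArgs.put("x-dead-letter-exchange", "work.exchange");
            delayArgs.put("x-dead-letter-routing-key", "work");
            ch.queueDeclare("work.delay.60s", true, false, false, delayArgs);

            // To defer a failed message: ack (or reject without requeue) on the
            // consumer side, then republish its body to the delay queue.
            ch.basicPublish("", "work.delay.60s", null, "payload".getBytes());
        }
    }
}
```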
Point 2 just requires putting a counter in your message and handling it inside your app. You can choose to put this counter in a header or directly in the body.
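For example, a hedged sketch using a header-based counter; the x-retries header name, the limit of 5, and the queue names carried over from the previous sketch are all assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;

public class RetryCounter {

    private static final int MAX_RETRIES = 5;

    // Call this from your consumer when processing fails.
    static void handleFailure(Channel ch, long deliveryTag,
                              AMQP.BasicProperties props, byte[] body) throws Exception {
        Map<String, Object> headers =
                props.getHeaders() == null ? new HashMap<>() : new HashMap<>(props.getHeaders());
        int retries = headers.containsKey("x-retries")
                ? ((Number) headers.get("x-retries")).intValue() : 0;

        ch.basicAck(deliveryTag, false); // remove the failed delivery either way

        if (retries < MAX_RETRIES) {
            headers.put("x-retries", retries + 1);
            AMQP.BasicProperties newProps = props.builder().headers(headers).build();
            // Republish to the delay queue from the previous sketch.
            ch.basicPublish("", "work.delay.60s", newProps, body);
        } else {
            // Give up: route to a permanent dead-letter/parking-lot queue.
            ch.basicPublish("", "work.dead", props, body);
        }
    }
}
```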
Using Spring Boot and spring-boot-starter-amqp for my messaging app. My problem scenario is this:
First problem: I want to requeue my message only 5 times, after which the message should come out of the queue, if a business exception occurs (here I am focusing only on business exceptions).
Second problem: If the first use case is possible, can we somehow increase the requeue delay based on the number of attempts? Suppose on the first attempt a business exception occurs: it should requeue immediately, but on the second attempt it should requeue after 2 minutes, then 4, 6, 8, 10 minutes, with the delay increasing on each attempt.
Thanks
For the simple case (a fixed delay between attempts), you can configure the queue to send rejected messages to a dead letter exchange, and throw an AmqpRejectAndDontRequeueException.
Bind a queue to the dead letter exchange, set a time-to-live (TTL) on that queue, and configure it, in turn, with a dead letter exchange to which the original queue is bound. The message will be requeued after the TTL expires.
You would need to examine the x-death header to determine how many times the message has been around the loop. After your retries are exhausted, throw an ImmediateAcknowledgeAmqpException to discard the message.
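A minimal sketch of such a listener, assuming Spring AMQP and the x-death 'count' field (RabbitMQ 3.5+); the queue name and the limit of 5 are placeholders:

```java
import java.util.List;
import java.util.Map;
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.ImmediateAcknowledgeAmqpException;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class RetryAwareListener {

    private static final int MAX_ATTEMPTS = 5;

    @RabbitListener(queues = "work.queue") // queue name is an assumption
    public void onMessage(Message message) {
        if (deathCount(message) >= MAX_ATTEMPTS) {
            // Retries exhausted: ack and drop (or persist elsewhere first).
            throw new ImmediateAcknowledgeAmqpException("retries exhausted");
        }
        try {
            process(message);
        } catch (Exception e) {
            // Send to the TTL'd dead-letter queue; it returns after the TTL expires.
            throw new AmqpRejectAndDontRequeueException("business failure", e);
        }
    }

    // Reads the 'count' field of the first x-death entry.
    @SuppressWarnings("unchecked")
    private long deathCount(Message message) {
        Map<String, Object> headers = message.getMessageProperties().getHeaders();
        List<Map<String, Object>> xDeath = (List<Map<String, Object>>) headers.get("x-death");
        return (xDeath == null || xDeath.isEmpty())
                ? 0
                : ((Number) xDeath.get(0).get("count")).longValue();
    }

    private void process(Message message) { /* business logic goes here */ }
}
```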
For a variable delay, you have to republish the message yourself to a delayed-message exchange (there's a broker plugin for that), with the delay increasing on each retry.
Also see this answer.