I am trying to exhibit backpressure using spring-web-reactive, just as it is shown here with Akka: https://www.youtube.com/watch?v=oS9w3VenDW0
(Watch between 28:20 and 29:20.)
To try it out, I used the sample project below from GitHub: https://github.com/bclozel/spring-boot-web-reactive
After setting up the project, I added a new endpoint to HomeController.java as shown below:
@RequestMapping(value = "/longflux", produces = "application/stream+json")
public Flux<Long> longFlux() {
    return Flux.interval(Duration.ofMillis(10)).log();
}
Now, if I curl this endpoint and then suspend curl (CTRL+Z), backpressure should kick in as soon as the TCP buffers fill up, and the server should stop emitting events.
However, some time after the curl command is suspended, the following exception is thrown:
2017-02-16 08:49:48.480 ERROR 3500 --- [ timer-1] reactor.Flux.Interval.4 : onError(reactor.core.Exceptions$OverflowException: Could not emit value 2578 due to lack of requests)
2017-02-16 08:49:48.481 ERROR 3500 --- [ timer-1] reactor.Flux.Interval.4 :
reactor.core.Exceptions$OverflowException: Could not emit value 2578 due to lack of requests
at reactor.core.Exceptions.failWithOverflow(Exceptions.java:151) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at reactor.core.publisher.FluxInterval$IntervalRunnable.run(FluxInterval.java:98) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at reactor.core.scheduler.SingleTimedScheduler$TimedPeriodicScheduledRunnable.run(SingleTimedScheduler.java:394) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
I am not able to understand why, in the spring-web-reactive implementation, the request terminated with an exception some time after the curl command was suspended, whereas in the Akka example (as demonstrated in the YouTube link) the server stopped publishing events once the TCP buffer was full.
Flux.interval is a special case, since it's a hot source and time is not buffered by Reactor; this means that if your request cycle is slow due to backpressure and your interval source is producing faster, Reactor will emit an Error signal.
You can update this sample with an .onBackpressureDrop() operator to drop interval values when backpressure occurs. This should behave as expected.
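For example, the endpoint from the question could be rewritten like this (a minimal sketch of the suggestion above):

@RequestMapping(value = "/longflux", produces = "application/stream+json")
public Flux<Long> longFlux() {
    // Drop ticks the slow downstream has not requested instead of failing
    // with an OverflowException.
    return Flux.interval(Duration.ofMillis(10))
               .onBackpressureDrop()
               .log();
}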
There are many ways to illustrate backpressure, including:
delaying the subscription with a delay operator
simulating multiple slow clients (bandwidth and latency); see the curl sketch below
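One way to simulate a slow client, assuming the sample app runs on localhost:8080, is to throttle curl itself:

curl -N --limit-rate 1K http://localhost:8080/longflux

--limit-rate caps the transfer speed and -N disables curl's output buffering, so the server-side TCP buffer fills up and backpressure propagates back to the Flux.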
In a Quarkus process, we perform the steps below once a message is polled from Kafka:
Thread.sleep(30000) - Due to business logic
call a 3rd party API
call another 3rd party api
Inserting data in db
Almost every day, the process hangs after throwing a TooManyMessagesWithoutAckException.
2022-12-02 20:02:50 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Sleep over!
Below is an example of how we are consuming the messages:
#Incoming("my-channel")
#Blocking
CompletionStage<Void> consume(Message<Person> person) {
String msgKey = (String) person
.getMetadata(IncomingKafkaRecordMetadata.class).get()
.getKey();
// ...
return person.ack();
}
As per the logs, only 30 seconds have passed since the event was polled, but the exception about the Kafka acknowledgement not being sent for 60 seconds is thrown.
I checked the whole day's logs from when the error was thrown to see if any REST API calls took more than 30 seconds to fetch the data, but I wasn't able to find any.
We haven't done any specific Kafka configuration other than topic name, channel name, serializer, deserializer, group id and managed Kafka connection details.
There are 4 partitions in this topic with replication factor of 3. There are 3 pods running for this process.
We're unable to reproduce this issue in the Dev and UAT environments.
I checked the configuration options but couldn't find any that might help: Quarkus Kafka Reference
mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer
Is it possible that Quarkus is acknowledging the messages in batches, and by that time the waiting time has already reached the threshold?
Please comment if there are any other possibilities for this issue.
I have similar issues in our production environment, running different Quarkus services with a simple 3-node Kafka cluster, and I have researched the problem a lot, with no clear answer. At the moment, I have two approaches to this problem:
Make sure you really ack or nack the Kafka message in your code. Is every exception really caught and answered with a person.nack(exception) (or a person.ack(), depending on your failure strategy)? Make sure it is. The throttled exception is thrown if neither ack() nor nack() is performed; the problem mostly occurs when nothing happens at all.
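A minimal sketch of what that looks like, reusing the consume method from the question (process(...) is a hypothetical stand-in for the business logic):

@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    try {
        process(person.getPayload()); // sleep, 3rd-party calls, DB insert
        return person.ack();          // reached only on success
    } catch (Exception e) {
        // Without this branch an exception leaves the record neither acked
        // nor nacked, which is exactly when the throttled commit strategy
        // eventually gives up.
        return person.nack(e);
    }
}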
When this does not help, I switch the commit strategy to "latest":
mp.messaging.incoming.my-channel.commit-strategy=latest
This is a little slower because batched commits are disabled, but it runs stably in my case. If you don't know about commit strategies and the default, catch up with the good article by Escoffier:
I am aware that this does not solve the root cause, but it helped in desperate times. The problem has to be that one or more queued messages are not acknowledged in time, but I can't tell you why. Maybe the application logic is too slow, but like you, I have a hard time reproducing this locally. You can also try to increase the 60-second threshold with throttled.unprocessed-record-max-age.ms and see for yourself if this helps; in my case, it did not. Maybe someone else can share their insights on this problem and provide you with a real solution.
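For reference, raising that threshold is a per-channel setting; a sketch using the channel name from the question (120000 ms is just an example value, the attribute name is from the SmallRye Kafka connector docs):

mp.messaging.incoming.my-channel.throttled.unprocessed-record-max-age.ms=120000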
A small question regarding duplicate requests with Spring WebFlux + Sleuth + Zipkin server, please.
I have a server whose code is super simple:
#PostMapping("/question")
public Mono<String> question() {
LOGGER.info("This has been called!");
return someService.getResponse();
}
Every hour, I expect only one client that I know to call this endpoint only once.
Therefore, every hour, I do see this in my log:
INFO [myservice,c3a25fb0fb7426b7,c3a25fb0fb7426b7] 10 --- [or-http-epoll-3] c.my.Controller : This has been called!
So far so good.
The issue is that several times, I did see in my logs:
INFO [myservice,5278cfd673fddc60,1582c3da8d01adaa] 10 --- [or-http-epoll-2] c.my.Controller : This has been called!
INFO [myservice,5278cfd673fddc60,c8a85b0275b6bfdd] 10 --- [or-http-epoll-3] c.my.Controller : This has been called!
Very naturally, I assume the only client I know, instead of calling me once as expected, called me twice.
However, the logs on the client side show that only one outbound HTTP request was made.
May I ask: is seeing the same trace ID but different span IDs enough to prove, as hard evidence, that at least two requests were sent?
Can the [or-http-epoll-2] and [or-http-epoll-3] thread names help prove this as well?
With only the information written here, is it possible to prove anything regarding the duplicates?
Thank you
You can prove this by turning on access logs.
Having the same trace ID for two different log events does not prove anything; it can happen that:
The client called you twice
The client called you once but you created another span
The client called you and another client which also called you
You can enable access logs, which can prove this, or you can use a request/response logging library (like Logbook) that does this for you. I recommend simply enabling the access logs.
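For example, with the default Reactor Netty server, access logging can be switched on either with the -Dreactor.netty.http.server.accessLogEnabled=true system property or programmatically; a rough sketch, assuming Spring Boot 2.x with a recent reactor-netty on the classpath:

import org.springframework.boot.web.embedded.netty.NettyReactiveWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AccessLogConfig {

    // Enables Reactor Netty's access log so every inbound HTTP request is logged,
    // which lets you count how many requests actually reached the server.
    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> accessLogCustomizer() {
        return factory -> factory.addServerCustomizers(httpServer -> httpServer.accessLog(true));
    }
}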
A streaming application was rolled out in production, and right after 10 days we are observing errors/warnings in the CustomProductionExceptionHandler for expired transactions which belong to an older day window.
FLOW:
INPUT TOPIC --> STREAMING APPLICATION (produces stats and emits after the day window has closed) --> OUTPUT TOPIC
The producer continuously tries to publish records for an already-expired older window to the OUTPUT topic and logs an error in the CustomProductionExceptionHandler.
I have reduced the batch size and otherwise kept the defaults, but this change is not yet promoted to production.
CustomProductionExceptionHandler implementation: to avoid the stream dying due to NetworkException or TimeoutException.
With this implementation the producer does not retry, and in case of any exception it does CONTINUE. On the other hand, upon returning FAIL, the stream thread dies and does not auto-restart. Need suggestions.
public class CustomProductionExceptionHandler implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        String recordKey = new String(record.key());
        String recordVal = new String(record.value());
        String recordTopic = record.topic();
        logger.error("Kafka message marked as processed although it failed. Message: [{}:{}], destination topic: [{}]",
                recordKey, recordVal, recordTopic, exception);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}
Exception:
2019-12-20 16:31:37.576 ERROR com.jpmc.gpg.exception.CustomProductionExceptionHandler.handle(CustomProductionExceptionHandler.java:19) kafka-producer-network-thread | profile-day-summary-generator-291e69b1-5a3d-4d49-8797-252c2ae05607-StreamThread-19-producerid - Kafka message marked as processed although it failed. Message: [{"statistics":{}], destination topic: [OUTPUT-TOPIC]
org.apache.kafka.common.errors.TimeoutException: Expiring * record(s) for TOPIC:1086149 ms has passed since batch creation
I'm trying to get answers to the questions below.
1) Why is the producer trying to publish older transactions, for which the day window is already closed, to the OUTPUT topic?
Example: the producer is trying to send a 12/09 day-window transaction, but the currently open window is 12/20.
2) The stream threads could have died without the CustomProductionExceptionHandler returning ProductionExceptionHandlerResponse.CONTINUE. Is there any way the producer can retry in case of a NetworkException or TimeoutException and then continue, instead of the stream thread dying?
The problem with specifying ProductionExceptionHandlerResponse.CONTINUE in the CustomProductionExceptionHandler is that, in case of any exception, it skips publishing that record to the output topic and proceeds with the next records. No resiliency.
1) It's not really possible to answer this question without knowing what your program does. Note that, in general, Kafka Streams works on event-time and handles out-of-order data.
2) You can configure all internally used clients of a Kafka Streams application (i.e., consumer, producer, admin client, and restore consumer) by specifying the corresponding client configuration in the Properties you pass into KafkaStreams. If you want different configs for different clients, you can prefix them accordingly, e.g., producer.retries instead of retries. Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#ak-consumers-producer-and-admin-client-configuration-parameters
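For instance, producer retries and timeouts can be passed through the Streams Properties with the producer prefix; a sketch (the broker address and values are placeholders, not a recommendation):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-day-summary-generator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
// Prefixed settings reach only the internal producer:
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 1000);
// A larger delivery timeout gives transient NetworkException/TimeoutException
// situations more time to recover before records are expired:
props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 180000);
// new KafkaStreams(topology, props) as usual.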
I'm trying to execute an application under (reasonable) load. What is happening under load is that when trying to place a message onto a queue, the application stalls for about 4 seconds before completing the send. The strange part is that immediately after doing this, the next message takes a matter of milliseconds to place onto the queue. The message is in fact the same message - so the message size isn't a factor.
The application is using Spring Boot 2.1.6, Apache Qpid 0.43.0 as the JMS/AMQP provider.
The message bus being used is Azure ServiceBus, but I have observed the same behaviour using Artemis.
On the Apache Qpid JmsConnectionFactory, I've tried fiddling with the "forceSyncSend" property.
I've tried using the Spring Boot CachingConnectionFactory to cache message producers only. I have increased the default cache size from 1 to 20 without any success.
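For reference, the producer-caching setup I tried looks roughly like this (the broker URL is a placeholder):

JmsConnectionFactory qpidFactory = new JmsConnectionFactory("amqps://<namespace>.servicebus.windows.net");
CachingConnectionFactory cachingFactory = new CachingConnectionFactory(qpidFactory);
cachingFactory.setCacheProducers(true);   // cache MessageProducers
cachingFactory.setCacheConsumers(false);  // producers only
cachingFactory.setSessionCacheSize(20);   // raised from the default of 1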
I've looked at the JmsTemplate parameters but can't find any parameters in regard to message producers (plenty with listeners but that's another story).
The code doing the sending is quite simple:
private void sendToQueue(Object message, String queueName) {
    jmsTemplate.convertAndSend(queueName, message, (Message jmsMessage) -> {
        jmsMessage.setStringProperty(OBJECT_TYPE_PARAMETER, message.getClass().getSimpleName());
        return jmsMessage;
    });
}
Is there anything obvious to try? Are there any tuning parameters to stop this stalling from happening?
The load on the system is not trivial, but it is not excessive (it needs to go a lot higher than where it is at the moment!)
Any ideas?
I'm running an analysis of how long a CouchDB purge takes, using a Java program. The CouchDB connections and calls are handled using Ektorp. For a small number of documents, purging takes place and I receive a success response.
But when I purge ~10000 documents or more, I get the following error:
org.ektorp.DbAccessException: 500:Internal Server Error
URI: /dbname/_purge
Response Body:
{
"error" : "timeout",
"reason" : "{gen_server,call,
....
On checking the DB status using a curl command, the actual purging has taken place. But this timeout does not allow me to measure the actual duration of the purge in my Java program, since it throws an exception.
After some research, I believe this is due to the default timeout value of an Erlang gen_server process. Is there any way for me to fix this?
I have tried changing the timeout values of the StdHttpClient, to no avail:
HttpClient authenticatedHttpClient = new StdHttpClient.Builder()
        .url(url)
        .username(Conf.COUCH_USERNAME)
        .password(Conf.COUCH_PASSWORD)
        .connectionTimeout(600 * 1000)
        .socketTimeout(600 * 1000)
        .build();
CouchDB Dev here. You are not supposed to use purge with large numbers of documents. This is to remove accidentally added data from the DB, like credit card or social security numbers. This isn’t meant for general operations.
Consequently, you can’t raise that gen_server timeout :)