A small question about duplicate requests with a Spring WebFlux + Sleuth/Zipkin server, please.
I have a server whose code is super simple:
@PostMapping("/question")
public Mono<String> question() {
    LOGGER.info("This has been called!");
    return someService.getResponse();
}
Every hour, I expect only one client that I know to call this endpoint only once.
Therefore, every hour, I do see this in my log:
INFO [myservice,c3a25fb0fb7426b7,c3a25fb0fb7426b7] 10 --- [or-http-epoll-3] c.my.Controller : This has been called!
So far so good.
The issue is that several times, I did see in my logs:
INFO [myservice,5278cfd673fddc60,1582c3da8d01adaa] 10 --- [or-http-epoll-2] c.my.Controller : This has been called!
INFO [myservice,5278cfd673fddc60,c8a85b0275b6bfdd] 10 --- [or-http-epoll-3] c.my.Controller : This has been called!
Naturally, I assumed that the only client I know of called me twice instead of once as expected.
However, the logs on the client side show that only one outbound HTTP request was made.
May I ask: is seeing the same trace ID with different span IDs hard evidence that at least two requests were sent?
Do the thread names [or-http-epoll-2] and [or-http-epoll-3] help prove it as well?
With only the information written here, is it possible to prove anything regarding the duplicates?
Thank you
You can prove this by turning on access logs.
Having the same trace ID on two different log events does not prove anything by itself. Any of the following can happen:
The client called you twice
The client called you once, but you created another span
The client called you and also called another service, which in turn called you
Access logs can settle this; alternatively, use a request/response logging library (like Logbook) that does this for you. I recommend simply enabling the access logs.
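Before digging through access logs, it can help to list which trace IDs were logged with more than one span ID. The following is a minimal sketch in plain Java, assuming the log prefix has the "[service,traceId,spanId]" shape shown in the question; the class and method names are hypothetical. As explained above, a hit is only a candidate to cross-check against the access logs, not proof of a duplicate request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups span IDs by trace ID from Sleuth-style log lines.
final class TraceSpanGrouper {
    // Matches "[service,traceId,spanId]"; thread names such as "[or-http-epoll-3]"
    // contain no commas and are therefore not matched.
    private static final Pattern SLEUTH_PREFIX =
            Pattern.compile("\\[([^,\\]]+),([0-9a-f]+),([0-9a-f]+)\\]");

    /** Maps each trace ID to the distinct span IDs logged with it, in first-seen order. */
    static Map<String, Set<String>> spansByTrace(List<String> logLines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = SLEUTH_PREFIX.matcher(line);
            if (m.find()) {
                result.computeIfAbsent(m.group(2), k -> new LinkedHashSet<>()).add(m.group(3));
            }
        }
        return result;
    }

    /** Returns the trace IDs that appear with more than one span ID. */
    static List<String> tracesWithSeveralSpans(List<String> logLines) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : spansByTrace(logLines).entrySet()) {
            if (entry.getValue().size() > 1) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}
```

Fed the three log lines from the question, this flags only trace 5278cfd673fddc60; whether that trace corresponds to one or two HTTP requests still has to be read from the access logs.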
In a Quarkus process, we perform the following steps once a message is polled from Kafka:
Thread.sleep(30000) - Due to business logic
call a 3rd party API
call another 3rd party api
Inserting data in db
About once a day, the process hangs after throwing TooManyMessagesWithoutAckException.
2022-12-02 20:02:50 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Going to sleep for 30 sec.....
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18231: The record 17632 from topic-partition '<partition>' has waited for 60 seconds to be acknowledged. This waiting time is greater than the configured threshold (60000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
2022-12-02 20:03:20 WARN [ kafka] : SRMSG18228: A failure has been reported for Kafka topics '[<topic name>]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: The record 17632 from topic/partition '<partition>' has waited for 60 seconds to be acknowledged. At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 17631.
2022-12-02 20:03:20 INFO [2bdf7fc8-e0ad-4bcb-87b8-c577eb506b38, ] : Sleep over!
Below is an example of how we consume the messages:
@Incoming("my-channel")
@Blocking
CompletionStage<Void> consume(Message<Person> person) {
    String msgKey = (String) person
            .getMetadata(IncomingKafkaRecordMetadata.class).get()
            .getKey();
    // ...
    return person.ack();
}
According to the logs, only 30 seconds had passed since the event was polled, yet the exception reports that the Kafka acknowledgement was not sent for 60 seconds.
I checked the whole day's logs from when the error was thrown to see whether the REST API calls took more than 30 seconds to fetch the data, but I couldn't find any.
We haven't set any specific Kafka configuration other than topic name, channel name, serializer, deserializer, group id and managed Kafka connection details.
There are 4 partitions in this topic with a replication factor of 3. There are 3 pods running for this process.
We're unable to reproduce this issue in the Dev and UAT environments.
I checked the configuration options in the Quarkus Kafka Reference but couldn't find any that might help.
mp:
  messaging:
    incoming:
      my-channel:
        topic: <topic>
        group:
          id: <group id>
        connector: smallrye-kafka
        value:
          serializer: org.apache.kafka.common.serialization.StringSerializer
          deserializer: org.apache.kafka.common.serialization.StringDeserializer
Is it possible that Quarkus is acknowledging the messages in batches, and that by then the waiting time has already reached the threshold?
Please comment if there are any other possibilities for this issue.
I have similar issues in our production environment, running different Quarkus services against a simple 3-node Kafka cluster. I researched the problem a lot, with no clear answer. At the moment, I have two approaches to this problem:
Make sure you really ack or nack the Kafka message in your code. Is every exception really caught and answered with a "person.nack(exception);" (or a "person.ack();", depending on your failure strategy)? Make sure it is. The throttled exception is thrown if neither ack() nor nack() is performed. The problem mostly occurs when nothing happens at all.
When this does not help, I switch the commit strategy to "latest":
mp.messaging.incoming.my-channel.commit-strategy=latest
This is a little slower because the batch commit is disabled, but it runs stably in my case. If you don't know about commit strategies and the default, catch up with the good article by Escoffier.
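The first approach above (every code path ends in either an ack or a nack) can be sketched as follows. To keep the example self-contained, AckableMessage is a hypothetical stand-in that mirrors only the ack() and nack(Throwable) methods of MicroProfile's Message; Handler and AlwaysSettle are likewise illustrative names, not part of any library.

```java
import java.util.concurrent.CompletionStage;

// Hypothetical stand-in for the parts of MicroProfile's Message used here.
interface AckableMessage<T> {
    T getPayload();
    CompletionStage<Void> ack();
    CompletionStage<Void> nack(Throwable reason);
}

final class AlwaysSettle {
    /** Hypothetical hook for the business logic (sleep, API calls, DB insert). */
    interface Handler<T> {
        void handle(T payload) throws Exception;
    }

    /** Runs the handler and settles the message on every path: ack on success, nack on failure. */
    static <T> CompletionStage<Void> consume(AckableMessage<T> message, Handler<T> handler) {
        try {
            handler.handle(message.getPayload());
            return message.ack();
        } catch (Exception e) {
            // Without this branch, a failing record would never be settled
            // and would eventually trip the unacknowledged-record check.
            return message.nack(e);
        }
    }
}
```

In a real consumer, the same try/catch shape would wrap the body of the @Incoming method and call person.ack() or person.nack(e) directly.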
I am aware that this does not solve the root cause, but it helped in desperate times. The problem has to be that one or more queued messages are not acknowledged in time, but I can't tell you why. Maybe the application logic is too slow, but like you I have a hard time reproducing this locally. You can also try increasing the 60-second threshold with throttled.unprocessed-record-max-age.ms and see for yourself whether this helps. In my case, it did not. Maybe someone else can share their insights on this problem and provide you with a real solution.
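One possible way a queued message can exceed the threshold even though each message only takes about 30 seconds is sketched below as plain arithmetic. It rests on assumptions the logs do not confirm: that a record's waiting time is counted from when the connector received it rather than from when the handler started on it, that two records were received at about the same time, and that they are processed one after another. The class name and the 1 second of extra work per record are hypothetical.

```java
// Hypothetical model: records received together, processed strictly one after another.
final class QueuedRecordAge {
    /**
     * Age in seconds of the record at the given queue position (0 = processed first)
     * at the moment its own processing finishes.
     */
    static long ageAtAckSeconds(int position, long processingSeconds) {
        return (position + 1L) * processingSeconds;
    }

    /** True if that age is greater than the configured threshold. */
    static boolean exceedsThreshold(int position, long processingSeconds, long thresholdSeconds) {
        return ageAtAckSeconds(position, processingSeconds) > thresholdSeconds;
    }
}
```

With a 30-second sleep plus an assumed 1 second of API and database work, the first record is acknowledged at an age of 31 seconds, while the second one reaches 62 seconds and crosses a 60-second threshold although its own handler only ran for 31 seconds.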
Here I wanted to register 2 endpoints and send requests to them, as you can see in the code below. I named one env1 and the other env2.
val client = Http.client
  .configured(Transport.Options(noDelay = false, reuseAddr = false))
  .newService("gexampleapi-env1.localhost.net:8081,gexampleapi-env2.localhost.net:8081")
So far everything is normal. But the env1 instance had to go down for some reason (a few hours of maintenance or similar; I'm not sure why). Under normal circumstances, we expect the client to keep sending requests through the env2 instance. But this didn't happen: requests could not be sent to either server. Normally it works correctly, but that day it didn't, for a reason we don't know.
Since the event took place months ago, I only have the following log.
2022-02-15 12:09:40,181 [finagle/netty4-1-3] INFO com.twitter.finagle
FailureAccrualFactory marking connection to "gExampleAPI" as dead.
Remote Address:
Inet(gexampleapi-env1.localhost.net/10.0.0.1:8081,Map())
To solve the problem, we removed the gexampleapi-env1.localhost.net:8081 host from the config file, and after a restart it continued to process requests. If you have any ideas about why we may have experienced this problem and how to avoid it next time, I would appreciate it if you could share them.
I have a problem while trying my hands on the Hello World example explained here.
Kindly note that I have only modified the HelloEntity.java file to be able to return something other than "Hello, World!". Most likely my changes take some time to run, and hence I am getting the timeout error below.
I am currently doing a PoC on a single node to understand the Lagom framework and do not have the liberty to deploy multiple nodes.
I have also tried changing the default lagom.circuit-breaker setting in application.conf to "call-timeout = 100s"; however, this does not seem to have helped.
Following is the exact error message for your reference:
{"name":"akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#1074448247]] after [5000 ms]. Sender[null] sent message of type \"com.lightbend.lagom.javadsl.persistence.CommandEnvelope\".","detail":"akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#1074448247]] after [5000 ms]. Sender[null] sent message of type \"com.lightbend.lagom.javadsl.persistence.CommandEnvelope\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:866)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:864)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
Question: Is there a way to increase the Akka timeout by modifying application.conf or any of the Java source files in the Hello World project? Can you please help me with the exact details?
Thanks in advance for your time and help.
The call timeout is the timeout for circuit breakers, which is configured using lagom.circuit-breaker.default.call-timeout. But that's not what is timing out above. What is timing out is the request to your HelloEntity, and that timeout is configured using lagom.persistence.ask-timeout. Requests to entities have a timeout because, in a multi-node environment, your entities are sharded across nodes, so an ask on them may go to another node, and a timeout is needed in case that node is not responding.
All that said, I don't think changing the ask-timeout will solve your problem. If you have a single node, then your entities should respond instantly if everything is working ok.
Is that the only error you're seeing in the logs?
Are you seeing this in devmode (ie, using the runAll command), or are you running the Lagom service some other way?
Is your database responding?
Thanks James for the help/pointer.
Adding following lines to resources/application.conf did the trick for me:
lagom.persistence.ask-timeout=30s
hello {
..
..
call-timeout = 30s
call-timeout = ${?CIRCUIT_BREAKER_CALL_TIMEOUT}
..
}
A call is service-to-service communication: a ServiceClient communicating with a remote server. It uses a circuit breaker. It is an extra-service call.
An ask (in the context of lagom.persistence) sends a command to a persistent entity. That happens across the nodes inside your Lagom service. It does not use circuit breaking. It is an intra-service call.
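Why an ask carries its own timeout can be illustrated in plain Java, without Akka or Lagom. This is only an analogy under stated assumptions: the pending reply is modelled as a CompletableFuture, the class and method names are hypothetical, and CompletableFuture.orTimeout requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical analogy for an ask: a reply that may never arrive gets a deadline.
final class AskWithTimeout {
    /**
     * Returns the pending reply, which now fails with a TimeoutException
     * if nothing completes it within the given time.
     */
    static <T> CompletableFuture<T> ask(CompletableFuture<T> pendingReply, long timeout, TimeUnit unit) {
        return pendingReply.orTimeout(timeout, unit);
    }
}
```

If the node that hosts the entity never answers, the caller is released after the timeout instead of waiting forever, which is the role lagom.persistence.ask-timeout plays in the error above.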
I am trying to demonstrate backpressure using spring-web-reactive, the way it is shown here with Akka: https://www.youtube.com/watch?v=oS9w3VenDW0
(Watch between 28:20 and 29:20).
To try it out, I used the sample project below from GitHub: https://github.com/bclozel/spring-boot-web-reactive
After setting up the project, I added a new endpoint in HomeController.java as shown below:
@RequestMapping(value = "/longflux", produces = "application/stream+json")
public Flux<Long> longFlux() {
    return Flux.interval(Duration.ofMillis(10)).log();
}
Now, if I curl this endpoint and then suspend curl (CTRL+Z), backpressure should kick in as soon as the TCP buffers are full, and the server should stop emitting events.
However, some time after suspending the curl command, the exception below is thrown:
2017-02-16 08:49:48.480 ERROR 3500 --- [ timer-1] reactor.Flux.Interval.4 : onError(reactor.core.Exceptions$OverflowException: Could not emit value 2578 due to lack of requests)
2017-02-16 08:49:48.481 ERROR 3500 --- [ timer-1] reactor.Flux.Interval.4 :
reactor.core.Exceptions$OverflowException: Could not emit value 2578 due to lack of requests
at reactor.core.Exceptions.failWithOverflow(Exceptions.java:151) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at reactor.core.publisher.FluxInterval$IntervalRunnable.run(FluxInterval.java:98) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at reactor.core.scheduler.SingleTimedScheduler$TimedPeriodicScheduledRunnable.run(SingleTimedScheduler.java:394) ~[reactor-core-3.0.4.RELEASE.jar:3.0.4.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
I don't understand why, in the spring-web-reactive implementation, the request terminated with an exception some time after the curl command was suspended, whereas in the Akka example (as demonstrated in the YouTube link) the server stopped publishing events once the TCP buffer was full.
Flux.interval is a special case: it's a hot source, and time is not buffered by Reactor. This means that if your request cycle is slow due to backpressure and your interval source is producing faster, Reactor will emit an error signal.
You can update this sample with an .onBackpressureDrop() operator to drop interval ticks under backpressure. This should behave as expected.
There are many ways to illustrate backpressure, including:
delaying the subscription with a delay operator
simulating multiple slow clients (bandwidth and latency)
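The difference between the overflow error and the drop behaviour described above can be modelled in a few lines of plain Java. This is a simplified model, not Reactor's implementation: IntervalModel and its members are hypothetical names, and the timer is driven by hand through tick(). In the sample endpoint itself, the change would presumably read Flux.interval(Duration.ofMillis(10)).onBackpressureDrop().log().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a timer-driven source with subscriber demand.
final class IntervalModel {
    enum OnOverflow { ERROR, DROP }

    private final OnOverflow strategy;
    private final List<Long> delivered = new ArrayList<>();
    private long demand;    // outstanding requests from the subscriber
    private long nextValue; // tick counter: 0, 1, 2, ...
    private long dropped;

    IntervalModel(OnOverflow strategy) {
        this.strategy = strategy;
    }

    /** The subscriber signals that it can accept n more values. */
    void request(long n) {
        demand += n;
    }

    /** The timer fires; time advances whether or not the subscriber is ready. */
    void tick() {
        long value = nextValue++;
        if (demand > 0) {
            demand--;
            delivered.add(value);
        } else if (strategy == OnOverflow.DROP) {
            dropped++; // the tick is discarded, the stream stays alive
        } else {
            throw new IllegalStateException(
                    "Could not emit value " + value + " due to lack of requests");
        }
    }

    List<Long> delivered() {
        return delivered;
    }

    long dropped() {
        return dropped;
    }
}
```

With ERROR, the first tick that arrives without demand terminates the stream, which matches the OverflowException in the question; with DROP, ticks that arrive while the client is suspended are simply skipped and delivery resumes once demand returns.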
I'm developing a project using Grizzly 2.3.22 with its WebSocket support. Everything was OK until an OOM happened. Looking through the dump, I found that all the memory was eaten up by a single org.glassfish.grizzly.nio.transport.TCPNIOConnection holding a huge (1.5 GB) write queue. I guess one of the client developers was debugging their connected application and stopped on a breakpoint for a long time. Anyway, this can easily happen if a client has a very slow connection, and my server should be ready for that.
In the Grizzly documentation I found the maxPendingBytes property, which seems like a solution, at least for now. But I cannot get it to work at all. I set the log level to ALL for AbstractNIOAsyncQueueWriter, connect with the client, put it on hold and observe how the server's queue grows like this:
TRACE 2016-07-05 21:02:26.330 [nioEventLoopGroup-2-1] o.g.g.n.AbstractNIOAsyncQueueWriter - AsyncQueueWriter.write connection=TCPNIOConnection{localSocketAddress={/127.0.0.1:8445}, peerSocketAddress={/127.0.0.1:56185}}, record=org.glassfish.grizzly.asyncqueue.AsyncWriteQueueRecord@1e35bafb, directWrite=false, size=165, isUncountable=false, bytesToReserve=165, pendingBytes=16170
TRACE 2016-07-05 21:02:26.368 [nioEventLoopGroup-2-1] o.g.g.n.AbstractNIOAsyncQueueWriter - AsyncQueueWriter.write connection=TCPNIOConnection{localSocketAddress={/127.0.0.1:8445}, peerSocketAddress={/127.0.0.1:56185}}, record=org.glassfish.grizzly.asyncqueue.AsyncWriteQueueRecord@3d6e05dd, directWrite=false, size=165, isUncountable=false, bytesToReserve=165, pendingBytes=16335
...
When I set maxPendingBytes=10000 I expect an exception thrown when the pendingBytes from the log above becomes larger than 10000, but it doesn't happen.
Moreover, I tried debugging the server with Grizzly's source code and found that, while the property's value does get assigned to the NIOConnection.maxAsyncWriteQueueSize field, the AbstractNIOAsyncQueueWriter.canWrite(...) method (the only place where the field seems to be used) is never called.
I'm at a loss. Am I missing something here?