Is it possible to monitor the heartbeat events that Apache Storm sends to Nimbus?
I'm currently encountering an issue where my workers get seemingly randomly reassigned, although we don't see any obvious spikes in CPU, RAM, I/O or network usage across the cluster that might indicate a bottleneck. The only way I was able to observe this was by checking the supervisor.log file (Shutting down and clearing id xxx. Current supervisor time: 123. State: :disallowed, Heartbeat: { ... }) and by noticing misbehavior in the application's results. There are no errors in the Storm UI and no stack traces in the worker logs (such as out-of-memory errors or anything else).
Running Storm 0.10 on a small 4-node cluster with ~12 workers and ~650 executors. The maximum JVM heap (worker childopts) is set to 4096 MiB, which should not cause any odd GC behavior.
Task heartbeats are in fact written to ZooKeeper along with the tasks' built-in metrics, and Nimbus reads them back from ZooKeeper.
When there are too many tasks in the Storm cluster, this can generate heavy write traffic to ZooKeeper, overloading it and causing heartbeats to be written or read late. So you need to monitor ZooKeeper and take appropriate action when you find that it is the bottleneck.
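If it helps, heartbeat lag usually shows up in ZooKeeper's own latency and request-backlog counters. Below is a minimal Java sketch (hostname and port are placeholders) that polls ZooKeeper's standard mntr four-letter command and prints the relevant counters; note that newer ZooKeeper versions require the command to be whitelisted via 4lw.commands.whitelist.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkHeartbeatMonitor {
    public static void main(String[] args) throws Exception {
        // "mntr" is a standard ZooKeeper four-letter admin command.
        try (Socket socket = new Socket("zk-host", 2181)) {          // placeholder host
            OutputStream out = socket.getOutputStream();
            out.write("mntr".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null) {
                // These counters indicate whether ZK is falling behind on heartbeat writes/reads.
                if (line.startsWith("zk_avg_latency")
                        || line.startsWith("zk_max_latency")
                        || line.startsWith("zk_outstanding_requests")) {
                    System.out.println(line);
                }
            }
        }
    }
}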
Related
I have read many articles about designing and setting up Storm topologies, but I still don't have clarity on the following.
In my project, I am going to process more than a million records, so I planned to create topologies dynamically based on internal modules; the count might reach more than a thousand. My questions are: what is the best way to manage topologies? How many topologies can be created in a single cluster? Are there any problems with maintaining multiple topologies?
I would say that this really depends on the machines in your cluster, so it is hard to answer in general; this is especially true if the cluster has heterogeneous instances.
Basically, Storm can handle many topologies, which you can control via the CLI or the UI.
I am currently managing them with the storm list and storm kill commands. The limits should be the RAM, storage and network connections of the individual machines. To be precise, I would predict that the bottleneck is the JVM memory available to a supervisor instance: it can host multiple workers (each running components such as bolts and spouts, and each initially configured with a 256 MB JVM), but if there are too many workers, the overall JVM memory consumed per supervisor will be exceeded.
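For reference, the per-worker heap can be capped when submitting a topology, so the combined worker JVMs stay within what each supervisor host can actually hold. A minimal sketch against the Storm 0.10-era API (topology name, worker count and heap size are placeholders):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... register spouts and bolts here ...

        Config conf = new Config();
        conf.setNumWorkers(4);                                    // workers requested from the supervisors
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx512m");   // per-worker JVM heap

        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}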
Our Scala application (a Kubernetes deployment) constantly experiences Akka Cluster heartbeat delays of ≈3 s.
Once we even had a 200s delay which also manifested itself in the following graph:
Can someone suggest things to investigate further?
Specs
Kubernetes 1.12.5
requests.cpu = 16
# limits.cpu not set
Scala 2.12.7
Java 11.0.4+11
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+AlwaysPreTouch
-Xlog:gc*,safepoint,gc+ergo*=trace,gc+age=trace:file=/data/gc.log:time,level,tags:filecount=4,filesize=256M
-XX:+PerfDisableSharedMem
Akka Cluster 2.5.25
Java Flight Recording
Some examples:
timestamp delay_ms
06:24:55.743 2693
06:30:01.424 3390
07:31:07.495 2487
07:36:12.775 3758
There were four suspicious points in time where lots of Java Thread Park events were registered simultaneously for Akka threads (actors and remoting), and all of them correlate with the heartbeat issues:
Around 07:05:39 there were no "heartbeat was delayed" log entries, but there was this one:
07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
No correlation with halt events or blocked threads was found during the Java Flight Recording session, only two Safepoint Begin events in proximity to the delays:
CFS throttling
The application's CPU usage is low, so we thought it could be related to how K8s schedules our application's node for CPU. But turning off CPU limits hasn't improved things much, though the kubernetes.cpu.cfs.throttled.second metric disappeared.
Separate dispatcher
Using a separate dispatcher seems to be unnecessary, since the delays happen even when there is no load. We also built a dedicated test application, similar to our own, which does nothing but heartbeats, and it still experiences these delays.
K8s cluster
From our observations, this happens far more frequently on a couple of K8s nodes in a large K8s cluster shared with many other apps, when our application isn't loaded much. A separate, dedicated K8s cluster where our app is load tested has almost no issues with heartbeat delays.
Have you been able to rule out garbage collection? In my experience, that's the most common cause of delayed heartbeats in JVM distributed systems (and the CFS quota in a Kubernetes/Mesos environment can make non-stop-the-world GCs effectively stop-the-world, especially if you're not using a fairly recent build of OpenJDK, i.e. later than release 212 of JDK 8).
Every thread parking before "Safepoint begin" does lead me to believe that GC is in fact the culprit. Certain GC operations (e.g. rearranging the heap) require every thread to be in a safepoint, so every so often when not blocked, threads will check if the JVM wants them to safepoint; if so the threads park themselves in order to get to a safepoint.
If you've ruled out GC, are you running in a cloud environment (or on VMs where you can't be sure that the CPU or network aren't oversubscribed)? The akka-cluster documentation suggests increasing the akka.cluster.failure-detector.threshold value, which defaults to a value suitable for a more controlled LAN/bare-metal environment: 12.0 is recommended for cloud environments. This won't prevent delayed heartbeats, but it will decrease the chances of a spurious downing event because of a single long heartbeat (and also delay responses to genuine node loss events). If you want to tolerate a spike in heartbeat inter-arrival times from 1s to 200s, though, you'll need a really high threshold.
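If tuning the detector is the route you take, here is a minimal Java sketch of what the override could look like (the same values can simply go into application.conf; the acceptable-heartbeat-pause value is only an illustration and should be tuned to your cluster):

import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ClusterBoot {
    public static void main(String[] args) {
        // Loosen the phi-accrual failure detector for a shared/cloud environment.
        Config overrides = ConfigFactory.parseString(
                "akka.cluster.failure-detector.threshold = 12.0\n"
              + "akka.cluster.failure-detector.acceptable-heartbeat-pause = 5s");
        ActorSystem system = ActorSystem.create("app", overrides.withFallback(ConfigFactory.load()));
        System.out.println("Started " + system.name());
    }
}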
We are starting to work with Kafka Streams; our service is a very simple stateless consumer.
We have tight requirements on latency, and we are facing too high latency problems when the consumer group is rebalancing. In our scenario, rebalancing will happen relatively often: rolling updates of code, scaling up/down the service, containers being shuffled by the cluster scheduler, containers dying, hardware failing.
One of the first tests we did was having a small consumer group with 4 consumers handling a small volume of messages (1K/sec) and killing one of them; the cluster manager (currently AWS ECS, probably soon moving to K8s) starts a new one, so more than one rebalance happens.
Our most critical metric is latency, which we measure as the milliseconds between message creation in the publisher and message consumption in the subscriber. We saw the maximum latency spiking from a few milliseconds, to almost 15 seconds.
We have also done tests with rolling code updates, and the results are worse, since our deployment is not prepared for Kafka services and we trigger a lot of rebalances. We'll need to work on that, but we are wondering what strategies other people follow for code deployment / autoscaling with the minimum possible delays.
Not sure if it helps, but our requirements regarding message processing are pretty relaxed: we don't care about some messages being processed twice from time to time, nor are we very strict about message ordering.
We are using all default configurations, no tuning.
We need to improve these latency spikes during rebalancing.
Can someone please give us some hints on how to work on this? Is tuning configurations enough? Do we need to use a specific partition assignor, or implement our own?
What is the recommended approach to code deployment / autoscaling with the minimum possible delays?
Our Kafka version is 1.1.0; after looking at the libs (we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar), it appears we installed Confluent Platform 4.1.0.
On the consumer side, we are using Kafka Streams 2.1.0.
Thank you for reading my question and your responses.
If the gap is introduced mainly by the rebalance itself, one option is to not trigger a rebalance at all: just let AWS / K8s do their work, resume the bounced instance, and pay for the period of unavailability during the bounce. Note that for stateless instances this is usually better, while for stateful applications you'd better make sure the restarted instance can access its associated storage, so that it can save on bootstrapping from the changelog.
To do that:
In Kafka 1.1, to reduce unnecessary rebalances you can increase the session timeout of the group so that the coordinator becomes "less sensitive" to members not responding with heartbeats. Note that we disabled the leave-group request for Streams' consumers back in 0.11.0 (https://issues.apache.org/jira/browse/KAFKA-4881), so with a longer session timeout a member leaving the group will not trigger a rebalance, though a member rejoining will still trigger one. Still, one fewer rebalance is better than none.
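A minimal sketch of how that could look in the Streams configuration (application id, broker address and timeout values are placeholders to adapt):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    static Properties rebalanceTolerantProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateless-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // A bounced instance now has up to 30s to come back before a rebalance fires.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "30000");
        // Keep heartbeats well below the session timeout (roughly a third is the usual rule).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), "10000");
        return props;
    }
}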
In the upcoming Kafka 2.2, though, we've made a big improvement to rebalance scenarios, primarily captured in KIP-345 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances). With that, far fewer rebalances will be triggered by a rolling bounce, given the reasonable config settings introduced in KIP-345. So I'd strongly recommend you upgrade to 2.2 and see if it helps your case.
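For completeness, once you are on a broker and client release that ships the KIP-345 static membership config, the key piece is a stable group.instance.id per instance. A small sketch (the id, derived e.g. from a pod or host name, is an assumption for illustration):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StaticMembership {
    // A restarted instance that rejoins with the same group.instance.id within the
    // session timeout does not trigger a rebalance.
    static Properties withStaticMembership(Properties props, String stableInstanceId) {
        props.put(StreamsConfig.consumerPrefix("group.instance.id"), stableInstanceId);
        return props;
    }
}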
There are several configuration changes required in order to significantly decrease rebalance latency, especially during a deployment rollout.
1. Keep the latest version of Kafka-Streams
Kafka-Streams rebalance performance gets better and better over time. A feature improvement worth highlighting is the incremental cooperative rebalancing protocol. Kafka-Streams has this feature out of the box (since version 2.4.0, with some improvements in 2.6.0), with the default partition assignor StreamsPartitionAssignor.
2. Add the Kafka-Streams configuration property internal.leave.group.on.close = true so that a consumer leave-group request is sent on app shutdown
By default, Kafka-Streams doesn't send a consumer leave-group request on graceful app shutdown. As a result, messages from some partitions (those that were assigned to the terminating app instance) will not be processed until the session of that consumer expires (after session.timeout.ms), and only after this expiration is a new rebalance triggered. In order to change this default behavior, we should use the internal Kafka Streams config property internal.leave.group.on.close = true (this property should be added when creating the Kafka Streams instance via new KafkaStreams(streamTopology, properties)). As the property is private, be careful and double-check that the config still exists before upgrading to a new version.
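A minimal sketch of wiring this in (whether the bare key or the consumer prefix is required can vary by Streams version, so treat this as an assumption to verify):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class LeaveGroupOnClose {
    static KafkaStreams build(Topology streamTopology, Properties properties) {
        // Internal, unsupported flag: ask the Streams consumers to send a LeaveGroup
        // request on close() so the rebalance happens immediately instead of after
        // session.timeout.ms expires.
        properties.put(StreamsConfig.consumerPrefix("internal.leave.group.on.close"), "true");
        return new KafkaStreams(streamTopology, properties);
    }
}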
3. Decrease the number of simultaneously restarted app instances during a deployment rollout
Using Kubernetes, we can control how many app instances are replaced with new ones at the same time. This is achievable using the max surge and max unavailable properties. If we have tens of app instances, the default configuration will roll out multiple new instances while multiple old instances are terminating at the same time. That means multiple partitions will require reassignment to other app instances, multiple rebalances will be fired, and this leads to significant rebalance latency. The preferable configuration for decreasing rebalance duration is max surge = 1 and max unavailable = 0.
4. Increase the number of topic partitions and app instances with a slight excess
Having a higher number of partitions leads to lower throughput per single partition. Also, with a higher number of app instances, a restart of a single one leads to a smaller Kafka lag during rebalancing. Make sure that you don't have frequent up-scaling and down-scaling of app instances (as it triggers rebalances); if you see several scale-ups and scale-downs per hour, the minimum number of instances is probably set too low and you should increase it.
For more details please take a look at the article Kafka-Streams - Tips on How to Decrease Re-Balancing Impact for Real-Time Event Processing On Highly Loaded Topics
I have built an app which starts multiple RabbitMQ consumers. When I start the app in debug mode in Eclipse, I can see the desired number of threads spawned, as can be seen in the Debug window:
The app deals with several RabbitMQ queues plus some seda queues. The app continues executing by processing and moving messages from one queue to another.
There are at least 7 routes starting from a RabbitMQ consumer. These routes look roughly like this:
from("rabbitmq://url")
.process(Processor1.class)
.process(Processor2.class)
There is one specific start queue. Depending on the messages published, the messages flow through different sequences of queues, so I was testing different sequence flows by publishing different messages to the start queue. After testing a few flows this way, I realized that the app had spawned many new threads. And even after a sequence flow finishes (that is, the message leaves the final queue and the final processor in the Camel route has executed completely), the thread is left behind in the running state. I found that many such threads had accumulated after I tested multiple flows. Five such threads can be seen in the screenshot below.
The above are just five extra threads, but this count rises quickly as I test more complex flows; I have seen it reach 44 threads. So I am wondering what I am doing wrong. Do I have to explicitly stop the route threads in some way? Did I miss or forget some configuration that I must set on the Camel route? Why is this happening? Is it normal?
PS: My machine is extremely low on RAM, just 4 GB. It runs two lightweight DB servers, two web apps, Eclipse and my main (above) app. Most of the time, 3.7 GB is in use. Sometimes it takes a while for a breakpoint (inside a Camel processor) to be hit when I publish a message to the queue. Can such a machine be the reason for the stray threads being left behind? (Though I primarily think it's me missing some setting on the routes.)
I'm writing a Netty application. The application is running on a 64-bit, eight-core Linux box.
The Netty application is a simple router that accepts requests (incoming pipeline) reads some metadata from the request and forwards the data to a remote service (outgoing pipeline).
This remote service will return one or more responses to the outgoing pipeline. The Netty application will route the responses back to the originating client (the incoming pipeline)
There will be thousands of clients. There will be thousands of remote services.
I'm doing some small-scale testing (ten clients, ten remote services) and I don't see the sub-10-millisecond performance I'm expecting at the 99.9th percentile. I'm measuring latency from both the client side and the server side.
I'm using a fully async protocol that is similar to SPDY. I capture the time (I just use System.nanoTime()) when we process the first byte in the FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond time (99.9 percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
I also measured the time from the first byte in the FrameDecoder to when a ChannelFutureListener callback was invoked on the (above) message.write(). The time was in the high tens of milliseconds (99.9th percentile), but I had trouble convincing myself that this was useful data.
My initial thought was that we had some slow clients. I watched channel.isWritable() and logged when it returned false; it did not return false under normal conditions.
Some facts:
We are using the NIO factories. We have not customized the worker size
We have disabled Nagle's algorithm (tcpNoDelay=true)
We have enabled keep alive (keepAlive=true)
CPU is idle 90+% of the time
Network is idle
The GC (CMS) is being invoked every 100 seconds or so for a very short amount of time
Is there a debugging technique that I could follow to determine why my Netty application is not running as fast as I believe it should?
It feels like channel.write() adds the message to a queue and we (application developers using Netty) don't have transparency into this queue. I don't know whether the queue is a Netty queue, an OS queue, a network card queue or something else. In any case, I'm reviewing examples of existing applications and I don't see any anti-patterns I'm following.
Thanks for any help/insight
Netty creates Runtime.getRuntime().availableProcessors() * 2 worker threads by default, so 16 in your case. That means you can handle up to 16 channels simultaneously; other channels will wait until you return from the ChannelUpstreamHandler.handleUpstream / SimpleChannelHandler.messageReceived handlers. So don't do heavy operations in these (I/O) threads, otherwise you can stall the other channels.
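A minimal sketch of both points against the Netty 3 API (the executor sizes, port and placeholder handler are assumptions): pass an explicit worker count to the channel factory, and push application work onto an ExecutionHandler so the I/O threads stay responsive.

import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
import org.jboss.netty.handler.execution.ExecutionHandler;
import org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor;

public class RouterServer {
    public static void main(String[] args) {
        // Explicit worker count instead of the default 2 * availableProcessors().
        ServerBootstrap bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(),   // boss threads
                        Executors.newCachedThreadPool(),   // worker (I/O) threads
                        16));                              // worker count

        // Runs business handlers on a separate pool so I/O threads are never blocked.
        final ExecutionHandler executionHandler = new ExecutionHandler(
                new OrderedMemoryAwareThreadPoolExecutor(32, 1048576, 1048576));

        bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
            public ChannelPipeline getPipeline() {
                ChannelPipeline pipeline = Channels.pipeline();
                pipeline.addLast("executor", executionHandler);
                // Placeholder for the actual routing/decoding handlers.
                pipeline.addLast("handler", new SimpleChannelUpstreamHandler());
                return pipeline;
            }
        });

        bootstrap.bind(new InetSocketAddress(8080));
    }
}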
You haven't specified your Netty version, but it sounds like Netty 3.
Netty 4 is now stable, and I would advise that you update to it as soon as possible.
You have specified that you want ultra-low latency, as well as tens of thousands of clients and services. This doesn't really mix well. NIO is inherently somewhat more latent than OIO. However, the pitfall here is that OIO probably won't be able to reach the number of clients you are hoping for. Nonetheless, I would try an OIO event loop / factory and see how it goes.
I myself have a TCP server which takes around 30 ms on localhost to send, receive and process a few TCP packets (measured from the time the client opens a socket until the server closes it). If you really do require such low latencies, I suggest you switch away from TCP because of the SYN/ACK round trip required to open each connection; that alone is going to use a large part of your 10 ms.
Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:
Thread A is woken up and begins processing the incoming request.
Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
Thread B is done and performed perfectly fast.
Thread A resumes and finishes, but took twice as long as Thread B. Because you actually measured the time it took to finish for Thread A + Thread B.
There are two approaches on how to measure correctly in this case:
You can enforce that only one thread is used at all times.
This allows you to measure the exact performance of the operation, provided the OS does not interfere, because in the above example Thread B could just as well be outside of your program. A common approach in this case is to take the median over many runs to average out the interference, which gives you an estimate of the speed of your code (see the small sketch after these two approaches). You can, however, assume that on an otherwise idle multi-core system there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to a high priority helps as well.
You can use a more sophisticated tool that plugs into the JVM to measure the atomic executions and the time they took, which removes outside interference almost completely. One such tool is VisualVM, which is already integrated into NetBeans and available as a plugin for Eclipse.
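As a small sketch of the first approach (doWork() stands in for the operation under test), timing the same operation repeatedly on one thread and taking the median keeps an occasional OS deschedule from dominating the result:

import java.util.Arrays;

public class MedianTiming {
    static long medianNanos(Runnable doWork, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            doWork.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];   // median of the measured runs
    }
}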
As a general advice: it is not a good idea to use more threads than cores, unless you know that those threads will be blocked by some operation frequently. This is not the case when using non-blocking NIO for IO-operations as there is no blocking.
Therefore, in your specific case you would actually reduce the performance for clients, as explained above, because communication would be put on hold for up to 50% of the time under high load. In the worst case, that could even cause a client to run into a timeout, as there is no guarantee of when a thread is actually resumed (unless you explicitly request fair scheduling).