Kafka Streams rebalancing latency spikes on high throughput kafka-streams services

We are starting to work with Kafka Streams; our service is a very simple stateless consumer.
We have tight latency requirements, and latency becomes too high while the consumer group is rebalancing. In our scenario, rebalancing will happen relatively often: rolling updates of code, scaling the service up/down, containers being shuffled by the cluster scheduler, containers dying, hardware failing.
One of the first tests we ran used a small consumer group of 4 consumers handling a small volume of messages (1K/sec). We killed one of the consumers, and the cluster manager (currently AWS-ECS, probably moving to K8S soon) started a new one, so more than one rebalance took place.
Our most critical metric is latency, which we measure as the milliseconds between message creation in the publisher and message consumption in the subscriber. We saw the maximum latency spiking from a few milliseconds, to almost 15 seconds.
We also ran tests with rolling updates of code, and the results were worse: our deployment is not prepared for Kafka services, so we trigger a lot of rebalances. We'll need to work on that, but we are wondering what strategies other people follow for code deployment / autoscaling with the minimum possible delay.
Not sure whether it helps, but our message-processing requirements are pretty relaxed: we don't mind some messages being processed twice from time to time, and we are not very strict about message ordering.
We are using all default configurations, no tuning.
We need to reduce these latency spikes during rebalancing.
Can someone please give us some hints on how to work on this? Is changing configuration enough? Do we need to use a specific partition assignor? Implement our own?
What is the recommended approach to code deployment / autoscaling with the minimum possible delays?
Our Kafka version is 1.1.0: we installed Confluent Platform 4.1.0, and the libs include, for example, kafka/kafka_2.11-1.1.0-cp1.jar.
On the consumer side, we are using Kafka-streams 2.1.0.
Thank you for reading my question and your responses.

If the latency gap comes mainly from the rebalance itself, the alternative is to avoid triggering the rebalance at all: let AWS / K8s do their work and resume the bounced instance, and pay for the period of unavailability during the bounce. Note that for stateless instances this is usually better, while for stateful applications you should make sure the restarted instance can access its associated storage so that it does not have to bootstrap from the changelog.
To do that:
In Kafka 1.1, to reduce unnecessary rebalances you can increase the group's session timeout so that the coordinator becomes "less sensitive" to members not responding with heartbeats. Note that we have disabled the leave.group request for Streams' consumers since 0.11.0 (https://issues.apache.org/jira/browse/KAFKA-4881), so with a longer session timeout a member leaving the group would not trigger a rebalance, though the member rejoining would still trigger one. Still, one rebalance fewer is better than none.
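A minimal sketch of the session-timeout approach, assuming the resulting properties are later passed to new KafkaStreams(topology, props). The 60-second value is an illustrative assumption, not a recommendation, and the broker must allow it (it is bounded by group.min.session.timeout.ms and group.max.session.timeout.ms). Plain string keys are used so the snippet compiles without the Kafka jars; the class and method names are hypothetical.

```java
import java.util.Properties;

final class SessionTimeoutConfig {

    // Builds Streams properties with a longer consumer session timeout, so a
    // briefly missing member is not evicted from the group right away.
    static Properties streamsProperties(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Illustrative value (assumption): 60 s instead of the client default.
        props.put("session.timeout.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsProperties("latency-sensitive-app", "localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The trade-off is that a member that really died is also detected later, so its partitions stay unprocessed for up to the session timeout.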
In the coming Kafka 2.2, though, we've made a big improvement in optimizing rebalance scenarios, primarily captured in KIP-345 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances). With it, far fewer rebalances will be triggered by a rolling bounce, given reasonable settings for the configs introduced in KIP-345. So I'd strongly recommend upgrading to 2.2 and seeing if it helps your case.
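For the static membership route, a sketch under the assumption that both the client and the broker run versions that include KIP-345. Each instance gets a group.instance.id that is unique within the group and stable across restarts, and the session timeout is long enough to cover a restart. The INSTANCE_ID environment variable, the 2-minute timeout, and the class and method names are hypothetical.

```java
import java.util.Properties;

final class StaticMembershipConfig {

    // Builds Streams properties for a statically identified group member.
    static Properties streamsProperties(String applicationId, String bootstrapServers,
                                        String instanceId) {
        if (instanceId == null || instanceId.trim().isEmpty()) {
            throw new IllegalArgumentException("instanceId must be a stable, non-empty name");
        }
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Must be unique per instance and survive restarts (e.g. a stable pod name).
        props.put("group.instance.id", instanceId);
        // Assumption: a bounced instance is back within two minutes.
        props.put("session.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        // INSTANCE_ID is a hypothetical variable set by the deployment tooling.
        String instanceId = System.getenv().getOrDefault("INSTANCE_ID", "instance-0");
        Properties props = streamsProperties("latency-sensitive-app", "localhost:9092", instanceId);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```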

There are several configuration changes required in order to significantly decrease rebalance latency, especially during deployment rollout:
1. Keep Kafka-Streams on the latest version
Kafka-Streams rebalance performance keeps improving over time. One improvement worth highlighting is the incremental cooperative rebalancing protocol. Kafka-Streams has this feature out of the box (since version 2.4.0, with some improvements in 2.6.0) with the default partition assignor, StreamsPartitionAssignor.
2. Add the Kafka-Streams configuration property internal.leave.group.on.close = true to send a consumer leave group request on app shutdown
By default, Kafka-Streams doesn't send a consumer leave group request on graceful app shutdown. As a result, messages from the partitions that were assigned to the terminating app instance will not be processed until this consumer's session expires (after session.timeout.ms), and only after expiration will a new rebalance be triggered. To change this default behavior, use the internal Kafka Streams config property internal.leave.group.on.close = true (add it to the properties passed when creating Kafka Streams: new KafkaStreams(streamTopology, properties)). As the property is private, be careful and double-check before upgrading to a new version that the config is still there.
3. Decrease the number of simultaneously restarted app instances during deployment rollout
With Kubernetes, we can control how many app instances a new deployment creates at the same time, using the max surge and max unavailable properties. If we have tens of app instances, the default configuration will roll out multiple new instances while multiple instances are terminating at the same time. This means that multiple partitions will require reassignment to other app instances and multiple rebalances will be fired, which leads to significant rebalance latency. The preferable configuration for decreasing rebalance duration is max surge = 1 and max unavailable = 0.
4. Increase the number of topic partitions and app instances with a slight excess
Having a higher number of partitions leads to lower throughput per partition. Also, with a higher number of app instances, restarting a single one leads to smaller Kafka lag during rebalancing. Also, make sure that you don't have frequent up-scaling and down-scaling of app instances (as it triggers rebalances). If you have a few up-scaling and down-scaling events per hour, the configured minimum number of instances is probably too low, so you need to increase it.
For more details please take a look at the article Kafka-Streams - Tips on How to Decrease Re-Balancing Impact for Real-Time Event Processing On Highly Loaded Topics
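Tip 2 above can be sketched as follows. The property key is the one quoted in the answer; because it is internal, treat its presence in any given Kafka-Streams version as an assumption to verify. The helper copies an existing set of properties and adds the flag; the class and method names are hypothetical, and plain string keys keep the snippet free of Kafka dependencies.

```java
import java.util.Properties;

final class LeaveGroupOnCloseConfig {

    // Internal key quoted in the answer; check that it still exists before upgrading.
    static final String LEAVE_GROUP_ON_CLOSE = "internal.leave.group.on.close";

    // Returns a copy of the given properties with the leave-group flag enabled.
    static Properties withLeaveGroupOnClose(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.put(LEAVE_GROUP_ON_CLOSE, "true");
        return copy;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.put("application.id", "latency-sensitive-app");
        base.put("bootstrap.servers", "localhost:9092");
        Properties props = withLeaveGroupOnClose(base);
        // In the real service: new KafkaStreams(streamTopology, props)
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```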

Related

Akka Cluster heartbeat delays on Kubernetes

Our Scala application (Kubernetes deployment) constantly experiences Akka Cluster heartbeat delays of ≈3s.
Once we even had a 200s delay which also manifested itself in the following graph:
Can someone suggest things to investigate further?
Specs
Kubernetes 1.12.5
requests.cpu = 16
# limits.cpu not set
Scala 2.12.7
Java 11.0.4+11
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+AlwaysPreTouch
-Xlog:gc*,safepoint,gc+ergo*=trace,gc+age=trace:file=/data/gc.log:time,level,tags:filecount=4,filesize=256M
-XX:+PerfDisableSharedMem
Akka Cluster 2.5.25
Java Flight Recording
Some example:
timestamp delay_ms
06:24:55.743 2693
06:30:01.424 3390
07:31:07.495 2487
07:36:12.775 3758
There were 4 suspicious time points where lots of Java Thread Park events were
registered simultaneously for Akka threads (actors & remoting)
and all of them correlate to heartbeat issues:
Around 07:05:39 there were no "heartbeat was delayed" logs, but there was this one:
07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
No correlation with halt events or blocked threads was found during
the Java Flight Recording session, only two Safepoint Begin events
in proximity to delays:
CFS throttling
The application's CPU usage is low, so we thought it could be related to
how K8s schedules our application node for CPU.
But turning off CPU limits hasn't improved things much,
though the kubernetes.cpu.cfs.throttled.second metric disappeared.
Separate dispatcher
Using a separate dispatcher seems to be unnecessary since delays happen even when
there is no load. We also built a minimal application similar to our own which
does nothing but heartbeats, and it still experiences these delays.
K8s cluster
From our observations, it happens far more frequently on a couple of K8s nodes in
a large K8s cluster shared with many other apps, even when our application isn't loaded much.
A separate dedicated K8s cluster where our app is load tested has almost no issues with
heartbeat delays.

Have you been able to rule out garbage collection? In my experience, that's the most common cause for delayed heartbeats in JVM distributed systems (and the CFS quota in a Kubernetes/Mesos environment can make non-Stop-The-World GCs effectively STW, especially if you're not using a really recent (later than release 212 of JDK8) version of openjdk).
Every thread parking before "Safepoint begin" does lead me to believe that GC is in fact the culprit. Certain GC operations (e.g. rearranging the heap) require every thread to be in a safepoint, so every so often when not blocked, threads will check if the JVM wants them to safepoint; if so the threads park themselves in order to get to a safepoint.
If you've ruled out GC, are you running in a cloud environment (or on VMs where you can't be sure that the CPU or network aren't oversubscribed)? The akka-cluster documentation suggests increasing the akka.cluster.failure-detector.threshold value, which defaults to a value suitable for a more controlled LAN/bare-metal environment: 12.0 is recommended for cloud environments. This won't prevent delayed heartbeats, but it will decrease the chances of a spurious downing event because of a single long heartbeat (and also delay responses to genuine node loss events). If you want to tolerate a spike in heartbeat inter-arrival times from 1s to 200s, though, you'll need a really high threshold.
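To make the threshold concrete, here is a rough sketch of how a phi accrual failure detector turns a heartbeat gap into a suspicion value that is then compared against the threshold. The formula is a logistic approximation of the normal distribution's tail probability; the constants are recalled from Akka's implementation and should be treated as an assumption, as should the 1 s mean and 500 ms standard deviation used in main. The real detector derives both from recent heartbeat history and applies further adjustments.

```java
final class PhiSketch {

    // Suspicion level for a heartbeat gap of timeDiffMs, given the mean and
    // standard deviation of past inter-arrival times. Higher means more suspicious.
    static double phi(double timeDiffMs, double meanMs, double stdDevMs) {
        double y = (timeDiffMs - meanMs) / stdDevMs;
        // Logistic approximation of the normal CDF tail (constants are an assumption).
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        return timeDiffMs > meanMs
                ? -Math.log10(e / (1.0 + e))
                : -Math.log10(1.0 - 1.0 / (1.0 + e));
    }

    public static void main(String[] args) {
        double meanMs = 1000.0;   // assumed typical heartbeat interval
        double stdDevMs = 500.0;  // assumed jitter
        for (double gapMs : new double[] {1000, 2000, 3000, 4000, 5000}) {
            System.out.printf("gap=%.0f ms phi=%.2f%n", gapMs, phi(gapMs, meanMs, stdDevMs));
        }
    }
}
```

Because phi grows quickly with the gap, raising the threshold buys tolerance for moderately late heartbeats, but a gap hundreds of times the usual interval pushes phi far beyond any practical threshold.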

How many operating system resources are needed for one Java Kafka Consumer?

I want to use hundreds of thousands of KafkaConsumer instances. For example, I need 100_000 consumers for some architectural pattern. Is that OK, or should I refactor my system and use a few consumers for the whole system (for example, 10 consumers instead of 100_000)?
So, my questions are:
1. Is there a connection pool in KafkaConsumer, or does each consumer create its own connections to the Kafka brokers?
2. Is there a thread pool in KafkaConsumer, or does each consumer create its own thread (I hope it does not)?
3. What is the average memory consumption per KafkaConsumer?
4. What do you think about such an architectural pattern?

1,2) Consumers request metadata from one of the brokers which is the leader of the partition. Each consumer is able to handle all IO from a single thread, as the Java clients are designed around an event loop driven by poll(). You can also build multi-threaded consumers, but you'd need to take care of offset management. Refer to Confluent's documentation for more details regarding the implementation of the Java clients.
3) According to Apache Kafka and Confluent Enterprise Reference architecture,
Consumers use at least 2MB per consumer and up to 64MB in cases of
large responses from brokers (typical for bursty traffic)
4) The number of consumers you've mentioned is huge so you'd need a very good reason to go for 100,000 consumers. It depends on the scenario though, but even Netflix should be using a lot less than that.
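As a back-of-the-envelope check of the figures quoted above, the sketch below multiplies the per-consumer range by 100,000 consumers. It treats MB as MiB and assumes memory scales linearly with the number of consumers, which is a simplification; the class and method names are hypothetical.

```java
final class ConsumerMemoryEstimate {

    static final long MIB = 1024L * 1024L;

    // Total bytes needed for `consumers` clients at `mibPerConsumer` MiB each.
    static long totalBytes(long consumers, long mibPerConsumer) {
        return Math.multiplyExact(consumers, Math.multiplyExact(mibPerConsumer, MIB));
    }

    public static void main(String[] args) {
        long consumers = 100_000L;
        long low = totalBytes(consumers, 2);    // quoted minimum per consumer
        long high = totalBytes(consumers, 64);  // quoted maximum per consumer
        System.out.printf("low estimate:  %.1f GiB%n", low / (1024.0 * MIB));
        System.out.printf("high estimate: %.1f GiB%n", high / (1024.0 * MIB));
    }
}
```

Under these assumptions the range is roughly 195 GiB to about 6.1 TiB of client memory, before counting sockets and threads.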

Hazelcast IMap#get periodic extreme latency

We are experiencing very inconsistent performance when doing an IMap.get() on a particular Hazelcast map.
Our Hazelcast cluster is running version 3.8, has 8 members, and we connect to the cluster as a Hazelcast client. The map we are experiencing problems with has a backup count of 1.
We've isolated the slow operation to single IMap.get operation with logging on both sides of that line of code. The get normally takes milliseconds, but for a few keys it takes between 30 and 50 seconds. We can do numerous get operations on the same map and they all return quickly except for the same few keys. The particular map is relatively small, only about 2000 entries, and is of type <String,String>
If we restart a member in the cluster, we still experience the same latency but with different keys. This seems to indicate an issue with a particular member as the cluster re-balances when we stop/start a member. We've tried stopping each member individually and testing but experience the same symptoms with each member stopped in isolation. We’ve also tried reducing and increasing the number of members in the cluster but experience the same symptoms regardless.
We've confirmed with thread dumps that the generic operation threads are not blocked and have tried increasing the number of operation threads as well as enabling parallelization, but see no change in behavior. We've also enabled diagnostic logging in the cluster and don't see any obvious issues (no slow operations reported).
Looking at Hazelcast JMX MBeans, the maxGetLatency on the particular map is only about 1 second, much lower than what we are actually experiencing. This seems to indicate an issue with the client connection or underlying network. However, the number of slow keys is only about 1% of the total keys, so unless we are way out of balance, the issue again doesn't seem to be with a single member as you would expect about 1 in 8 keys to be slow. We've also confirmed from the Hazelcast logs that the cluster is stable. Members are not dropping out and rejoining.
Interestingly, if we stop and restart the whole cluster, we get good performance initially but after a few minutes it degrades back to the same state where a few specific IMap.get operations take 30+ seconds.
This exact code is not new and has been running just fine for quite a while. However, once this behavior started, it is consistently reproducible here. As far as we know, there have been no environmental changes.
Is there any diagnostic logging we can enable to get insight about the Hazelcast client? Are there any other diagnostic options available to track down where this latency is coming from? Unfortunately we are not able to reproduce this in any other environment which does seem to point at something either environmental or something unique to the cluster in this environment.
One other potentially interesting thing is that we see the following log statement every 6 seconds in each of the cluster members. The "backup-timeouts:1" is concerning but we aren't sure what it means.
INFO: [IP]:[PORT] [CLUSTER_NAME] [3.8] Invocations:1 timeouts:0 backup-timeouts:1
Any ideas or suggestions on how to debug this further would be very much appreciated.

Copy-pasted from https://github.com/hazelcast/hazelcast/issues/7689
InvocationFuture.get() has a built-in timeout when remote node doesn't respond at all. It doesn't wait forever. That timeout is defined by system property hazelcast.operation.call.timeout.millis.
When remote node doesn't respond in time, invocation fails with OperationTimeoutException.
Those timeouts are generally because of network problem between caller and remote or system pauses due to high load (GC pauses, OS freezes, IO latency etc).
You can decrease hazelcast.operation.call.timeout.millis to a lower value and enable diagnostics reports to see detailed metrics of the system.
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#diagnostics
In my case the cause was waiting on incoming TCP connections on the host.
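If passing -D flags is inconvenient, the same settings could be applied programmatically. This is a sketch under two assumptions: that Hazelcast picks these values up from JVM system properties, and that the code runs before any Hazelcast instance or client is created. Whether every property is honored on the client side in 3.8 is not confirmed by the source. The 10-second call timeout is an illustrative value and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HazelcastDiagnosticsSettings {

    // The diagnostics properties listed above, plus a lower call timeout.
    static Map<String, String> settings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hazelcast.diagnostics.enabled", "true");
        settings.put("hazelcast.diagnostics.metric.level", "info");
        settings.put("hazelcast.diagnostics.invocation.sample.period.seconds", "30");
        settings.put("hazelcast.diagnostics.pending.invocations.period.seconds", "30");
        settings.put("hazelcast.diagnostics.slowoperations.period.seconds", "30");
        // Illustrative value (assumption): fail slow calls after 10 s.
        settings.put("hazelcast.operation.call.timeout.millis", "10000");
        return settings;
    }

    // Copies the settings into JVM system properties.
    static void apply() {
        settings().forEach(System::setProperty);
    }

    public static void main(String[] args) {
        apply(); // call this before creating the Hazelcast instance or client
        settings().keySet().forEach(key -> System.out.println(key + "=" + System.getProperty(key)));
    }
}
```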

How to ensure java clients continue "working" in case whole hazelcast cluster is down

We are currently preparing hazelcast for going live in the next weeks. There is still one bigger issue left, that troubles our OPs department and could be a possible show stopper in case we cannot fix it.
Since we are maintaining a high-availability payment application, we have to survive in case the cluster is not available. Reasons could be:
Someone messed up the Hazelcast configuration and a map on the cluster grows until we hit an OOM (we had this on the test system).
There is some issue with the network cards/hardware that temporarily breaks the connection to the cluster.
The OPs guys reconfigured the firewall and accidentally blocked some necessary ports.
Whatever else.
I spent some time looking for a good existing solution, but the only one I found so far was to increase the number of backup servers, which of course does not solve the problem.
During my current tests the application completely stopped working, because after a certain number of retries the clients disconnect from the cluster and the Hibernate 2nd level cache no longer works. Since we are using Hazelcast throughout the whole ecosystem, this would kill 40 Java clients almost instantly.
Thus I wonder how we could keep the applications working, albeit more slowly, when the cluster is down. Our current approach is to switch over to a local Ehcache cache, but I think there should be a Hazelcast solution for that problem as well?

If I were you I would use a LocalSessionFactoryBean and set the cacheRegionFactory to a Spring Bean that can delegate a call to either Hazelcast or a NoCachingRegionFactory, if the Hazelcast server is down.
This is desirable, since Hibernate assumes the Cache implementation is always available, so you need to provide your own CacheRegion proxy that can decide the cache region routing at runtime.
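The routing idea can be sketched independently of Hibernate. The SimpleCache interface below is a hypothetical stand-in, not Hibernate's RegionFactory or Hazelcast's API: a wrapper tries the primary cache and degrades to a cache that stores nothing when the primary throws. UnavailableCache only simulates a cluster that is down.

```java
import java.util.Optional;

// Hypothetical minimal cache abstraction (not a Hibernate or Hazelcast type).
interface SimpleCache<K, V> {
    Optional<V> get(K key);
    void put(K key, V value);
}

// Cache that never stores anything: every lookup is a miss.
final class NoOpCache<K, V> implements SimpleCache<K, V> {
    public Optional<V> get(K key) { return Optional.empty(); }
    public void put(K key, V value) { /* intentionally ignored */ }
}

// Stand-in for a cache whose backing cluster is down.
final class UnavailableCache<K, V> implements SimpleCache<K, V> {
    public Optional<V> get(K key) { throw new IllegalStateException("cluster down"); }
    public void put(K key, V value) { throw new IllegalStateException("cluster down"); }
}

// Routes to the primary cache and falls back when the primary fails.
final class FailoverCache<K, V> implements SimpleCache<K, V> {
    private final SimpleCache<K, V> primary;
    private final SimpleCache<K, V> fallback;

    FailoverCache(SimpleCache<K, V> primary, SimpleCache<K, V> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    public Optional<V> get(K key) {
        try {
            return primary.get(key);
        } catch (RuntimeException e) {
            return fallback.get(key); // degrade to a miss instead of failing the caller
        }
    }

    public void put(K key, V value) {
        try {
            primary.put(key, value);
        } catch (RuntimeException e) {
            fallback.put(key, value);
        }
    }

    public static void main(String[] args) {
        SimpleCache<String, String> cache =
                new FailoverCache<>(new UnavailableCache<>(), new NoOpCache<>());
        cache.put("order-1", "cached");
        System.out.println("lookup present: " + cache.get("order-1").isPresent());
    }
}
```

A production version would probably remember that the primary is down for a while (a circuit breaker), so that each call does not pay the connection timeout again.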

How does ActiveMQ ProducerFlowControl work for Async Topics?

We are using ActiveMQ 5.3.1 as a standalone broker in our system, and every so often we get a big spike in messages (intentional, for example on failover, we re-subscribe). We currently have ProducerFlowControl turned on, as this seemed a sensible way to stop components from falling over during these spikes.
However, it seems we have an issue with the Flow Control - once it kicks in, the Producers seem to lock indefinitely, even once all inflight messages have been consumed. As soon as we see the message
Usage Manager memory limit (1048576) reached
Our producers can no longer send any messages to the topic. This seems odd - I thought it would be more of a "one-in-one-out" policy. I read somewhere that FlowControl does not work very well for Async Topic producers (which is exactly what we have) so I am wondering if there is a better way to configure this?
Also, how long does Flow Control last once it has kicked in? Will it throttle producers on that topic forever (until ActiveMQ is restarted? until producers are restarted?) or does it last a fixed or configurable amount of time (eg it waits for consumer to empty the topic, then waits 5 minutes)?
Any help would be appreciated. We are currently investigating turning Flow Control off and using File-based cursors instead. Any obvious downside with that approach?

Producer Flow Control occurs when you hit one of the size limits (memory, disk, etc.) for either the entire broker or a single destination. Once you hit it, producers on that destination (or the whole broker, depending on which limit you hit) are unable to send more messages until enough space frees up to hold the next one. So your one-out-one-in mental model is the right one (though if the messages are of different sizes, then it might not be truly one-for-one). This will continue for as long as the limit continues to be hit because producers are faster than consumers; it's not time-based, and it's not forever, just until consumers start catching up and you're not running into any limits.
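The "space must free up before the next send" behavior can be illustrated with a toy model. This is not ActiveMQ code: the class is hypothetical, and where a real broker would block or throttle the producer, trySend simply returns false so the example stays single-threaded. The 1048576-byte limit is the one from the log message in the question.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of a destination with a memory limit.
final class BoundedDestination {
    private final long limitBytes;
    private long usedBytes;
    private final Queue<byte[]> messages = new ArrayDeque<>();

    BoundedDestination(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Accepts the message only if it fits under the limit.
    synchronized boolean trySend(byte[] payload) {
        if (usedBytes + payload.length > limitBytes) {
            return false; // a real broker would hold the producer here
        }
        messages.add(payload);
        usedBytes += payload.length;
        return true;
    }

    // Consuming a message frees its bytes for future sends.
    synchronized byte[] consume() {
        byte[] message = messages.poll();
        if (message != null) {
            usedBytes -= message.length;
        }
        return message;
    }

    public static void main(String[] args) {
        BoundedDestination topic = new BoundedDestination(1_048_576L);
        byte[] payload = new byte[262_144]; // four of these fill the limit exactly
        for (int i = 1; i <= 5; i++) {
            System.out.println("send " + i + " accepted: " + topic.trySend(payload));
        }
        topic.consume();
        System.out.println("send after consume accepted: " + topic.trySend(payload));
    }
}
```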
If you're hitting PFC, it could mean that you haven't properly sized some limit in your broker to handle bursts of data or periods where consumers are offline (so you should stop and do that, not turn off PFC), or that you have a systemic problem where your producers are always going to outrun your consumers (so you need to either speed up your consumers, slow down your producers, find a way to allow multiple consumers to consume messages in parallel, or configure the broker to drop some of your messages for you).
Or it could mean that your topics have some leftover durable subscriptions for consumers that are currently (permanently?) offline, and the broker has to keep those messages so it can deliver them when the consumer comes back online (which it might never do). Because the broker's got to hold onto those messages, it can't allow any new ones to be sent, even though the currently online consumers have processed their copies of all of the messages. This one's my best guess based on what you've written, though it could certainly be other things instead.
In any case, PFC kicking in is almost certainly just a secondary symptom of something else wrong with how you've designed your system, and you should figure that out and fix it rather than just turning off PFC. (After all, if you're perpetually producing faster than you can consume, writing messages to disk is just going to run you out of disk space if you turn off PFC.)
Most importantly, 5.3.1 is VERY, VERY old, and there have been a TON of improvements between then and 5.10.0 (the current released version). I wouldn't even consider running a broker on anything before 5.5.1 because of all of the bugs that were found in the versions before it, but really I wouldn't recommend running on anything before 5.8.0 and there'd have to be a good reason not to just go all the way to 5.10.0. And you're certainly not going to get any support for 5.3.1 from the community if you hit something that you think might be a bug, so do yourself a favor and upgrade to a version that was released more recently than 2010. (And hopefully just upgrading will fix the bad behavior you're seeing, but if not, you'll be in a better position to ask for someone to troubleshoot the problem.)
