We are experiencing very inconsistent performance when doing an IMap.get() on a particular Hazelcast map.
Our Hazelcast cluster is running version 3.8, has 8 members, and we connect to the cluster as a Hazelcast client. The map we are experiencing problems with has a backup count of 1.
We've isolated the slow operation to a single IMap.get() call, with logging on both sides of that line of code. The get normally takes milliseconds, but for a few keys it takes between 30 and 50 seconds. We can do numerous get operations on the same map and they all return quickly, except for the same few keys. The map in question is relatively small, only about 2000 entries, and is of type <String, String>.
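For illustration, the timing check is essentially a sketch like the following (the map and key names here are placeholders, not our real ones):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Minimal sketch of how we time a single get; "problem-map" and "some-key" are placeholders.
public class GetLatencyProbe {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> map = client.getMap("problem-map");
        long start = System.nanoTime();
        String value = map.get("some-key");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("get took " + elapsedMs + " ms, value = " + value);
        client.shutdown();
    }
}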
If we restart a member in the cluster, we still experience the same latency but with different keys. This seems to indicate an issue with a particular member as the cluster re-balances when we stop/start a member. We've tried stopping each member individually and testing but experience the same symptoms with each member stopped in isolation. We’ve also tried reducing and increasing the number of members in the cluster but experience the same symptoms regardless.
We've confirmed with thread dumps that the generic operation threads are not blocked, and have tried increasing the number of operation threads as well as enabling parallelization, but see no change in behavior. We've also enabled diagnostic logging in the cluster and don't see any obvious issues (no slow operations reported).
Looking at Hazelcast JMX MBeans, the maxGetLatency on the particular map is only about 1 second, much lower than what we are actually experiencing. This seems to indicate an issue with the client connection or underlying network. However, the number of slow keys is only about 1% of the total keys, so unless we are way out of balance, the issue again doesn't seem to be with a single member as you would expect about 1 in 8 keys to be slow. We've also confirmed from the Hazelcast logs that the cluster is stable. Members are not dropping out and rejoining.
Interestingly, if we stop and restart the whole cluster, we get good performance initially but after a few minutes it degrades back to the same state where a few specific IMap.get operations take 30+ seconds.
This exact code is not new and has been running just fine for quite a while. However, since this behavior started, it has been consistently reproducible here. As far as we know, there have been no environmental changes.
Is there any diagnostic logging we can enable to get insight into the Hazelcast client? Are there any other diagnostic options available to track down where this latency is coming from? Unfortunately we are not able to reproduce this in any other environment, which does seem to point at something environmental or something unique to the cluster in this environment.
One other potentially interesting thing is that we see the following log statement every 6 seconds in each of the cluster members. The "backup-timeouts:1" is concerning but we aren't sure what it means.
INFO: [IP]:[PORT] [CLUSTER_NAME] [3.8] Invocations:1 timeouts:0 backup-timeouts:1
Any ideas or suggestions on how to debug this further would be very much appreciated.
Copy-pasted from https://github.com/hazelcast/hazelcast/issues/7689
InvocationFuture.get() has a built-in timeout for when the remote node doesn't respond at all; it doesn't wait forever. That timeout is defined by the system property hazelcast.operation.call.timeout.millis.
When the remote node doesn't respond in time, the invocation fails with an OperationTimeoutException.
Those timeouts are generally caused by network problems between the caller and the remote member, or by system pauses due to high load (GC pauses, OS freezes, IO latency, etc.).
You can decrease hazelcast.operation.call.timeout.millis to a lower value and enable diagnostics reports to see detailed metrics of the system.
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#diagnostics
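If it's more convenient than -D flags, the same properties can also be set programmatically on the member configuration; a minimal sketch, with an illustrative call-timeout value:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Minimal sketch: the same flags as above, set on the member Config instead of
// -D system properties. The 10-second call timeout is only an illustrative value.
public class DiagnosticsEnabledMember {
    public static void main(String[] args) {
        Config config = new Config();
        config.setProperty("hazelcast.operation.call.timeout.millis", "10000");
        config.setProperty("hazelcast.diagnostics.enabled", "true");
        config.setProperty("hazelcast.diagnostics.metric.level", "info");
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
        System.out.println("Members: " + member.getCluster().getMembers());
    }
}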
In my case the cause was waiting on incoming TCP connections on the host.
Related
We have a very demanding application that is split into multiple components, all running on the same physical machine. The machine has 56 cores and 32 GB of RAM.
The application has several components in Java and Scala, and one in Python. All are very intensive:
There's a lot of network IO as the components communicate with other devices in the network.
There's a lot of localhost IO - communication between components.
There's a lot of CPU usage - parsing the data coming over the network and running calculations on it.
Recently, something weird was discovered:
When the components have their log level set to DEBUG, the system runs smoothly. When the log level is set to INFO, the system behaves erratically and seems to have a lot of contention around CPU (workers timing out, messages not being sent between components, etc.).
We do write a lot to DEBUG.
Is it possible that the IO caused by DEBUG (writing to many log files) actually reduces CPU contention and improves system stability?
It's rare, but, yes, that is possible. The reasons for it are extremely varied, though. All I can provide is a single example, just to prove that this can happen.
Example: failure to apply Nagle's/Metcalfe's algorithm.
Let's say we are using an optimistic locking setup somewhere. Some know this as a 'retry based locking system'. For example, PostgreSQL (or just about any other major, modern DB engine) in a TransactionIsolation.SERIALIZABLE configuration.
I'll use ethernet as an example.
When you hook up 10 computers on the exact same line, to form a network, there's a problem: If 2 computers send data at the same time, then everybody just reads noise - 2 signals overlapping are useless.
The solution in the 80s was a so-called token-ring network: Some computer would be the 'controller' and would be the only one allowed to 'speak', at least at first, to set up a strategy. It communicates to each other system on the wire which position they have in the 'ring'. The system worked as follows:
Computer 1 can send, and only computer 1. It gets 95ms to send whatever it wants. If it has nothing to send, it can send an 'I cede my time' signal.
The controller computer then sends a signal that indicates 'okay, that was it. Next!'. Computer 1 must have already stopped sending; this is merely to ensure every system keeps the same 'clock'. This sync window is 5msec.
Computer 2 now gets 95ms.
.. and so on.
After computer 10, computer 1 can send again.
Note that a full second passes between each 'cycle', so your ping time is on average about 500msec, which is a lot.
This system is both fair and contention free: Assuming no egregiously wrong clocks or misbehaving systems, 2 computers can never talk at the same time, and everybody gets fair use. However, if computer 1 wants to send a file to computer 2, then only 10% of the actual available capacity is used, and ping time is a full second here.
As a consequence, as simple and nice and elegant as it sounds, token ring sucked.
Enter Metcalfe and ethernet. His solution was much, much simpler:
Whatever, there is no system, just send whenever you want. Every computer just sends the moment they feel like sending.
To solve the problem of 'noise', all senders also check what they sent. If they notice noise, they know there was a conflict. All systems can detect and will ignore noise, and all senders that detect that their own message ended up as noise will just wait a little bit and resend.
The one problem is that computers are annoyingly reliable: It's like two people who are about to walk into each other on the street, and they both go through a routine where they both lean left, apologize, both lean right, then both lean left again. To avoid that, network cards actually roll dice (figuratively). The pause between detecting a failed send and re-sending is [A] exponentially growing (if sending fails repeatedly, keep waiting longer and longer between attempts), and [B] includes a random amount of waiting. That random amount is crucial - it avoids the scenario where the conflicting senders continually wait identical times and hence keep conflicting forever.
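For illustration, a minimal sketch of that kind of backoff calculation (the names and values are made up for the example):

import java.util.concurrent.ThreadLocalRandom;

// Exponentially growing wait plus a random jitter, so that two colliding
// senders don't keep retrying in lockstep.
final class Backoff {
    static long nextDelayMillis(int attempt, long baseMillis, long maxJitterMillis) {
        long exponential = baseMillis * (1L << Math.min(attempt, 16)); // cap the shift to avoid overflow
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis);
        return exponential + jitter;
    }
}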
However, if too many computers all want to send too much data and you haven't fine-tuned that waiting stuff properly, then at some point your network's performance falls apart as the vast majority of packets are lost due to conflicts.
However, slow down one system and perhaps the conflicts are much more rare or even don't occur at all.
As an example, imagine 10 computers where all computers want to send 1GB to their neighbour, simultaneously. A token ring setup would be faster than ethernet, by a decent margin: The bottleneck is the wire, every computer has stuff to send if they get a 'window'. The token ring is 100% utilized and never wastes time on noise, whereas the ethernet system probably has on the order of 20% of the time 'wasted' due to clashing packets.
Introducing hard locking - which many log frameworks do (they either lock outright, or the code that writes to disk locks, or the kernel driver that actually ends up writing to disk does, or the log framework invokes fsync to tell the OS to block until the write went through, because the log statements immediately before a hard crash are usually the most important ones, and without fsync you'd lose them) - lets you turn an ethernet-like system into something that's more like token ring. Even without hard locking, slowing down a few computers on an ethernet system can speed up the sum total set of jobs.
I doubt you're writing a network kernel driver here. However, the exact same principle can be at work, for example when multiple threads are all reading/writing to an optimistic-locking based DB using TransactionIsolation.SERIALIZABLE (for example: PostgreSQL) and are properly doing what they need to do when retry exceptions occur due to contention (namely: rerun the code that interacts with the DB). The ethernet vs. token ring example just feels like the cleanest way to explain the principle.
Just about every other optimistic locking scenario has the same issue. If there's way more to do than the system has the capacity to do, the 'error rate' goes up so far that everything slows down to a crawl, and bizarrely enough slowing certain parts down actually speeds up the total system.
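To make that concrete, here is a hedged sketch of the 'rerun the code that interacts with the DB' pattern against a SERIALIZABLE PostgreSQL connection; the connection details, the accounts table and the backoff values are invented for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

public class SerializableRetry {

    // PostgreSQL reports serialization conflicts with SQLSTATE 40001.
    private static final String SERIALIZATION_FAILURE = "40001";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) { // invented connection details
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            conn.setAutoCommit(false);
            runWithRetry(conn, 5);
        }
    }

    static void runWithRetry(Connection conn, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                doTransactionalWork(conn);
                conn.commit();
                return;
            } catch (SQLException e) {
                conn.rollback();
                if (!SERIALIZATION_FAILURE.equals(e.getSQLState()) || attempt >= maxAttempts) {
                    throw e;
                }
                // Wait longer after each failed attempt, plus random jitter, before rerunning.
                long backoff = (1L << attempt) * 10 + ThreadLocalRandom.current().nextLong(50);
                Thread.sleep(backoff);
            }
        }
    }

    static void doTransactionalWork(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("UPDATE accounts SET balance = balance - 10 WHERE id = 1");
            stmt.executeUpdate("UPDATE accounts SET balance = balance + 10 WHERE id = 2");
        }
    }
}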
If one of the 'high contention' blocks of code logs to debug and the other does not, then changing the 'ignore all logs below this level' option from INFO to DEBUG introduces fsync and/or disk locks in that block of code, which slows it way down and thus could lead to the total system actually becoming faster. Even the high contention block can come out ahead: once it is done with the slow-as-molasses log call, the odds that the job it is actually trying to do needs to be retried due to optimistic lock contention are much, much lower.
Optimistic Locking oversaturation is just one example of 'slow one part down, the total is faster'. There are many more.
But it is highly likely that some part of your system is wasting tons of time either retrying or inefficiently waiting on locks - for example busy-waiting, checking very frequently whether a lock is now free when that check itself is expensive. Example: you aren't using locks, you have something like:
while (!checkIfSomeSystemIsFree()) {
    Thread.sleep(50L);
    LOG.debug("foo");
}
If checkIfSomeSystemIsFree() is fairly CPU intensive, then many threads all doing the above can trivially run faster by enabling debug logging.
We are starting to work with Kafka Streams; our service is a very simple stateless consumer.
We have tight requirements on latency, and we are facing latency problems that are too high when the consumer group is rebalancing. In our scenario, rebalancing will happen relatively often: rolling updates of code, scaling the service up/down, containers being shuffled by the cluster scheduler, containers dying, hardware failing.
One of the first tests we have done is having a small consumer group with 4 consumers handling a small volume of messages (1K/sec) and killing one of them; the cluster manager (currently AWS ECS, probably soon moving to K8s) starts a new one, so more than one rebalance happens.
Our most critical metric is latency, which we measure as the milliseconds between message creation in the publisher and message consumption in the subscriber. We saw the maximum latency spiking from a few milliseconds, to almost 15 seconds.
We have also done tests with some rolling updates of code, and the results are worse, since our deployment is not prepared for Kafka services and we trigger a lot of rebalances. We'll need to work on that, but we're wondering what strategies other people follow for code deployment / autoscaling with the minimum possible delays.
Not sure whether it helps, but our requirements are pretty relaxed regarding message processing: we don't care about some messages being processed twice from time to time, nor are we very strict about the ordering of messages.
We are using all default configurations, no tuning.
We need to improve these latency spikes during rebalancing.
Can someone, please, give us some hints on how to work on it? Is touching configurations enough? Do we need to use a specific partition assignor? Implement our own?
What is the recommended approach to code deployment / autoscaling with the minimum possible delays?
Our Kafka version is 1.1.0; after looking at the libs (for example kafka/kafka_2.11-1.1.0-cp1.jar) it looks like we installed Confluent Platform 4.1.0.
On the consumer side, we are using Kafka Streams 2.1.0.
Thank you for reading my question and your responses.
If the gap comes mainly from the rebalance itself, one option is to not trigger a rebalance at all: just let AWS / K8s do their work, resume the bounced instance, and pay the cost of the unavailability period during the bounce. Note that for stateless instances this is usually better, while for stateful applications you'd better make sure the restarted instance can access its associated storage so that it can avoid bootstrapping from the changelog.
To do that:
In Kafka 1.1, to reduce unnecessary rebalances you can increase the session timeout of the group so that the coordinator becomes "less sensitive" to members not responding with heartbeats. Note that we disabled the leave-group request for Streams' consumers since 0.11.0 (https://issues.apache.org/jira/browse/KAFKA-4881), so with a longer session timeout a member leaving the group would not trigger a rebalance, though a member rejoining would still trigger one. Still, one rebalance fewer is better than none.
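As an illustration, a minimal sketch of passing a longer session timeout (and a matching heartbeat interval) through the Streams consumer config; the application id, broker address and timeout values are only placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

final class StreamsSessionTimeoutConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateless-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");     // placeholder
        // A longer session timeout makes the coordinator "less sensitive" to missing heartbeats.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 60000);
        // Keep the heartbeat interval well below the session timeout (typically no more than a third of it).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 20000);
        return props;
    }
}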
In the upcoming Kafka 2.2, though, we've made a big improvement in optimizing rebalance scenarios, primarily captured in KIP-345 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances). With that, far fewer rebalances will be triggered by a rolling bounce, given the reasonable config settings introduced in KIP-345. So I'd strongly recommend you upgrade to 2.2 and see if it helps your case.
There are several configuration changes required in order to significantly decrease rebalance latency, especially during deployment rollouts:
1. Keep up with the latest version of Kafka Streams
Kafka Streams rebalance performance gets better and better over time. A feature improvement worth highlighting is the incremental cooperative rebalancing protocol. Kafka Streams has this feature out of the box (since version 2.4.0, with further improvements in 2.6.0), via the default partition assignor StreamsPartitionAssignor.
2. Add the Kafka Streams configuration property internal.leave.group.on.close = true to send a consumer leave-group request on app shutdown
By default, Kafka Streams doesn't send a consumer leave-group request on graceful app shutdown. As a result, messages from the partitions that were assigned to the terminating app instance will not be processed until the session of that consumer expires (after session.timeout.ms), and only after that expiration is a new rebalance triggered. In order to change this default behavior, use the internal Kafka Streams config property internal.leave.group.on.close = true (added to the properties passed to new KafkaStreams(streamTopology, properties)). As the property is private, be careful and double-check before upgrading to a new version that the config is still there.
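A hedged sketch of wiring that internal property in (remember it is not part of the public API, so verify it still exists after any upgrade):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

final class LeaveGroupOnCloseExample {
    static KafkaStreams build(Topology streamTopology, Properties properties) {
        // Internal, non-public config: ask the consumer to send a leave-group
        // request on close so a rebalance starts immediately on graceful shutdown.
        properties.put("internal.leave.group.on.close", true);
        return new KafkaStreams(streamTopology, properties);
    }
}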
3. Decrease the number of simultaneously restarted app instances during deployment rollout
Using Kubernetes, we can control how many app instances are replaced at the same time during a new deployment. This is achievable with the max surge and max unavailable properties. If we have tens of app instances, the default configuration will roll out multiple new instances while multiple old instances are terminating at the same time. That means multiple partitions will require reassignment to other app instances, multiple rebalances will be fired, and rebalance latency will be significant. The most preferable configuration for decreasing rebalance duration is max surge = 1 and max unavailable = 0.
4. Increase the number of topic partitions and app instances with a slight excess
A higher number of partitions means lower throughput per single partition, and with a higher number of app instances the restart of a single one leads to a smaller Kafka lag during rebalancing. Also, make sure that you don't have frequent up-scaling and down-scaling of app instances (as it triggers rebalances). If you see several up-scalings and down-scalings per hour, the configured minimum number of instances is probably too low, so increase it.
For more details please take a look at the article Kafka-Streams - Tips on How to Decrease Re-Balancing Impact for Real-Time Event Processing On Highly Loaded Topics
I'm just curious how to solve the connection-pooling problem in a scalable Java application.
Imagine I have a Java web application with HikariCP set up (max pool size is 20) and PostgreSQL with max allowed connections of 100.
And now I want to implement a scalability approach for my web app (no matter how), even with autoscaling. So I don't know how many web app replicas there will eventually be; the number may change dynamically (for various reasons, e.g. cluster workload).
But there is the problem: when I create more than 5 web app replicas, my total connection count exceeds the maximum allowed connections.
Are there any best practices to solve this problem (except evident increasing max allowed connections/decreasing pool size)?
Thanks
You need an orchestrator over the web application. It would be responsible for scaling in and out, and it would manage the connections so as not to exceed the limit of 100, opening and closing connections according to the traffic.
Nevertheless, my recommendation is to consider migrating to a NoSQL database, which is a more suitable solution for scalability and performance.
I'll start by saying that whatever you do, as long as you're restricted by 100 connections to your DB - it will not scale!
That said, you can optimize and "squeeze" performance out of it by applying a couple of known tricks. It's important to understand the trade-offs (availability vs. consistency, latency vs. throughput, etc.):
Caching: if you can anticipate certain select queries you can calculate them offline (maybe even from a replica?) and cache the results. The tradeoff: the user might get results which are not up-to-date
Buffering/throttling: all updates/inserts go to a queue and only a few workers are allowed to pull from the queue and update the DB. Tradeoff: you get more availability but the system becomes "eventually consistent" (since updates won't be visible right away). A minimal sketch of this idea is shown at the end of this answer.
It might come to the point where you'll have to run the selects asynchronously as well, which means that the user submits a query, and when it's ready it'll be "pushed" back to the client (or the client can keep "polling" every few seconds). It can be implemented with a callback as well.
By separating the updates (writes) from reads you'll be able to get more performance by creating replicas that are "read only" and which can be used by the webservers for read-queries.
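As a rough illustration of the buffering/throttling idea above, here is a minimal sketch; the queue size, worker count and the DbWrite interface are all invented for the example:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Request handlers enqueue write commands; only a small fixed pool of workers
// (sized well below the DB connection limit) drains the queue and touches the DB.
final class WriteThrottler {

    interface DbWrite { void execute(); }   // hypothetical unit of DB work

    private static final int WORKERS = 4;
    private final BlockingQueue<DbWrite> queue = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

    WriteThrottler() {
        for (int i = 0; i < WORKERS; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        queue.take().execute();   // at most WORKERS concurrent DB writers
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Called by request handlers; returns false (or could block instead) when the buffer is full.
    boolean submit(DbWrite write) {
        return queue.offer(write);
    }
}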
I have a scenario with these particular demands:
Production ready & stable.
Point to point connection, with the producer behind a firewall and a consumer in the cloud. It might be possible to split the traffic between a couple of producers/consumers, but all the traffic still has to traverse a single WAN connection, which will probably be the bottleneck.
High throughput - something on the order of 300 Mb/sec (maybe up to 1 Gb/sec!). Message sizes vary from ~1 KB to possibly several MB.
Guaranteed delivery a must - every message has to arrive at the consumer eventually, so we need to start saving messages to disk in the event of a momentary network outage or risk running out of memory.
Message order is not important, messages are timestamped and can be re-arranged at the consumer.
Highly preferable but not as important - should run on both linux & windows (JVM seems the obvious choice)
I've been looking at so many MQs lately, and I don't have any hands-on experience with any.
Thought it will be a better idea to ask someone with experience.
We're considering mostly Kafka, but I'm not sure it's the best for our use case; it seems to be tailored to distributed deployments & multiple topics/consumers/producers. Also, it's definitely not production ready on Windows.
What about Apache ActiveMQ or Apollo/Artemis? RabbitMQ seems not to be a good fit for our performance requirements. Or maybe there's some Java library that has the features we need without a middleman broker?
Any help making sense of this kludge would be greatly appreciated.
If anyone comes across this, we went with Kafka in the end. Its performance is impressive and so far it's very stable on Linux. No attempt yet to run it on Windows in production deployments.
UPDATE 12/3/2017:
Works fine and very stable on Linux, but on Windows this is not usable in production. Old data never gets deleted due to leaky file handles; the relevant Jira has been ignored since 2013: https://issues.apache.org/jira/browse/KAFKA-1194
We are currently preparing Hazelcast for going live in the next few weeks. There is still one bigger issue left that troubles our Ops department and could be a possible show stopper in case we cannot fix it.
Since we are maintaining a high availability payment application, we have to survive in case the cluster is not available. Reasons could be:
Someone messed up the Hazelcast configuration and a map on the cluster grows until we hit OOM (we had this on the test system).
There is some issue with the network cards/hardware that temporarily breaks the connection to the cluster.
The Ops guys reconfigured the firewall and accidentally blocked some ports that are necessary.
Whatever else
I spent some time looking for a good existing solution, but the only solution so far was to increase the number of backup servers, which of course does not solve the problem.
During my current tests the application completely stopped working, because after a certain number of retries the clients disconnect from the cluster and the Hibernate 2nd level cache no longer works. Since we are using Hazelcast throughout the whole ecosystem, this would kill 40 Java clients almost instantly.
Thus I wonder how we could ensure that the applications keep working, albeit more slowly of course, when the cluster is down. Our current approach is to switch over to an Ehcache local cache, but I think there should be a Hazelcast solution for that problem as well?
If I were you I would use a LocalSessionFactoryBean and set the cache region factory to a Spring bean that can delegate calls to either Hazelcast or a NoCachingRegionFactory if the Hazelcast cluster is down.
This is desirable, since Hibernate assumes the Cache implementation is always available, so you need to provide your own CacheRegion proxy that can decide the cache region routing at runtime.
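A rough sketch of that idea, deciding at startup which region factory to use based on whether the cluster is reachable (a true runtime-switching proxy would wrap the RegionFactory calls instead); the package name, timeout and class names are assumptions that depend on your Spring/Hibernate/Hazelcast versions:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.orm.hibernate5.LocalSessionFactoryBean;

@Configuration
public class CacheRegionFactoryConfig {

    @Bean
    public LocalSessionFactoryBean sessionFactory(DataSource dataSource) {
        LocalSessionFactoryBean factory = new LocalSessionFactoryBean();
        factory.setDataSource(dataSource);
        factory.setPackagesToScan("com.example.domain");   // placeholder package

        boolean clusterUp = hazelcastIsReachable();
        Properties props = new Properties();
        props.put("hibernate.cache.use_second_level_cache", String.valueOf(clusterUp));
        if (clusterUp) {
            props.put("hibernate.cache.region.factory_class",
                    "com.hazelcast.hibernate.HazelcastCacheRegionFactory");
        }
        // When the cluster is down, the second-level cache stays disabled and Hibernate
        // falls back to its no-caching default, so the app keeps running without Hazelcast.
        factory.setHibernateProperties(props);
        return factory;
    }

    private boolean hazelcastIsReachable() {
        // Try to connect with a short timeout; addresses come from the default client config.
        try {
            ClientConfig config = new ClientConfig();
            config.getNetworkConfig().setConnectionTimeout(2000);
            HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
            client.shutdown();
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}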