I have an AWS-hosted Elasticsearch cluster that fails whenever heap usage reaches 75% and the (CMS) garbage collector runs.
The cluster runs ES version 7.9 with 3 dedicated master nodes (r5.large.elasticsearch) and 4 data nodes (r5.xlarge.elasticsearch).
That is:
4 vCPU / 32 GB per data node (16 GB heap), with 1 TB of SSD storage each, for a total of 4 TB of storage.
2 vCPU / 16 GB per master node.
The cluster holds 33 indices with 1-3 primary shards each and 0-1 replicas (0 for the older ones), with shard sizes ranging from 50 MB to 60 GB, though most shards hold around 30 GB.
So about 65 shards in total.
Whenever the JVM memory pressure goes up to 75% and the garbage collector (GC) runs, we start to get timeouts; the node running the GC drops out for a moment and then comes back, causing shard reallocation, more timeouts, and increased indexing and search latencies.
Checking the error logs we could see a lot of:
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [6.4s] collecting in the last [7.2s]
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [3.6s] collecting in the last [4.4s]
...
At peak hours our indexing rate is about 4k operations/min and the search rate is 1k operations/min.
The GC runs about 3 times a day per data node (about 12 times a day across the cluster), and the maximum heap usage among the 4 data nodes oscillates between 35% and 75%; it never goes above 75%. When the GC is not running, CPU stays consistently at an average of 13-15%, so we're highly confident that the instance size is appropriate for our current traffic.
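(For reference, these heap and GC figures can be cross-checked directly against the cluster; the domain endpoint below is a placeholder, and this assumes the domain allows these APIs:)

    curl "https://<es-domain-endpoint>/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu"
    curl "https://<es-domain-endpoint>/_nodes/stats/jvm?pretty"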
Followed some guides on how to avoid node crashes, but:
Rarely aggregate on text fields.
No complex aggregations.
Shards are evenly distributed, and the number of shards per index seems to be correct.
Very small number of wildcard queries, which are manually triggered.
All the documents are small-medium sized (500 - 1000 characters).
So, any ideas on what could possibly be causing these crashes and long GC runs?
I found some related questions with no answers, such as this one.
I am using an Infinispan cache to store values. The code writes to the cache every 10 minutes, and the cache reaches a size of about 400 MB.
It has a time-to-live of about 2 hours, and the maximum number of entries is 16 million, although in my current tests the number of entries doesn't go above 2 million or so (I can see this by checking the MBeans/metrics in JConsole).
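(For context, this is roughly the configuration, sketched here with Infinispan's programmatic ConfigurationBuilder; the exact method names vary between Infinispan versions, e.g. older releases use eviction().maxEntries() instead of memory().maxCount():)

    import java.util.concurrent.TimeUnit;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    // Sketch only: entries live for ~2 hours, with a cap of 16 million entries.
    Configuration cfg = new ConfigurationBuilder()
            .expiration().lifespan(2, TimeUnit.HOURS)
            .memory().maxCount(16_000_000L)
            .build();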
When I start JBoss, the Java heap size is 1.5 GB to 2 GB. The -Xmx setting for the maximum memory allocated to JBoss is 4 GB.
When I disable the Infinispan cache, the heap usage stays flat at around 1.5 GB to 2 GB. It is very constant and stays at that level.
=> The problem is: when I have the Infinispan cache enabled, the Java heap grows to about 3.5-4 GB, which is far more than expected.
I have done a heap dump to check the size of the cache in Eclipse MAT, and it is only 300-400 MB (which is OK).
So I would expect the memory usage to go to 2.5 GB and stay steady at that level, since the initial heap size is 2 GB and the maximum cache size should only be around 500 MB.
However, it continues to grow and grow over time. Every 2 or 3 hours a garbage collection runs and brings the usage down to about 1-1.5 GB, but it then increases again within 30 minutes, up to 3.5 GB.
The number of entries stays steady at about 2 million, so it is not due to more entries going into the cache. (Also, the number of evictions stays at 0.)
What could be holding on to this amount of memory if the cache is only 400-500mb?
Is it a problem with my garbage collection settings? Or should I look at Infinispan settings?
Thanks!
Edit: you can see the heap size over time here.
What is strange is that even after what looks like a full GC, the memory shoots back up to 3 GB. This corresponds to more entries going into the cache.
Edit: It turns out this has nothing to do with Infinispan. I narrowed the problem down to a single line of code that uses a lot of memory (about 1 GB more than without the call).
But I do think more and more memory is being taken by the Infinispan cache, naturally, because more entries are being added over the 2-hour time-to-live.
I also need to have upwards of 50 users querying Infinispan. When the heap reaches a high value like this (even without the memory leak mentioned above), I know it's not an error scenario in Java; however, I need as much memory available as possible.
Is there any way to "encourage" a heap dump past a certain point? I have tried using GC options to collect at a given proportion of heap for the old gen, but in general the heap usage tends to creep up.
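(For reference, the kind of "collect at a given proportion of heap for the old gen" options mentioned above would, with CMS, look something like the following; 70 is just an illustrative threshold:)

    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly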
Probably what you're seeing is the JVM not collecting objects which have been evicted from the cache. Caches in general have a curious relationship with the prevailing idea of generational GC.
The generational GC idea is that, broadly speaking, there are two types of objects in the JVM: short-lived ones, which are used and thrown away quickly, and longer-lived ones, which are usually used throughout the lifetime of the application. In this model you want to tune your GC so that most of your effort goes into identifying the short-lived objects, which means you avoid looking at the long-lived objects as much as possible.
Caches disrupt this pattern by introducing objects with intermediate lifespans (i.e. a few seconds / minutes / hours, depending on your cache). These objects often get promoted to the tenured generation, where they're not usually looked at until a full GC becomes necessary, even after they've been evicted from the cache.
If this is what's happening then you've a couple of choices:
ignore it: let the full GC do its thing, and just be aware that this is what's happening.
try to tune the GC so that it takes longer for objects to get promoted to the tenured generation. There are some GC flags which can help with that.
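For example (illustrative values only; the right numbers depend on heap size and workload), promotion behaviour can be influenced with HotSpot flags such as:

    -XX:MaxTenuringThreshold=15    (survive more young collections before promotion; 15 is the maximum)
    -Xmn2g                         (a larger young generation gives cache entries more time to die young)
    -XX:SurvivorRatio=6            (relatively larger survivor spaces)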
Recently I have been playing with Cassandra.
I have been experiencing latency spikes when adding nodes to Cassandra while nodetool stream limits are set on all existing C* nodes.
To be specific, the cluster originally has 4 C* nodes, and I add 2 additional nodes once the original ones have warmed up, at 1200 s as shown in the figure.
The amount of data stored is 50 GB on these 4 nodes and the key size is 20 KB each.
Nodetool is used to set the 'stream limits' to 1MB/s.
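(Presumably something along the lines of the following; note that in most Cassandra versions nodetool interprets this value in megabits per second, so 1 MB/s vs 1 Mb/s is worth double-checking:)

    nodetool setstreamthroughput 1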
YCSB is used to generate a read-dominant (90% reads) workload at 80% of the maximum throughput that can be reached by the 4 existing nodes, throughout the scale-up procedure.
The figure shows the service latency reported by YCSB every 10 seconds.
(Figure: time vs. read latency on C*)
Does anyone have an explanation for the latency spikes?
Maybe it is GC or compaction in the background?
Or is the bandwidth simply saturated? That does not seem likely, since I have set the stream limit to 1 MB/s.
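For what it's worth, those suspects can at least be observed directly while a spike happens, e.g. with (subcommand availability depends on the Cassandra version):

    nodetool compactionstats    # pending/active compactions
    nodetool netstats           # streaming progress to the new nodes
    nodetool tpstats            # thread-pool pending/blocked counts

GC pauses would show up in each node's JVM GC logs.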
I have a Java server running on a beefy server machine (many cores, 64 GB RAM, etc.), and I submit some workloads to it in a test scenario; I submit one workload, exactly the same one, 10 times in a row in each test. On one particular workload, I observe that in the middle of the 10 runs it takes much longer to complete (e.g. runs 1-2: 10 s, run 3: 12 s, run 4: 25 s, run 5: 10 s, etc.).

In a YourKit wall-time profile taken on the server, I see no increase in IO, GC, network, or pretty much anything during the slowdown; no particular method increases its proportion of time spent, every method is just slower, roughly in proportion. What I do see is that average CPU usage decreases (presumably because the same work is spread over more time), but kernel CPU usage increases, from 0-2% on the faster runs to 9-12% on the slow one. Kernel usage crawls slowly up from the end of the previous run (which is slightly slower), stays high, then drops between the slow run and the next one (there's a pause). I cannot map this kernel CPU time to any calls in YourKit.
Does anyone have an idea what this could be? Or can you suggest further avenues of investigation that might show where the kernel time goes?
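One way to see where the kernel time goes, outside the JVM profiler, is to sample the process with Linux perf while a slow run is in progress (the PID is a placeholder):

    perf top -p <pid>
    perf record -g -p <pid> -- sleep 30
    perf report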
I'm trying to improve query performance. It takes an average of about 3 seconds for simple queries which don't even touch a nested document, and it's sometimes longer.
curl "http://searchbox:9200/global/user/_search?n=0&sort=influence:asc&q=user.name:Bill%20Smith"
Even without the sort it takes seconds. Here are the details of the cluster:
1.4 TB index size.
210M documents that aren't nested (about 10 KB each).
500M documents in total (nested documents are small: 2-5 fields).
About 128 segments per node.
3 nodes, m2.4xlarge (-Xmx set to 40 GB; machine memory is 60 GB).
3 shards.
Index is on Amazon EBS volumes.
Replication 0 (have tried replication 2 with only a little improvement).
I don't see any noticeable spikes in CPU/memory etc. Any ideas how this could be improved?
Garry's points about heap space are true, but it's probably not heap space that's the issue here.
With your current configuration, you'll have less than 60GB of page cache available, for a 1.5 TB index. With less than 4.2% of your index in page cache, there's a high probability you'll be needing to hit disk for most of your searches.
You probably want to add more memory to your cluster, and you'll want to think carefully about the number of shards as well. Just sticking to the default can cause skewed distribution. If you had five shards in this case, you'd have two machines with 40% of the data each, and a third with just 20%. In either case, you'll always be waiting for the slowest machine or disk when doing distributed searches. This article on Elasticsearch in Production goes a bit more in depth on determining the right amount of memory.
For this exact search example, you can probably use filters, though. You're sorting, thus ignoring the score calculated by the query. With a filter, it'll be cached after the first run, and subsequent searches will be quick.
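A sketch of what that could look like for the query above (this uses the pre-2.0 filtered-query syntax, which matches a cluster of this vintage; newer versions express the same thing as a bool query with a filter clause, and the term filter assumes user.name is mapped as not_analyzed):

    curl -XPOST "http://searchbox:9200/global/user/_search" -d '{
      "query": {
        "filtered": {
          "filter": { "term": { "user.name": "Bill Smith" } }
        }
      },
      "sort": [ { "influence": "asc" } ]
    }'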
Ok, a few things here:
Decrease your heap size: you have over 32 GB of heap dedicated to each Elasticsearch instance on each node. Java doesn't compress pointers above 32 GB. Drop your nodes to only 32 GB of heap and, if you need to, spin up another instance.
If spinning up another instance isn't an option and 32 GB on 3 nodes isn't enough to run ES, then you'll have to bump your heap memory to somewhere over 48 GB!
I would probably stick with the default settings for shards and replicas: 5 shards, 1 replica. However, you can tweak the shard settings to suit. What I would do is reindex the data into several indices under several different conditions: the first index would have only 1 shard, the second index 2 shards, and so on all the way up to 10 shards. Query each index and see which performs best. If the 10-shard index is the best-performing one, keep increasing the shard count until you get worse performance; then you've hit your shard limit.
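Creating each of those test indices is just a settings call at index-creation time; the index name here is only illustrative:

    curl -XPUT "http://searchbox:9200/global_2shards" -d '{
      "settings": { "number_of_shards": 2, "number_of_replicas": 1 }
    }'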
One thing to think about, though: sharding might increase search performance, but it also has a massive effect on indexing time. The more shards, the longer it takes to index a document...
You also have quite a bit of data stored; maybe you should look at custom routing too.
I currently have a program that benefits greatly from multithreading. It starts n threads, and each thread does 100M iterations. They all use shared memory, but there is no synchronization at all.
It approximates solutions to some equations; the current benchmarks are:
1 thread: precision 1 time: 150s
4 threads: precision 4 time: 150s
16 threads: precision 16 time: 150s
32 threads: precision 32 time: 210s
64 threads: precision 64 time: 420s
(Higher precision is better)
I use an Amazon EC2 'Cluster Compute Eight Extra Large Instance', which has 2 × Intel Xeon E5-2670.
As far as I understand, it has 16 real cores, so the program shows linear improvement up to 16 threads.
It also has 2× 'hyper-threading', and my program gains somewhat from this; using more than 32 threads obviously gives no improvement.
These benchmarks prove that access to RAM is not the 'bottleneck'.
Also I ran the program on Intel Xeon E5645 which has 12 real cores. Results are:
1 thread: precision 1 time: 150s
4 threads: precision 4 time: 150s
12 threads: precision 12 time: 150s
24 threads: precision 24 time: 220s
precision/(time × number of threads) is similar to the Amazon machine, which is not clear to me, because each Xeon E5-2670 core should be ~1.5× faster according to CPU MHz (~1600 vs ~2600) and according to the 'Passmark CPU Mark' numbers at http://www.cpubenchmark.net/cpu_list.php adjusted for the number of cores.
Why does using a faster processor not improve single-threaded performance, while increasing the number of threads does?
Is it possible to rent a server with multiple CPUs more powerful than 2 × Intel Xeon E5-2670, still using shared RAM, so I can run my program without any changes and get better results?
Update:
13 threads on the Xeon E5645 take 196 seconds.
The algorithm randomly explores a tree which has 3500 nodes; the height of the tree is 7. Each node contains 250 doubles, which are also randomly accessed (3500 nodes × 250 doubles × 8 bytes ≈ 7 MB of node data). It is very likely that almost no data is cached.
Specs on the two Intel CPUs you've listed:
E5-2670 - 2.6 GHz minimum [8 active cores] (3.3 GHz turbo on a single core)
E5645 - 2.4 GHz minimum [6 active cores] (2.8 GHz turbo on a single core)
So there is at least one important question to ask yourself here:
Why isn't your app faster on a single core? There is more of a clock-speed drop from scaling 1 core up to 8 active cores on the E5-2670 than there is from switching to the E5645. You shouldn't see a perfectly linear progression from 1 to 16 threads, even if your app has zero inter-thread locks: all current-gen CPUs drop their clock rate as more threads are added to their workload.
The answer is probably not RAM, at least not in a basic throughput sense, but it might well be the L1/L2 caches. The L1/L2 caches are much more important for application performance than RAM throughput. Modern Intel CPUs are designed around the idea that L1/L2 cache hit rates will likely be good (if not great). If the L1/L2 caches are rendered useless by an algorithm that's churning through megabytes of memory without a frequent reuse pattern, then the CPU will essentially become bottlenecked by RAM latency.
RAM Latency is Not RAM Throughput
While the throughput of RAM is probably plenty to keep up with all your threads over time, the latency is not. Latency when reading from RAM is 80-120 cycles, depending on the CPU clock multiplier. By comparison, latency reading from L1 is 3 cycles, and from L2 it is 11-12 cycles. Therefore, if some portion of your algorithm always results in a fetch from RAM, then that portion will always take a very long time to execute, and approximately the same time on different CPUs, since the RAM latency will be about the same. 100 cycles on a Xeon is long enough that even a single stall against RAM can become the dominant hot-spot in an algorithm (consider that these chips average 3 instructions per cycle).
I do not know whether this is the actual bottleneck in your application, since I don't know how much data it processes on each iteration or what RAM access patterns it uses. But it is one of the only explanations for a constant-time algorithm across many thread configurations and across different Xeon CPUs.
(Edit: There's also a shared L3 cache on these Xeon chips, but its helpfulness is pretty limited. The latency of L3 accesses is 50-60 cycles: better than RAM, but not by much. And the chance of hitting L3 is pretty slim if both L1/L2 are already ineffective. As mentioned before, these chips are designed with high L1/L2 hit rates in mind: the L3 cache is arranged to complement occasional misses from L1/L2, and does not do well serving data as a primary cache itself.)
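As a rough illustration of the latency argument (a deliberately naive sketch, not a proper benchmark: no JIT warm-up, a single run, and it assumes enough heap for ~100 MB of arrays), chasing randomly ordered indices through an array much larger than the caches is dominated by RAM latency, while a linear pass over the same data is limited mostly by throughput:

    import java.util.Random;

    public class PointerChaseDemo {
        public static void main(String[] args) {
            final int n = 1 << 23;            // 8M entries (~64 MB of longs), far larger than L1/L2/L3
            long[] next = new long[n];
            int[] perm = new int[n];
            for (int i = 0; i < n; i++) perm[i] = i;
            Random rnd = new Random(42);
            for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle of the indices
                int j = rnd.nextInt(i + 1);
                int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
            }
            // Link each index to its successor in shuffled order, forming one big random cycle.
            for (int i = 0; i < n; i++) next[perm[i]] = perm[(i + 1) % n];

            long t0 = System.nanoTime();
            long idx = 0, sum = 0;
            for (int i = 0; i < n; i++) {     // dependent loads: each access waits on the previous one
                idx = next[(int) idx];
                sum += idx;
            }
            long t1 = System.nanoTime();
            long linear = 0;
            for (int i = 0; i < n; i++) linear += next[i];  // sequential pass: prefetch-friendly
            long t2 = System.nanoTime();

            System.out.printf("random chase: %d ms, linear scan: %d ms (checksums %d / %d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum, linear);
        }
    }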
Two tips:
1) Set the number of threads to the number of cores + 1.
2) CPU clock speed tells you little; the speed and size of the L1/L2 CPU caches matter, and so does the memory. (My quad-core is nominally 20% faster than my dual-core laptop, but in reality, with a single-threaded CPU-heavy application, it is 400-800% faster, due to faster memory, CPU design, cache, etc.)
Servers' processing power is often lower than that of a private PC, because they are designed more for robustness and 24-hour uptime.
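A minimal sketch of tip 1), using plain java.util.concurrent; note that availableProcessors() reports logical (hyper-threaded) cores as seen by the JVM, not physical cores:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int logicalCores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(logicalCores + 1);
            // submit the 100M-iteration worker tasks here, then:
            pool.shutdown();
        }
    }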