Recently I have been playing with Cassandra.
I have been experiencing latency spikes when adding nodes to Cassandra while nodetool stream throughput limits are set on all existing C* nodes.
To be specific, the cluster originally has 4 C* nodes, and I add 2 more nodes once the original ones are warmed up, at 1200 s as shown in the figure.
The amount of data stored on these 4 nodes is 50 GB, and each key is 20 KB.
Nodetool is used to set the stream throughput limit to 1 MB/s.
YCSB is used to generate a read-dominant (90%) workload at 80% of the maximum throughput that can be reached by the 4 existing nodes, throughout the scale-up procedure.
The figure shows the service latency reported by YCSB every 10 seconds.
(Figure: time vs. read latency on C*)
Does anyone have an explanation for the latency spike?
Is it perhaps GC or compaction running in the background?
Or is the bandwidth simply saturated? That does not seem likely, since I have set the stream throughput limit to 1 MB/s.
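For reference, a minimal sketch of how such a cap would be applied with nodetool; the host names are placeholders, and note that setstreamthroughput works in megabits per second, so an intended 1 MB/s corresponds to a value of roughly 8.
# Sketch only: apply and verify the streaming cap on every existing node
# before bootstrapping the new ones. Host names are placeholders.
for host in cass1 cass2 cass3 cass4; do
    nodetool -h "$host" setstreamthroughput 8   # ~8 Mb/s ≈ 1 MB/s
    nodetool -h "$host" getstreamthroughput
done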
Related
I have an AWS-hosted Elasticsearch cluster, which fails when the heap reaches 75% and the (CMS) garbage collector runs.
The cluster runs ES version 7.9 with 3 dedicated Master nodes (r5.large.elasticsearch) and 4 Data nodes (r5.xlarge.elasticsearch).
That is:
4 vCPU / 32 GB instance per Data Node (16 GB heap), with 1 TB of SSD storage each, for a total of 4 TB of storage.
2 vCPU / 16 GB instance per Master node
The cluster holds 33 indices with 1-3 primary shards each and 0-1 replicas (0 for the older ones), with sizes ranging from 50 MB to 60 GB per shard; in general each shard stores about 30 GB.
So about 65 shards in total.
Whenever the JVM Memory Pressure goes up to 75% and the Garbage Collector (GC) runs, we start to get timeouts, and the node running the GC goes down for a moment and then comes back up, causing shard reallocation, more timeouts, and increased indexing and search latencies.
Checking the error logs we could see a lot of:
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [6.4s] collecting in the last [7.2s]
[WARN ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][2315905] overhead, spent [3.6s] collecting in the last [4.4s]
...
At peak hours our indexing rate is about 4k operations/min and the search rate is 1k operations/min.
The GC runs about 3 times a day per data node, roughly 12 times a day across the cluster, and the maximum heap percentage among the 4 data nodes oscillates between 35% and 75%; it never goes above 75%. When the GC is not running, CPU stays consistently at an average of 13%-15%, so we're highly confident that the instance size is appropriate for our current traffic.
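As a first check (a sketch only, assuming the standard nodes stats API is reachable; the domain endpoint below is a placeholder), the per-node heap and GC counters can be pulled and lined up with the timeout incidents:
# Placeholder endpoint; heap_used_percent and the old-generation collection
# counters are the numbers to correlate with each incident.
curl -s 'https://your-es-domain-endpoint/_nodes/stats/jvm?pretty' \
  | grep -E 'heap_used_percent|collection_count|collection_time_in_millis'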
We followed some guides on how to avoid node crashes, but the crashes persist even though:
Rarely aggregate on text fields.
No complex aggregations.
Shards are evenly distributed, and the number of shards per index seems to be correct.
Very small number of wildcard queries, which are manually triggered.
All the documents are small-medium sized (500 - 1000 characters).
So, any ideas on what could possibly be causing these crashes and long GC runs?
I found some related questions with no answer, such as this one.
So I have a job that does in-mapper computing. With each task taking about 0.08 seconds, a 360,026-line file would take about 8 hours to process if it were done on one node. File sizes will generally be about the size of 1-2 blocks (often 200 MB or less).
Assuming the code is optimized, is there any way to tweak the settings? Should I be using a smaller block size, for example? I am currently using AWS EMR with c4.large instances and autoscaling on YARN, but it only went up to 4 extra task nodes, as the load wasn't too high. Even though YARN memory wasn't too high, it still took over 7 hours to complete (which is way too long).
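One knob worth experimenting with, as a sketch only: shrinking the maximum input split size so a single ~200 MB file fans out into more map tasks. The jar and class names below are hypothetical, and the job must go through ToolRunner/GenericOptionsParser for the -D option to take effect.
# Hypothetical jar/class; 16 MB splits turn a ~200 MB file into ~13 map tasks
# instead of 1-2, giving YARN more containers to schedule in parallel.
hadoop jar my-job.jar com.example.MyMapOnlyJob \
  -D mapreduce.input.fileinputformat.split.maxsize=16777216 \
  s3://my-bucket/input/ s3://my-bucket/output/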
I have a cluster of 6 nodes running ES 5.4, with 4 billion small documents indexed so far.
The documents are organized in ~9K indexes, for a total of 2 TB. The indexes' sizes vary from a few KB to hundreds of GB, and they are sharded in order to keep each shard under 20 GB.
Cluster health query responds with:
{
cluster_name: "##########",
status: "green",
timed_out: false,
number_of_nodes: 6,
number_of_data_nodes: 6,
active_primary_shards: 9014,
active_shards: 9034,
relocating_shards: 0,
initializing_shards: 0,
unassigned_shards: 0,
delayed_unassigned_shards: 0,
number_of_pending_tasks: 0,
number_of_in_flight_fetch: 0,
task_max_waiting_in_queue_millis: 0,
active_shards_percent_as_number: 100
}
Before I send any query traffic to the cluster it is stable, and it receives a bulk index request every second with ten to a few thousand documents, with no problem.
Everything is fine until I redirect some traffic to this cluster.
As soon as it starts to respond to that traffic, the majority of the servers start reading from disk at 250 MB/s, making the cluster unresponsive.
What is strange is that I cloned this ES configuration on AWS (same hardware, same Linux kernel, but different Linux version) and there I have no problem.
NB: 40 MB/s of disk read is what I always had on servers that are serving traffic.
Relevant Elasticsearch 5 configurations are:
-Xms12g -Xmx12g in jvm.options
I also tested it with the following configurations, but without success:
bootstrap.memory_lock:true
MAX_OPEN_FILES=1000000
Each server has 16 CPUs and 32 GB of RAM; some run Debian Jessie 8.7, others 8.6; all have kernel 3.16.0-4-amd64.
I checked the query cache on each node with localhost:9200/_nodes/stats/indices/query_cache?pretty&human, and all the servers have similar statistics: cache size, cache hits, misses, and evictions.
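For completeness, a sketch of the extra diagnostics that could be captured while the reads spike (hot_threads is a standard Elasticsearch API; iostat comes from the sysstat package):
# Snapshot what the busy nodes are doing during the 250 MB/s reads, and
# confirm which device is saturated.
curl -s 'localhost:9200/_nodes/hot_threads?threads=5&interval=1s' > hot_threads.txt
iostat -xm 2 5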
It doesn't seem to be a warm-up operation, since I never see this behavior on the cloned AWS cluster, and also because it never ends.
I can't find useful information under /var/log/elasticsearch/*.
Am I doing anything wrong?
What should I change in order to solve this problem?
Thanks!
You probably need to reduce the number of threads for searching.
Try going with 2x the number of processors. In elasticsearch.yml (the setting is named thread_pool.search.size in 5.x):
thread_pool.search.size: <size>
Also, that sounds like too many shards for a 6 node cluster. If possible, I would try reducing that.
The max content of an HTTP request (the http.max_content_length setting) defaults to 100mb; if it is set to greater than Integer.MAX_VALUE, it will be reset to 100mb.
Regarding "servers start reading from disk at 250 MB/s making the cluster unresponsive": the cluster will become unresponsive and you might see logs related to this. Check the maximum read size of the indices.
Check the Elasticsearch HTTP settings.
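If that setting turns out to be relevant, a minimal sketch of overriding it, assuming a standard package layout; http.max_content_length is a static setting, so it lives in elasticsearch.yml and needs a node restart:
# /etc/elasticsearch/elasticsearch.yml (path is an assumption; value is illustrative)
http.max_content_length: 200mb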
A few things:
5.x has been EOL for years now, please upgrade as a matter of urgency
you are heavily oversharded
For point 2, you either need to:
upgrade to handle that amount of shards, the memory management in 7.X is far superior
reduce your shard count by reindexing (see the sketch after this list)
add more nodes to deal with the load
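For the reindexing route, a minimal sketch; the index names and shard counts are made up, and _reindex is available in 5.x:
# Create a destination index with far fewer primary shards, then copy the data.
curl -s -X PUT 'localhost:9200/myindex-v2' -H 'Content-Type: application/json' -d '
{ "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }'
curl -s -X POST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{ "source": { "index": "myindex" }, "dest": { "index": "myindex-v2" } }'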
I will be hosting my Cassandra database on Google Cloud. Instances are priced in a linear fashion, meaning 1 CPU with 2 GB RAM is $1, 2 CPUs with 4 GB is $2, 4 CPUs with 8 GB is $4, and so on.
I am deciding on the size of my instances and am not sure what the standard is. I was thinking of using more, smaller instances (2 CPU, 4 GB) as opposed to fewer, larger ones (8 CPU, 64 GB). My thought process is that with more instances, each node carries less of the overall data, which means a smaller impact if a node fails. As well, the OS of these smaller instances would have less overhead because it would accept fewer connections.
These are pros, but here are some cons I can think of:
1) Each instance will be less utilized
2) Cassandra + JVM overhead on so many instances can add up to a lot of overhead.
3) I will be using local SSDs as opposed to persistent SSDs, which are much more expensive; this means each instance will need its own local SSD, which raises costs.
These are some reasons I can think of. Are there any other pros/cons of choosing more, smaller instances vs. fewer, larger ones for a Cassandra database (or maybe even for nodes in general)? Are there any best practices for choosing Cassandra server sizes?
PS: I added the 'Java' tag because Cassandra is built in Java and runs on the JVM, and I would like to see if the JVM brings any pros/cons of its own.
I think you've hit some of the tradeoff points, but here are a few other things:
As the amount of data stored on a single node increases, the cost of bootstrapping (adding new nodes) increases. For instance, you'll get reasonable bootstrapping times storing 100 GB per node, but the process will take eons with 10 TB per node.
SSD usage makes this less important, but consider using separate physical disks for your commitlog and data (see the sketch after these points).
Configurations with fewer than 4 cores or less than 8 GB of memory are usually not recommended, but your mileage may vary.
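For the commitlog/data separation mentioned above, a sketch of the relevant cassandra.yaml entries; the mount points are hypothetical, and the existing keys should be edited in place rather than duplicated:
# cassandra.yaml: keep sequential commitlog writes off the data disk.
commitlog_directory: /mnt/disk1/cassandra/commitlog
data_file_directories:
    - /mnt/disk2/cassandra/data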
I'm trying to improve query performance. Simple queries that don't even touch a nested document take an average of about 3 seconds, and sometimes longer.
curl "http://searchbox:9200/global/user/_search?n=0&sort=influence:asc&q=user.name:Bill%20Smith"
Even without the sort it takes seconds. Here are the details of the cluster:
1.4TB index size.
210m documents that aren't nested (about 10 KB each)
500m documents in total. (nested documents are small: 2-5 fields).
About 128 segments per node.
3 nodes, m2.4xlarge (-Xmx set to 40g, machine memory is 60g)
3 shards.
Index is on Amazon EBS volumes.
Replication 0 (I have tried replication 2 with only a little improvement)
I don't see any noticeable spikes in CPU/memory etc. Any ideas how this could be improved?
Garry's points about heap space are true, but it's probably not heap space that's the issue here.
With your current configuration, you'll have less than 60GB of page cache available, for a 1.5 TB index. With less than 4.2% of your index in page cache, there's a high probability you'll be needing to hit disk for most of your searches.
You probably want to add more memory to your cluster, and you'll want to think carefully about the number of shards as well. Just sticking to the default can cause skewed distribution. If you had five shards in this case, you'd have two machines with 40% of the data each, and a third with just 20%. In either case, you'll always be waiting for the slowest machine or disk when doing distributed searches. This article on Elasticsearch in Production goes a bit more in depth on determining the right amount of memory.
For this exact search example, you can probably use filters, though. You're sorting, thus ignoring the score calculated by the query. With a filter, it'll be cached after the first run, and subsequent searches will be quick.
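A sketch of what that could look like with a 1.x-era filtered query; the field names come from the question's URL, the value is illustrative, and a term filter is not analyzed, so it must match the indexed token exactly:
# Scoring is skipped and the term filter is cached after the first run.
curl 'http://searchbox:9200/global/user/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "user.name": "bill" } }
    }
  },
  "sort": [ { "influence": "asc" } ]
}'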
Ok, a few things here:
Decrease your heap size: you have over 32 GB of heap dedicated to each Elasticsearch instance, and Java doesn't compress pointers above 32 GB. Drop your nodes to at most 32 GB of heap and, if you need to, spin up another instance.
If spinning up another instance isn't an option and 32 GB on 3 nodes isn't enough to run ES, then you'll have to bump your heap memory to somewhere over 48 GB!
I would probably stick with the default settings for shards and replicas: 5 shards, 1 replica. However, you can tweak the shard settings to suit. What I would do is reindex the data into several indices under several different conditions. The first index would have only 1 shard, the second index would have 2 shards, and so on all the way up to 10 shards. Query each index and see which performs best. If the 10-shard index is the best performing one, keep increasing the shard count until you get worse performance; then you've hit your shard limit.
One thing to think about though, sharding might increase search performance but it also has a massive effect on index time. The more shards the longer it takes to index a document...
You also have quite a bit of data stored; maybe you should look at Custom Routing too.
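If custom routing fits the data model, a sketch; the routing key, document id, and field values are made up, and the same routing value has to be supplied at index and search time so that only one shard is touched:
# Index with an explicit routing key...
curl -X PUT 'http://searchbox:9200/global/user/1?routing=bill' -d '
{ "name": "Bill Smith", "influence": 42 }'
# ...then search with the same routing key to hit a single shard.
curl 'http://searchbox:9200/global/user/_search?routing=bill&q=user.name:Bill%20Smith'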