I have a 6-node cluster running ES 5.4 with 4B small documents indexed so far.
Documents are organized in ~9K indexes, for a total of 2TB. Index sizes vary from a few KB to hundreds of GB, and they are sharded so as to keep each shard under 20GB.
Cluster health query responds with:
{
  "cluster_name": "##########",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 6,
  "active_primary_shards": 9014,
  "active_shards": 9034,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
Before I send any search traffic to the cluster it is stable, and it receives a bulk index request every second containing anywhere from ten to a few thousand documents, with no problem.
Everything is fine until I redirect some traffic to this cluster.
As soon as it starts to respond, the majority of the servers start reading from disk at 250 MB/s, making the cluster unresponsive:
What is strange is that I cloned this ES configuration on AWS (same hardware, same Linux kernel, but a different Linux version) and there I have no problem:
NB: 40 MB/s of disk read is what I have always seen on servers that are serving traffic.
Relevant Elasticsearch 5 configurations are:
-Xms12g and -Xmx12g in jvm.options
I also tested it with the following configurations, but without success:
bootstrap.memory_lock:true
MAX_OPEN_FILES=1000000
Each server has 16 CPUs and 32GB of RAM; some run Debian Jessie 8.7, others Jessie 8.6; all have kernel 3.16.0-4-amd64.
I checked the query cache on each node with localhost:9200/_nodes/stats/indices/query_cache?pretty&human and all the servers have similar statistics: cache size, cache hits, misses and evictions.
It doesn't seem to be a warm-up operation, since I never see this behavior on the cloned AWS cluster, and also because it never ends.
I can't find useful information under /var/log/elasticsearch/*.
Am I doing anything wrong?
What should I change in order to solve this problem?
Thanks!
You probably need to reduce the number of threads for searching.
Try going with 2x the number of processors. In elasticsearch.yml:
thread_pool.search.size: <size>
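For example, on the 16-CPU nodes described in the question, the 2x-processors suggestion would be a single static setting in each node's elasticsearch.yml; the value 32 below only illustrates that rule of thumb, it is not a tested figure:
thread_pool.search.size: 32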
Also, that sounds like too many shards for a 6 node cluster. If possible, I would try reducing that.
The max content of an HTTP request defaults to 100mb (this is the http.max_content_length setting; if set to greater than Integer.MAX_VALUE, it is reset to 100mb).
"servers start reading from disk at 250 MB/s making the cluster unresponsive" -- the cluster will become unresponsive and you might see log entries related to this. Check the maximum size of the requests being sent to the indices.
See the Elasticsearch HTTP module settings for details.
A few things:
1. 5.x has been EOL for years now, please upgrade as a matter of urgency
2. you are heavily oversharded
For point 2, you need to either:
upgrade to handle that many shards; the memory management in 7.X is far superior
reduce your shard count by reindexing (a sketch follows this list)
add more nodes to deal with the load
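For the reindexing route, a hedged sketch using the _reindex API available in 5.x; the index names are placeholders, the idea being to fold many small indices into fewer, larger ones:
curl -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d '{
  "source": { "index": ["small-index-000001", "small-index-000002"] },
  "dest":   { "index": "merged-index-000001" }
}'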
Related
I have a single node ElasticSearch cluster that has one index (I know, my bad) where I inserted 2B documents.
I did not know it was a best practice to split indices and mine grew to 400GB before it crashed.
I tried splitting my index with the split index API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html) and I keep getting java.lang.OutOfMemoryError no matter what I do. I have maxed out my physical memory, and the threads just get stuck in the _split.
The source files were deleted by Logstash once they had been successfully indexed, so re-inserting the data is not an option.
Any suggestions?
Add swap space or increase RAM of that server.
I'm still confused as to where you got 2 Billion documents :/
Never use swap on ES machines.
Use https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html to check the status of the split.
Also, did you change the max heap size in the JVM config for ES? https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
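If you do raise the heap, a minimal jvm.options sketch is the two lines below; 16g is purely illustrative (keep -Xms and -Xmx equal, at no more than about half the machine's RAM, and under ~32g to keep compressed object pointers):
-Xms16g
-Xmx16g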
We're experiencing issues with continuously running Java applications that update counters in Cassandra. From monitoring the load of the servers we don't see any correlation with the load. The queries are quite constant, because they update values in only 8 different tables. Every minute the Java applications fire thousands of queries (it can be 20k or even 50k queries), but every once in a while some of them fail. When that happens, we write them to a file along with the exception message. This message is always:
Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
We did some googling and troubleshooting and took several actions:
Changed the retry policy in the java applications to DefaultRetryPolicy instead of the FallthroughRetryPolicy, to have the client retry a query on failure.
Changed the write_request_timeout_in_ms setting on the Cassandra nodes from the standard value of 2000 to 4000 and then to 10000.
These actions diminished the number of failing queries, but they still occur. From the millions of queries that are executed on an hourly basis, we see about 2000 failing queries over a period of 24 hours. All have the same exception listed above, and they occur at varying times.
Of course, we see from the logs that when queries do fail, they take a while, because the client waits for the timeout and performs retries.
Some facts:
We run Cassandra v2.2.5 (recently upgraded from v2.2.4)
We have a geo aware Cassandra cluster with 6 nodes: 3 in Europe, 3 in US.
The java applications that fire queries are the only clients that communicate with Cassandra (for now).
The number of java applications is 10: 5 in EU, 5 in US.
We execute all queries asynchronously (session.executeAsync(statement);) and keep track of which individual queries succeed or fail by adding callbacks for success and failure (a minimal sketch of this pattern follows this list of facts).
The replication factor is 2.
We run Oracle Java 1.7.0_76 Java(TM) SE Runtime Environment (build 1.7.0_76-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
The 6 Cassandra nodes run on bare metal with the following specs:
Storage is a group of SSDs in RAID 5.
Each node has 2x (6 core) Intel Xeon E5-2620 CPUs @ 2.00GHz (totalling 24 hardware threads).
The RAM size is 128GB.
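For reference, the executeAsync-plus-callbacks pattern mentioned in the facts above looks roughly like this with the DataStax Java driver and its bundled Guava; this is only a sketch, session and statement are assumed to already exist, and the callback bodies are placeholders for the application's real bookkeeping:
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

// Fire the write asynchronously and track the outcome of this individual statement.
ResultSetFuture future = session.executeAsync(statement);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    @Override
    public void onSuccess(ResultSet rs) {
        // e.g. count the statement as acknowledged
    }
    @Override
    public void onFailure(Throwable t) {
        // e.g. write the statement and t.getMessage() to the failure file
    }
});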
How we create the cluster:
private Cluster createCluster() {
return Cluster.builder()
.addContactPoints(contactPoints)
.withRetryPolicy(DefaultRetryPolicy.INSTANCE)
.withLoadBalancingPolicy(getLoadBalancingPolicy())
.withReconnectionPolicy(new ConstantReconnectionPolicy(reconnectInterval))
.build();
}
private LoadBalancingPolicy getLoadBalancingPolicy() {
return DCAwareRoundRobinPolicy.builder()
.withUsedHostsPerRemoteDc(allowedRemoteDcHosts) // == 3
.build();
}
How we create the keyspace:
CREATE KEYSPACE IF NOT EXISTS traffic WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'AMS1': 2, 'WDC1': 2};
Example table (they all look similar)
CREATE TABLE IF NOT EXISTS traffic.per_node (
node text,
request_time timestamp,
bytes counter,
ssl_bytes counter,
hits counter,
ssl_hits counter,
PRIMARY KEY (node, request_time)
) WITH CLUSTERING ORDER BY (request_time DESC)
AND compaction = {'class': 'DateTieredCompactionStrategy'};
Many remarks:
First, for the Cluster config, you should specify the local DC name (see the sketch after this list of remarks)
You should use LOCAL_ONE instead of ONE as the consistency level, to enhance data locality
DO NOT change the write_request_timeout_in_ms value. You're just sweeping issues under the carpet, your real issue is not the timeout setting
What is your Replication Factor ?
"Every minute the java applications fires thousands of queries (can be 20k or even 50k queries)" --> simple math gives me ~300 inserts/sec per node, under the assumption that RF=1. It is not that huge, but your inserts may be limited by hardware. What is your CPU config (number of cores) and disk type (spinning disk or SSD)?
Do you throttle the async inserts? E.g. fire them in batches of N inserts and wait a little bit for the cluster to breathe. See my answer here for throttling: What is the best way to get backpressure for Cassandra Writes?
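A minimal sketch combining the local-DC, LOCAL_ONE and throttling remarks above, reusing the fields from the question's createCluster() (contactPoints, allowedRemoteDcHosts, reconnectInterval). The DC name "AMS1" comes from the keyspace definition; the permit count of 256 is an arbitrary illustration to tune for your cluster, and the two-argument Futures.addCallback assumes the Guava version bundled with the 2.x driver:
import java.util.concurrent.Semaphore;
import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.*;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

private Cluster createCluster() {
    return Cluster.builder()
            .addContactPoints(contactPoints)
            .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
            // name the local DC explicitly ("AMS1" for the EU clients, "WDC1" for the US ones)
            .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                    .withLocalDc("AMS1")
                    .withUsedHostsPerRemoteDc(allowedRemoteDcHosts)
                    .build())
            // make LOCAL_ONE the default consistency level instead of ONE
            .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
            .withReconnectionPolicy(new ConstantReconnectionPolicy(reconnectInterval))
            .build();
}

// throttle the async inserts: allow at most 256 writes in flight at any time
private final Semaphore inFlightWrites = new Semaphore(256);

private void throttledWrite(Session session, Statement statement) throws InterruptedException {
    inFlightWrites.acquire(); // blocks the producer once 256 writes are in flight
    Futures.addCallback(session.executeAsync(statement), new FutureCallback<ResultSet>() {
        @Override public void onSuccess(ResultSet rs) { inFlightWrites.release(); }
        @Override public void onFailure(Throwable t) { inFlightWrites.release(); /* existing failure logging goes here */ }
    });
}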
Recently I have been playing with Cassandra.
I have been experiencing latency spikes when adding nodes to Cassandra while nodetool stream limits are set on all existing C* nodes.
To be specific, the cluster originally has 4 C* nodes, and I add 2 more nodes once the original ones are warmed up, at 1200 s as shown in the figure.
The amount of data stored is 50 GB on these 4 nodes and the key size is 20 KB each.
Nodetool is used to set the 'stream limits' to 1MB/s.
YCSB is used to generate a read-dominant (90%) workload at 80% of the maximum throughput that can be reached by the 4 existing nodes, throughout the scale-up procedure.
The figure shows the service latency reported by YCSB every 10 seconds.
(Figure: time vs. read latency on C*)
Does anyone have an explanation for the latency spikes?
Maybe it is GC or compaction in the background?
Or is the bandwidth just saturated? That does not seem likely, since I have set the stream limit to 1 MB/s.
I have a three part topology that's having some serious latency issues but I'm having trouble figuring out where.
kafka -> db lookup -> write to cassandra
The numbers from the storm UI look like this:
(I see that the bolts are running at > 1.0 capacity)
If the process latency for the two bolts is ~65ms, why is the 'complete latency' > 400 sec? I suspect the 'failed' tuples are coming from timeouts, as the latency value is steadily increasing.
The components are connected via shuffleGrouping.
Cassandra lives on AWS so there are likely network limitations en route.
The storm cluster has 3 machines. There are 3 workers in the topology.
Your topology has several problems:
1. Look at the capacity of the decode_bytes_1 and save_to_cassandra bolts. Both are over 1 (the sum of all the bolts' capacity should be under 1), which means you are using more resources than you have available. That is, the topology can't handle the load.
2. TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day, that is, if you have peak hours and the topology can catch up during the off-peak hours.
3. You need to increase the number of worker machines or optimize the code in the bottleneck bolts (or maybe both). Otherwise you will not be able to process all the tuples.
4. You can probably improve the Cassandra persister by inserting in batches instead of inserting tuples one by one...
5. I seriously recommend you always set TOPOLOGY_MAX_SPOUT_PENDING to a conservative value (a sketch follows below). Max spout pending means the maximum number of un-acked tuples inside the topology; remember this value is multiplied by the number of spouts, and tuples will time out (fail) if they are not acknowledged within 30 seconds of being emitted.
And yes, your problem is tuples timing out; that is exactly what is happening.
(EDIT) If you are running in a dev environment (or just after deploying the topology) you might experience a spike in the traffic generated by messages that were not yet consumed by the spout. It's important to prevent this case from negatively affecting your topology -- you never know when you need to restart the production topology or perform some maintenance. If this is the case, you can handle it as a temporary spike in the traffic -- the spout needs to consume all the messages produced while the topology was off-line -- and after some minutes the frequency of incoming tuples stabilizes. You can handle this with the max spout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster and a CPU usage of 0.1, you can add more executors to the bolts.
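A rough sketch of how items 2 and 5, plus the executor suggestion just above, could look when the topology is built. It assumes a recent Storm where the classes live under org.apache.storm (older releases used backtype.storm); the spout/bolt class names, the parallelism hints and the value 500 are placeholders, not the real topology code:
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class TopologyRunner {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setNumWorkers(3);        // one worker per machine in the 3-node cluster
        conf.setMaxSpoutPending(500); // cap on un-acked tuples per spout (items 2 and 5)

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka_spout", new MyKafkaSpout(), 1);     // placeholder spout class
        builder.setBolt("decode_bytes_1", new DecodeBytesBolt(), 6) // higher parallelism hints give
                .shuffleGrouping("kafka_spout");                    // the bolts more executors
        builder.setBolt("save_to_cassandra", new SaveToCassandraBolt(), 6)
                .shuffleGrouping("decode_bytes_1");

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}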
FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500:
I'm trying to improve query performance. It takes an average of about 3 seconds for simple queries which don't even touch a nested document, and it's sometimes longer.
curl "http://searchbox:9200/global/user/_search?n=0&sort=influence:asc&q=user.name:Bill%20Smith"
Even without the sort it takes seconds. Here are the details of the cluster:
1.4TB index size.
210m documents that aren't nested (about 10 KB each)
500m documents in total. (nested documents are small: 2-5 fields).
About 128 segments per node.
3 nodes, m2.4xlarge (-Xmx set to 40g, machine memory is 60g)
3 shards.
Index is on amazon EBS volumes.
Replication 0 (I have tried replication 2 with only a little improvement)
I don't see any noticeable spikes in CPU/memory etc. Any ideas how this could be improved?
Garry's points about heap space are true, but it's probably not heap space that's the issue here.
With your current configuration, you'll have less than 60GB of page cache available, for a 1.5 TB index. With less than 4.2% of your index in page cache, there's a high probability you'll be needing to hit disk for most of your searches.
You probably want to add more memory to your cluster, and you'll want to think carefully about the number of shards as well. Just sticking to the default can cause skewed distribution. If you had five shards in this case, you'd have two machines with 40% of the data each, and a third with just 20%. In either case, you'll always be waiting for the slowest machine or disk when doing distributed searches. This article on Elasticsearch in Production goes a bit more in depth on determining the right amount of memory.
For this exact search example, you can probably use filters, though. You're sorting, thus ignoring the score calculated by the query. With a filter, it'll be cached after the first run, and subsequent searches will be quick.
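For example, the search from the question could be rewritten as a filtered query (1.x-era syntax, matching the setup described above). The sketch assumes user.name is analyzed with the standard analyzer, so the indexed terms are lowercased; otherwise the terms would need adjusting:
curl "http://searchbox:9200/global/user/_search" -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "user.name": "bill" } },
            { "term": { "user.name": "smith" } }
          ]
        }
      }
    }
  },
  "sort": [ { "influence": { "order": "asc" } } ]
}'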
Ok, a few things here:
Decrease your heap size. You have a heap size of over 32gb dedicated to each Elasticsearch instance on each platform, and Java doesn't compress pointers over 32gb. Drop your nodes to only 32gb and, if you need to, spin up another instance.
If spinning up another instance isn't an option and 32gb on 3 nodes isn't enough to run ES, then you'll have to bump your heap memory to somewhere over 48gb!
I would probably stick with the default settings for shards and replicas: 5 shards, 1 replica. However, you can tweak the shard settings to suit. What I would do is reindex the data into several indices under several different conditions: the first index would have only 1 shard, the second index would have 2 shards, and so on all the way up to 10 shards. Query each index and see which performs best. If the 10-shard index is the best performing one, keep increasing the shard count until you get worse performance; then you've hit your shard limit.
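A sketch of how each test index in that experiment could be created with an explicit shard count before reindexing into it; the index name and the settings values are made up for illustration:
curl -XPUT "http://searchbox:9200/global_test_2shards" -d '{
  "settings": { "number_of_shards": 2, "number_of_replicas": 1 }
}'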
One thing to think about though, sharding might increase search performance but it also has a massive effect on index time. The more shards the longer it takes to index a document...
You also have quite a bit of data stored, maybe you should look at Custom Routing too.