Sporadically failing Cassandra queries - java

We're experiencing issues with continuously running java applications that update counters in Cassandra. From monitoring the load of the servers we don't see any correlation with load. The queries are quite constant, because they update values in only 8 different tables. Every minute the java applications fire thousands of queries (can be 20k or even 50k queries), but every once in a while some of those fail. When that happens we write them to a file, along with the exception message. This message is always
Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
We did some googling and troubleshooting and took several actions:
Changed the retry policy in the java applications to DefaultRetryPolicy instead of the FallthroughRetryPolicy, to have the client retry a query on failure.
Changed the write_request_timeout_in_ms setting on the Cassandra nodes from the standard value of 2000 to 4000 and then to 10000.
These actions diminished the number of failing queries, but they still occur. From the millions of queries that are executed on an hourly basis, we see about 2000 failing queries over a period of 24 hours. All have the same exception listed above, and they occur at varying times.
Of course we see from the logs that when queries do fail, they take a while, because the client waits for a timeout and performs retries.
Some facts:
We run Cassandra v2.2.5 (recently upgraded from v2.2.4)
We have a geo aware Cassandra cluster with 6 nodes: 3 in Europe, 3 in US.
The java applications that fire queries are the only clients that communicate with Cassandra (for now).
The number of java applications is 10: 5 in EU, 5 in US.
We execute all queries asynchronously (session.executeAsync(statement);) and keep track of individual queries by adding callbacks for success and failure (see the sketch after this list).
The replication factor is 2.
We run Oracle Java 1.7.0_76 Java(TM) SE Runtime Environment (build 1.7.0_76-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
The 6 Cassandra nodes run on bare metal with the following specs:
Storage is a group of SSDs in raid 5.
Each node has 2x (6 core) Intel Xeon E5-2620 CPUs @ 2.00GHz (totalling 24 hardware threads).
The RAM size is 128GB.
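To make the asynchronous pattern above concrete, here is a simplified sketch (not our actual application code) of executeAsync with success/failure callbacks, using the DataStax Java driver 2.x and Guava:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

public class CounterUpdater {
    private final Session session;

    public CounterUpdater(Session session) {
        this.session = session;
    }

    // Fires one counter update asynchronously and reacts to the outcome,
    // mirroring "executeAsync + success/failure callbacks".
    public void updateAsync(BoundStatement statement) {
        ResultSetFuture future = session.executeAsync(statement);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet rs) {
                // e.g. increment a success metric
            }

            @Override
            public void onFailure(Throwable t) {
                // e.g. write the failed statement and t.getMessage() to the error file
            }
        });
    }
}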
How we create the cluster:
private Cluster createCluster() {
    return Cluster.builder()
            .addContactPoints(contactPoints)
            .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
            .withLoadBalancingPolicy(getLoadBalancingPolicy())
            .withReconnectionPolicy(new ConstantReconnectionPolicy(reconnectInterval))
            .build();
}

private LoadBalancingPolicy getLoadBalancingPolicy() {
    return DCAwareRoundRobinPolicy.builder()
            .withUsedHostsPerRemoteDc(allowedRemoteDcHosts) // == 3
            .build();
}
How we create the keyspace:
CREATE KEYSPACE IF NOT EXISTS traffic WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'AMS1': 2, 'WDC1': 2};
Example table (they all look similar):
CREATE TABLE IF NOT EXISTS traffic.per_node (
    node text,
    request_time timestamp,
    bytes counter,
    ssl_bytes counter,
    hits counter,
    ssl_hits counter,
    PRIMARY KEY (node, request_time)
) WITH CLUSTERING ORDER BY (request_time DESC)
  AND compaction = {'class': 'DateTieredCompactionStrategy'};

Many remarks:
First, for the Cluster config, you should specify the local DC name.
You should use LOCAL_ONE instead of ONE as the consistency level, to enhance data locality.
DO NOT change the write_request_timeout_in_ms value. You're just sweeping issues under the carpet; your real issue is not the timeout setting.
What is your replication factor?
"Every minute the java applications fire thousands of queries (can be 20k or even 50k queries)" --> simple maths gives me ~300 inserts/sec per node, assuming RF=1. That is not huge, but your inserts may be limited by hardware. What is your CPU config (number of cores) and disk type (spinning disk or SSD)?
Do you throttle the async inserts? E.g. fire them in batches of N inserts and wait a little for the cluster to breathe (a sketch follows below). See my answer here for throttling: What is the best way to get backpressure for Cassandra Writes?
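Putting the first, second and last remarks together, a rough sketch (not your code; MAX_IN_FLIGHT and localDc are placeholders you would tune and fill in) could look like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Semaphore;

public class ThrottledWriter {
    private static final int MAX_IN_FLIGHT = 1024; // placeholder; tune for your cluster
    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    private final Session session;

    public ThrottledWriter(String[] contactPoints, String localDc) {
        Cluster cluster = Cluster.builder()
                .addContactPoints(contactPoints)
                .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc(localDc) // remark 1: name the local DC explicitly
                                .build()))
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)) // remark 2
                .build();
        this.session = cluster.connect();
    }

    // Remark on throttling: block when too many async writes are already in flight.
    public void write(Statement statement) throws InterruptedException {
        permits.acquire();
        ResultSetFuture future = session.executeAsync(statement);
        future.addListener(new Runnable() {
            @Override
            public void run() {
                permits.release();
            }
        }, MoreExecutors.sameThreadExecutor());
    }
}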

Related

Elasticsearch 5 stuck reading from disk

I have a cluster of 6 nodes running ES 5.4, with 4B small documents indexed so far.
The documents are organized in ~9K indexes, for a total of 2TB. The indexes' size varies from a few KB to hundreds of GB, and they are sharded in order to keep each shard under 20GB.
Cluster health query responds with:
{
  "cluster_name": "##########",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 6,
  "active_primary_shards": 9014,
  "active_shards": 9034,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
Before any search traffic is sent to the cluster, it is stable and handles a bulk index request every second of anywhere from ten to a few thousand documents with no problem.
Everything is fine until I redirect some traffic to this cluster.
As soon as it starts responding to queries, the majority of the servers start reading from disk at 250 MB/s, making the cluster unresponsive.
What is strange is that I cloned this ES configuration on AWS (same hardware, same Linux kernel, but a different Linux version) and there I have no problem.
NB: 40MB/s of disk read is what I have always seen on servers that are serving traffic.
Relevant Elasticsearch 5 configurations are:
-Xms12g -Xmx12g in jvm.options
I also tested it with the following configurations, but without success:
bootstrap.memory_lock:true
MAX_OPEN_FILES=1000000
Each server has 16 CPUs and 32GB of RAM; some run Debian Jessie 8.7, others Jessie 8.6; all have kernel 3.16.0-4-amd64.
I checked the cache on each node with localhost:9200/_nodes/stats/indices/query_cache?pretty&human and all the servers have similar statistics: cache size, cache hits, misses and evictions.
It doesn't seem to be a warm-up operation, since I never see this behaviour on the cloned AWS cluster, and also because it never ends.
I can't find useful information under /var/log/elasticsearch/*.
Am I doing anything wrong?
What should I change in order to solve this problem?
Thanks!
You probably need to reduce the number of threads for searching.
Try going with 2x the number of processors. In elasticsearch.yml:
thread_pool.search.size: <size>
Also, that sounds like too many shards for a 6 node cluster. If possible, I would try reducing that.
"servers start reading from disk at 250 MB/s making the cluster unresponsive" -- this may be related to the maximum content of an HTTP request (http.max_content_length), which defaults to 100mb; if set to greater than Integer.MAX_VALUE, it is reset to 100mb.
With requests that large the cluster can become unresponsive, and you might see log entries related to this. Check the maximum read size against the indices.
See the Elasticsearch HTTP settings documentation.
A few things:
5.x has been EOL for years now; please upgrade as a matter of urgency
you are heavily oversharded
For point 2, you either need to:
upgrade to handle that number of shards; the memory management in 7.X is far superior
reduce your shard count by reindexing (see the sketch after this list)
add more nodes to deal with the load
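If you go the reindexing route, a hedged sketch (index names and shard counts are placeholders, not from the question) using the Elasticsearch low-level Java REST client could look like this: create a destination index with far fewer shards, then copy the data over with the _reindex API.

import java.util.Collections;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ShrinkByReindex {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // 1. Destination index with a single primary shard (adjust to your data volume).
        String createBody = "{\"settings\":{\"index.number_of_shards\":1,\"index.number_of_replicas\":1}}";
        client.performRequest("PUT", "/logs-2017-consolidated",
                Collections.<String, String>emptyMap(),
                new NStringEntity(createBody, ContentType.APPLICATION_JSON));

        // 2. Copy documents from a small source index into it.
        String reindexBody = "{\"source\":{\"index\":\"logs-2017-01-01\"},"
                + "\"dest\":{\"index\":\"logs-2017-consolidated\"}}";
        Response response = client.performRequest("POST", "/_reindex",
                Collections.<String, String>emptyMap(),
                new NStringEntity(reindexBody, ContentType.APPLICATION_JSON));
        System.out.println(response.getStatusLine());

        client.close();
    }
}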

Discrepancy between Cassandra trace and client-side latency

We're on Cassandra 2.0.15, and seeing huge read latencies (>60sec) coming up at regular intervals (about every 3min), from all app hosts. We measure this latency around calls to session.execute(stmt). At the same time, Cassandra traces report duration of <1s. We also ran, in a loop, a query via cqlsh from the same hosts during those peak latency times, and cqlsh always came back within 1s. What can explain this discrepancy at the Java driver level?
-- edit: in reply to comments --
Cassandra servers JVM settings: -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -Xms32G -Xmx32G -XX:+UseG1GC -Djava.net.preferIPv4Stack=true -Dcassandra.jmx.local.port=7199 -XX:+DisableExplicitGC.
Client side GC is negligible (below). Client settings: -Xss256k -Xms4G -Xmx4G, Cassandra driver version is 2.1.7.1
Client side measuring code:
val selectServiceNames = session.prepare(QueryBuilder.select("service_name").from("service_names"))

override def run(): Unit = {
  val start = System.currentTimeMillis()
  try {
    val resultSet = session.execute(selectServiceNames.bind())
    val serviceNames = resultSet.all()
    val elapsed = System.currentTimeMillis() - start
    latency.add(elapsed) // emits metric to statsd
    if (elapsed > 10000) {
      log.info("Canary2 sensed high Cassandra latency: " + elapsed + "ms")
    }
  } catch {
    case e: Throwable =>
      log.error(e, "Canary2 select failed")
  } finally {
    Thread.sleep(100)
    schedule()
  }
}
Cluster construction code:
def createClusterBuilder(): Cluster.Builder = {
  val builder = Cluster.builder()
  val contactPoints = parseContactPoints()
  val defaultPort = findConnectPort(contactPoints)
  builder.addContactPointsWithPorts(contactPoints)
  builder.withPort(defaultPort) // This ends up config.protocolOptions.port
  if (cassandraUsername.isDefined && cassandraPassword.isDefined)
    builder.withCredentials(cassandraUsername(), cassandraPassword())
  builder.withRetryPolicy(ZipkinRetryPolicy.INSTANCE)
  builder.withLoadBalancingPolicy(new TokenAwarePolicy(new LatencyAwarePolicy.Builder(new RoundRobinPolicy()).build()))
}
One more observation I cannot explain: I ran two threads that execute the same query in the same manner (as above) in a loop; the only difference is that the yellow thread sleeps 100 ms between queries, and the green thread sleeps 60 sec between queries. The green thread hits low latency (under 1s) much more often than the yellow one.
This is a common problem when you get a component to test itself:
you can experience delays which are not visible to the tools in question;
your component has no idea when the request should have started;
when the JVM stops, this can prevent you from seeing the delays you are trying to measure.
The most likely explanation is the second one. Say you have a queue of 100 tasks, and because the system is running slowly each task takes 1 second. You time each task internally and it sees that it took 1 second; however, add 100 tasks to the queue and the first one starts after 0 seconds, but the last starts after 99 seconds and then reports it took 1 second, while from your point of view it took 100 seconds to complete, 99 seconds of which was waiting to start.
There can also be delays in the result reaching you, but this is less likely unless the work you do processing the results takes longer than the database does; i.e. you might assume the bottleneck is on the server when it isn't.
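To illustrate the queueing effect with a toy example (not the poster's code): each task below measures about one second internally, yet tasks that sit behind others in a single-threaded executor take much longer end to end.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueueWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 5; i++) {
            final long enqueued = System.currentTimeMillis();
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    long started = System.currentTimeMillis();
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                    long finished = System.currentTimeMillis();
                    // "Internal" latency looks fine; latency seen from the caller does not.
                    System.out.printf("internal=%dms, end-to-end=%dms%n",
                            finished - started, finished - enqueued);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}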
I tracked the issue down to queries timing out on the nodes in the remote data center. The cluster has nodes in two DCs, but the keyspace is only replicated within the local DC, so it is surprising that remote nodes were even considered. I was able to bring the latency down by
changing from ONE to LOCAL_ONE consistency, and
changing from the plain round-robin load balancer to a DC-aware one (also using latency-aware and token-aware), as sketched below.
It still feels to me like a bug in the Java driver that it tries to use nodes from the remote data center as coordinator nodes when the keyspace clearly does not exist in that data center. Also, even if that weren't somehow the case, I was also using the latency-aware policy, which should have excluded remote DC nodes from consideration.
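In driver terms, the fix boiled down to something like this (shown in Java rather than Scala; the DC name is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class FixedClusterFactory {
    public static Cluster build(String[] contactPoints) {
        return Cluster.builder()
                .addContactPoints(contactPoints)
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        new LatencyAwarePolicy.Builder(
                                new DCAwareRoundRobinPolicy("local_dc_name")) // was RoundRobinPolicy
                                .build()))
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))     // was ONE
                .build();
    }
}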

storm - finding source(s) of latency

I have a three-part topology that's having some serious latency issues, but I'm having trouble figuring out where they come from.
kafka -> db lookup -> write to cassandra
The numbers from the storm UI look like this:
(I see that the bolts are running at > 1.0 capacity)
If the process latency for the two bolts is ~65 ms, why is the 'complete latency' > 400 sec? The 'failed' tuples are coming from timeouts, I suspect, as the latency value is steadily increasing.
The tuples are connected via shuffleGrouping.
Cassandra lives on AWS so there are likely network limitations en route.
The storm cluster has 3 machines. There are 3 workers in the topology.
Your topology has several problems:
Look at the capacity of the decode_bytes_1 and save_to_cassandra bolts. Both are over 1 (capacity should stay under 1), which means you are using more resources than you have available; that is, the topology can't handle the load.
TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day, i.e. if you have peak hours and can catch up during the off-peak hours.
You need to increase the number of worker machines or optimize the code in the bottleneck bolts (or maybe both); otherwise you will not be able to process all the tuples.
You can probably improve the Cassandra persister by inserting in batches instead of inserting tuples one by one (a rough sketch follows after this answer)...
I seriously recommend you always set TOPOLOGY_MAX_SPOUT_PENDING to a conservative value. Max spout pending means the maximum number of un-acked tuples inside the topology; remember this value is multiplied by the number of spouts, and tuples will time out (fail) if they are not acknowledged within 30 seconds of being emitted.
And yes, your problem is tuples timing out; this is exactly what is happening.
(EDIT) If you are running in the dev environment (or have just deployed the topology) you might experience a spike in traffic generated by messages that were not yet consumed by the spout; it's important to prevent this case from negatively affecting your topology -- you never know when you will need to restart the production topology or perform some maintenance. If this is the case you can handle it as a temporary spike in traffic -- the spout needs to consume all the messages produced while the topology was off-line -- and after some minutes the frequency of incoming tuples stabilizes; you can handle this with the max spout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster and a CPU usage of 0.1, you can add more executors to the bolts.
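As a rough sketch of the batching idea from point 4 (the table and column names are invented, not from the question), the persister could buffer bound statements and flush them as a single UNLOGGED batch; keep batches modest, since very large or multi-partition batches add coordinator load.

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class BatchingPersister {
    private static final int BATCH_SIZE = 100; // keep batches small
    private final Session session;
    private final PreparedStatement insert;
    private final List<BoundStatement> pending = new ArrayList<BoundStatement>();

    public BatchingPersister(Session session) {
        this.session = session;
        this.insert = session.prepare(
                "INSERT INTO events (id, payload) VALUES (?, ?)"); // placeholder schema
    }

    public void add(String id, String payload) {
        pending.add(insert.bind(id, payload));
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        batch.addAll(pending);
        session.execute(batch);
        pending.clear();
    }
}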
FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500:
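For reference, a minimal sketch of wiring that setting in when submitting the topology (topology name and builder setup are placeholders; package names assume Storm 1.x, older releases use backtype.storm instead of org.apache.storm):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class TopologyLauncher {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... set spout and bolts here ...

        Config stormConfig = new Config();
        stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); // cap un-acked tuples
        stormConfig.setNumWorkers(3);

        StormSubmitter.submitTopology("my-topology", stormConfig, builder.createTopology());
    }
}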

Multiple threads waiting for nothing?

TL;DR: during a massive multithreaded database insertion, multiple threads are waiting for no evident reason.
We need to create multiple rows in a database. To speed up insertion, we use multithreading so that multiple objects can be generated and inserted in parallel. We are using Hibernate, Spring batch and Spring scheduling (ThreadPoolTaskExecutor, Partitioner, ItemProcessor). We started from this example.
We looked at thread states with JVisualVM and noticed that there are never more than 8 active threads at a time, regardless of the hardware running the program. We tried "standard desktop" computers (dual core), but also two AIX machines: one with 8 active CPUs, one with 60 active CPUs.
Any idea why we can't have more than 8 working threads at a time?
A list of things we already checked:
All threads have work to do (the Partitioner and ThreadPoolTaskExecutor are configured so that each thread has the same amount of data to insert in the DB; see the executor sketch after this list).
We tried various commit-intervals: 1, P where P is the size of a partition, and N where N is the sum of all P (it should not be the cause of the problem, but committing data seems to be the long part of the job, while data generation is fast).
8 is not the default value of any parameter of the objects we use.
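For context, the executor wiring is roughly like the following sketch (not our actual configuration; all sizes are made up). Note that a ThreadPoolTaskExecutor with the default unbounded queue never grows past corePoolSize, so that value is what effectively caps parallelism.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class BatchExecutorConfig {
    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(16); // placeholder: desired number of partition worker threads
        executor.setMaxPoolSize(16);
        executor.setThreadNamePrefix("partition-");
        return executor;
    }
}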

Durable map to map to queue for fair scheduling?

Our system needs to process billions of queries from thousands of clients for millions of resources. Some resources will be queried much more often than others. Each client will submit anywhere from hundreds to hundreds-of-millions of queries at a time. Because each resource can only support thousands of queries per minute, the queries will be enqueued and their results will be determined asynchronously.
Now, here's the rub: Each client's queries need to be given equal priority with respect to each resource. That is, if one client submits a million queries for a particular resource, and then another client submits a dozen, immediately after, then the second client should not have to wait for the first client's queries to be processed before theirs are. Rather, first the one client's first query should be handled, and then the other's first query, then the first's second query, and so on, back and forth. (And the analogous idea for more than two clients, and multiple resources; also, it can be a little less granular, as long as this basic idea is preserved).
If this were small enough to fit in memory, we'd just have a map from resources to a map from accounts to a queue of queries, and circularly iterate over the accounts per resource (a toy sketch follows below); but it's not, so we need a disk-based solution. We also need it to be robust, highly available, transactional, etc. What are my options? I'm using Java SE.
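For concreteness, the in-memory version we have in mind is roughly this (a toy sketch with invented names; durability, high availability and transactions are exactly what it lacks):

import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

public class FairResourceScheduler<Q> {
    // resource -> (account -> pending queries for that account)
    private final Map<String, LinkedHashMap<String, Queue<Q>>> byResource =
            new LinkedHashMap<String, LinkedHashMap<String, Queue<Q>>>();

    public synchronized void submit(String resource, String account, Q query) {
        LinkedHashMap<String, Queue<Q>> accounts = byResource.get(resource);
        if (accounts == null) {
            accounts = new LinkedHashMap<String, Queue<Q>>();
            byResource.put(resource, accounts);
        }
        Queue<Q> queue = accounts.get(account);
        if (queue == null) {
            queue = new ArrayDeque<Q>();
            accounts.put(account, queue);
        }
        queue.add(query);
    }

    // Take up to 'budget' queries for one resource, one per account per pass (round robin).
    public synchronized Queue<Q> drain(String resource, int budget) {
        Queue<Q> batch = new ArrayDeque<Q>();
        LinkedHashMap<String, Queue<Q>> accounts = byResource.get(resource);
        if (accounts == null) {
            return batch;
        }
        while (batch.size() < budget && !accounts.isEmpty()) {
            Iterator<Map.Entry<String, Queue<Q>>> it = accounts.entrySet().iterator();
            while (it.hasNext() && batch.size() < budget) {
                Queue<Q> queue = it.next().getValue();
                Q next = queue.poll();
                if (next != null) {
                    batch.add(next);
                }
                if (queue.isEmpty()) {
                    it.remove(); // account drained; drop it until it submits again
                }
            }
        }
        return batch;
    }
}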
Thanks in advance!
Ahead of time, I know HBase much better than I do Cassandra. Some aspects of my response are HBase specific, and I'll mark them as such.
Assuming that you provision enough hardware, then a BigTable implementation like Cassandra or HBase would give you the following:
The ability to store and retrieve your queries at an extremely high rate
The ability to absorb deletes at an extremely high rate (though with HBase and Cassandra, flushing writes to disk can cause periodic delays)
Trivially, I could see a schema where you used resource-id as the row key and account-id plus perhaps a timestamp as the column key, but (in HBase specifically) this could lead to hotspots in the servers hosting certain popular resources (in both HBase and Cassandra, a single server is responsible for hosting the master copy of any given row at a time). In Cassandra you can reduce the overhead of updates by using async writes (writing to only one or two nodes, and allowing gossip to replicate them), but this could result in old records being around dramatically longer than you expect in situations where network traffic is high. In HBase writes are always consistent and always written to the RegionServer hosting the row, so hotspotting is definitely a potential problem.
You can reduce the impact of hotspotting by making your row key a combination of resource ID and account ID, but then you need to scan all row keys to determine the list of accounts that have outstanding queries for a resource (both layouts are sketched in CQL terms below).
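Translated into hypothetical CQL (keyspace, table and column names are invented, and the keyspace is assumed to already exist), the two layouts look roughly like this when executed through the DataStax Java driver:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("fairqueue");

        // Option A: resource id alone is the partition key; account id + time order the
        // columns. Easy to walk the accounts for a resource, but popular resources
        // hotspot the node that owns the partition.
        session.execute("CREATE TABLE IF NOT EXISTS queries_by_resource ("
                + " resource_id text, account_id text, submitted timeuuid, query_blob blob,"
                + " PRIMARY KEY (resource_id, account_id, submitted))");

        // Option B: (resource id, account id) together form the partition key. Load is
        // spread out, but listing the accounts with pending queries for a resource now
        // requires scanning partitions or keeping a separate index table.
        session.execute("CREATE TABLE IF NOT EXISTS queries_by_resource_account ("
                + " resource_id text, account_id text, submitted timeuuid, query_blob blob,"
                + " PRIMARY KEY ((resource_id, account_id), submitted))");

        cluster.close();
    }
}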
One other advantage that you may not have considered is the potential to run your queries directly on the HBase or Cassandra data nodes, saving you the need to ship your query over the network again to an executor process to actually run it. You might want to look into HBase Coprocessors or Cassandra plugins to do something like that. Specifically, I am talking about turning this workflow:
/-> Query -> Executor -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Executor -> Resource -> Results -> --> Client
\-> Query -> Executor -> Resource -> Results -> /
into something like:
/-> Query -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Resource -> Results -> --> Client
\-> Query -> Resource -> Results -> /
This may not make sense in your use case though.
I can give you some answers with respect to Cassandra.
Cassandra internally writes only new data files and only does so sequentially, never overwriting or modifying existing files, and it has an append-only write-ahead log like transactional relational databases. Cassandra internally treats deletes as essentially just another write.
Cassandra is linearly scalable across many nodes and has no single point of failure. It is linearly scalable for both reads and writes. That is to say, a single cluster can support any number of concurrent reads and writes you wish to throw at it, so long as you add enough nodes to the cluster and give the cluster time to rebalance data across the new nodes. Netflix recently load-tested Cassandra on EC2 and found linear scalability, with the largest cluster they tested at 288 nodes supporting 1,000,000 writes/sec sustained for an hour.
Cassandra supports many consistency levels. When performing each read or write from Cassandra, you specify with what consistency level you want that read or write to be executed. This lets you determine, per-read and per-write, whether that read or write must be fast or must be done consistently across all nodes hosting that row.
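For example (the table and values are invented, not from the question), with the DataStax Java driver the consistency level can be chosen per statement:

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ConsistencyExample {
    // A fast, single-replica write followed by a stricter quorum read of the same row.
    public static void readAfterWrite(Session session, String id, String payload) {
        Statement write = new SimpleStatement(
                "INSERT INTO fairqueue.results (id, payload) VALUES (?, ?)", id, payload)
                .setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        Statement read = new SimpleStatement(
                "SELECT payload FROM fairqueue.results WHERE id = ?", id)
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(read);
    }
}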
Cassandra does not support multi-operation transactions.
If the Cassandra data model works well in your case, Cassandra may well be the simplest solution, at least at the operations level. Every node is configured exactly alike. There are no masters and no slaves, only equal peers. It is not necessary to set up separate load balancing, failover, heartbeats, log shipping, replication, etc.
But the only way to find out for sure is to test it out.
