We're on Cassandra 2.0.15, and seeing huge read latencies (>60sec) coming up at regular intervals (about every 3min), from all app hosts. We measure this latency around calls to session.execute(stmt). At the same time, Cassandra traces report duration of <1s. We also ran, in a loop, a query via cqlsh from the same hosts during those peak latency times, and cqlsh always came back within 1s. What can explain this discrepancy at the Java driver level?
-- edit: in reply to comments --
Cassandra servers JVM settings: -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -Xms32G -Xmx32G -XX:+UseG1GC -Djava.net.preferIPv4Stack=true -Dcassandra.jmx.local.port=7199 -XX:+DisableExplicitGC.
Client side GC is negligible (below). Client settings: -Xss256k -Xms4G -Xmx4G, Cassandra driver version is 2.1.7.1
Client side measuring code:
val selectServiceNames = session.prepare(QueryBuilder.select("service_name").from("service_names"))
override def run(): Unit = {
val start = System.currentTimeMillis()
try {
val resultSet = session.execute(selectServiceNames.bind())
val serviceNames = resultSet.all()
val elapsed = System.currentTimeMillis() - start
latency.add(elapsed) // emits metric to statsd
if (elapsed > 10000) {
log.info("Canary2 sensed high Cassandra latency: " + elapsed + "ms")
}
} catch {
case e: Throwable =>
log.error(e, "Canary2 select failed")
} finally {
Thread.sleep(100)
schedule()
}
}
Cluster construction code:
def createClusterBuilder(): Cluster.Builder = {
val builder = Cluster.builder()
val contactPoints = parseContactPoints()
val defaultPort = findConnectPort(contactPoints)
builder.addContactPointsWithPorts(contactPoints)
builder.withPort(defaultPort) // This ends up config.protocolOptions.port
if (cassandraUsername.isDefined && cassandraPassword.isDefined)
builder.withCredentials(cassandraUsername(), cassandraPassword())
builder.withRetryPolicy(ZipkinRetryPolicy.INSTANCE)
builder.withLoadBalancingPolicy(new TokenAwarePolicy(new LatencyAwarePolicy.Builder(new RoundRobinPolicy()).build()))
}
One more observation I cannot explain. I ran two threads that execute the same query in the same manner (as above) in a loop, the only difference is yellow thread sleeps 100millisec between queries, and green thread sleeps 60sec between queries. Green thread hits low latency (under 1s) much more often than the yellow one.
This is a common problem when you get a compoent to test itself.
you can experience delays which are not visible to the tools in question.
your component has no idea when the request should have started.
when the JVM stops, this can prevent you from seeing the delays you are try to measure.
The most likely explanation is the second one. Say you have a queue of 100 tasks but because the system is running slowly each task is taking 1 second. You time each task internally and it sees it took 1 seconds, however add 100 tasks to the queue and the first one starts after 0 second, but the last starts after 99 seconds and then reports it took 1 second, but from your point of view it took 100 seconds to complete, 99 seconds of which was waiting to start.
There can also be delays in the result reaching you but this is less likely unless the operations you do in processing the results is more than the database takes. i.e. you might assume the bottleneck is on the server.
I tracked the issue down to queries timing out on the nodes from remote data center. The cluster has nodes in two DCs, but the keyspace is only replicated within the local DC, so it is surprising that remove nodes were even considered. I was able to bring the latency down by
changing from ONE to LOCAL_ONE consistency and
changing from plain round-robin load balancer to DC-aware one (also using latency-aware and token-aware).
It still feels to me like a bug in the Java driver that it tries to use nodes from remote data center as coordination nodes when the keyspace is clearly non-existent in that data center. Also, even if that wasn't possible somehow, I was also using latency-aware policy, which should've excluded remote DC nodes from consideration.
Related
My current system has around 100 thousand running graphs, Each is built like that:
Amqp Source ~> Processing Stage ~> Sink
Each amqp source receives messages at a rate of 1 per second. Only around 10 thousand graphs receive messages at once, So I've figured there is no need for more than 10 thousand threads running concurrently.
These are currently the settings i'm using:
my-dispatcher {
type = Dispatcher
executor = "fork-join-executor"
fork-join-executor {
parallelism-min = 16
parallelism-factor = 2.0
parallelism-max = 32
}
throughput = 20
}
Obviously these settings are not defining enough resources for the wanted performances, So I wonder:
Am I correct to assume that 10 thousand threads are enough?
Is it possible to configure the dispatcher (by editing application.conf) for that amount of threads? How would the configuration look like? Should I pick "fork-join-executor" or "thread-pool-executor" as the executor?
Thanks.
Akka and Akka Streams is based on async, an actor or stream only uses a thread for a chunk of processing and then hands the thread back to the threadpool, this is nice because you can size the threadpool according the number of cores you have to actually execute the threads rather than the things you want to execute. Having many threads will have an overhead, both in scheduling/switching and in that the JVM allocates a stack of somewhere around 0.5-1Mb per thread.
So, 10 thousand actors or running streams, can still execute fine on a small thread pool. Increasing the number of threads may rather slow the processing down than make anything faster as more time is spent on switching between threads. Even the default settings may be fine and you should always benchmark when tuning to see if the changes had the effect you expected.
Generally the fork join pool gives good performance for actors and streams. The thread-pool based one is good for use cases where you cannot avoid blocking (see this section of the docs: https://doc.akka.io/docs/akka/current/dispatchers.html#blocking-needs-careful-management)
I can see a strange behavior with datastax cassandra driver (3.0). I've created a new cluster, then I've started a set of threads using the same cluster object. If I keep threads to 1 or 2, I see an avg extraction time of 5ms, but if I increase threads to 60, extraction time increase to 200ms (per single thread). Strange thing is that, if I let the 60 threads app running and I start on the same machine another process with only 1 threads, extraction time for that single threaded app is again 5ms. So it seems something related to client. I've repeated the same tests many times to avoid the cache cold start problem.
Here is how cluster object is configured:
PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions
.setConnectionsPerHost(HostDistance.LOCAL, parallelism, parallelism+20)
.setConnectionsPerHost(HostDistance.REMOTE, parallelism, parallelism+20)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 2000);
this.cluster = Cluster.builder()
.addContactPoints(nodes)
.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
.withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
.withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
.withCompression(Compression.LZ4)
.withPoolingOptions(poolingOptions)
.withProtocolVersion(ProtocolVersion.V4)
.build();
Does anyone have experienced the same problem? It seems a client configuration issue. Maybe some additional missing configuration for Netty?
UPDATE 1
What application is doing is extracting chunk of data using a query like:
select * from table where id=? and ts>=? and ts<?
So I have 60 threads that are extracting those data in parallel. id is the partition key. Every query is executed by the thread as:
//Prepare statement
PreparedStatement stmt = ... get the prepared statment cached
BoundStatement bstmt = stmt.bind(...)
//Execute query
long te1 = System.nanoTime();
ResultSet rs = this.session.execute(bstmt);
long te2 = System.nanoTime();
//Fetch...
Iterator<Row> iterator = rs.iterator();
while (!rs.isExhausted() && iterator.hasNext()) { .... }
session is one and shared cross all threads. What I'm measuring is the avg time of the session.execute() method call.
Thanks!
UPDATE 2
Here is schema definition
CREATE TABLE d_t (
id bigint,
xid bigint,
ts timestamp,
avg double,
ce double,
cg double,
p double,
w double,
c double,
sum double,
last double,
max double,
min double,
p75 double,
p90 double,
p95 double,
squad double,
sumq double,
wavg double,
weight double,
PRIMARY KEY ((id), xid, ts)
) WITH CLUSTERING ORDER BY (xid DESC, ts DESC)
and compaction = {'class': 'SizeTieredCompactionStrategy'}
and gc_grace_seconds=86400
and caching = { 'keys' : 'ALL', 'rows_per_partition':'36000' }
and min_index_interval = 2
and max_index_interval = 20;
UPDATE 3
Also tried with
.setMaxRequestsPerConnection(HostDistance.LOCAL, 1)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 1)
with no changes
Ultimately I think it will depend on what your code is doing. Can you share an example?
With regards to increased latency, how are you measuring this? Based on your statement:
Strange thing is that, if I let the 60 threads app running and I start on the same machine another process with only 1 threads, extraction time for that single threaded app is again 5ms.
60 concurrent requests really isn't all too much and in general, you shouldn't need to do a thread-per-request using the datastax java driver. You can achieve high throughput with a single application thread as the netty event loop group the driver uses will do most of the work.
The native protocol C* uses allows many requests per connection. As you have configured here, each connection is maxed out to 32768 concurrent requests. In reality, you don't really need to touch this configuration at all, as the default (1000 requests per connection) is sensible as in practice C* is not going to process more than native_transport_max_threads from cassandra.yaml (128 default) at a time and queue up the rest.
Because of this, you do not need many connections for each host. The default of 1 core connection per host should be more than enough for 60 concurrent requests. Increasing the number of connections per host won't do much for you and in profiling I've found diminishing returns beyond 8 connections per host with high throughputs (thousands of concurrent requests) and throughput getting worse past 16 connections per host, though your mileage may vary based on environment.
With all that said, I would recommend not configuring PoolingOptions beyond the default, other than maybe setting core and max to 8 for scenarios where you are trying to achieve higher throughputs (> 10k requests/sec).
I have two actors. Each actor is in a different ActorSystem. First caches ActorRef of a second. First actor does:
actorRef.tell(msg, self())
and sends a message to the second actor, which does some processing and replies with
getSender().tell(reply, self())
Problem: Initial tell() from first to second actor sometimes takes 1-3 minutes(!) to deliver the message.
There are no other messages sent in Akka apart from this one meaning that mailboxes are empty - system is serving a single request.
System details:
Application has 500 scheduled actors that poll Amazon SQS with a request (SQS is empty) each 30 seconds (blocking). It has another 330 actors that do nothing in my scenario. All actors are configured with default Akka dispatcher.
Box is Amazon EC2 instance with 2 cores and 8gb RAM. CPU and RAM utilization is <5%. JVM has around 1000 threads.
Initial guess is CPU starvation and context switching from too many threads. BUT is not reproducible locally on my i7 machine with 4 cores even having x10 number of actors which uses 75% of available RAM.
How can I actually find the cause of this problem? Is it possible to profile Akka infrastructure to see what takes this message to spend so much time in transit from one actor to another?
Context switching from too many threads was a probable source of this problem. To fix it the following configuration was added:
actor {
default-dispatcher {
executor = "fork-join-executor"
fork-join-executor
{ parallelism-min = 8 parallelism-factor = 12.0 parallelism-max = 64 task-peeking-mode = "FIFO" }
}
}
Thus, we increase the number of threads per physical core from 6 to 24. 24 is enough for our application to run smoothly. No starvation observed during regression tests.
I have a three part topology that's having some serious latency issues but I'm having trouble figuring out where.
kafka -> db lookup -> write to cassandra
The numbers from the storm UI look like this:
(I see that the bolts are running at > 1.0 capacity)
If the process latency for the two bolts is ~65ms why is the 'complete latency' > 400 sec? The 'failed' tuples are coming from timeouts I suspect as the latency value is steadily increasing.
The tuples are connected via shuffleGrouping.
Cassandra lives on AWS so there are likely network limitations en route.
The storm cluster has 3 machines. There are 3 workers in the topology.
Your topology has several problems:
look at the capacity of the decode_bytes_1 and save_to_cassandra spouts. Both are over 1 (the sum of all spouts capacity should be under 1), which means you are using more resources than what do you have available. This is, the topology can't handle the load.
The TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day. This is, if you have peek hours, and you will be catch up during the off-peek hours.
You need to increase the number of worker machines or optimize the code in the bottle neck spouts (or maybe both). Otherwise you will not be able to process all the tuples.
You probably can improve the cassandra persister by inserting in batches instread of insert tuples one by one...
I seriously recommend you to always set the TOPOLOGY_MAX_SPOUT_PENDING for a conservative value. The max spout pending, means the maximum number of un-acked tuples inside the topology, remember this value is multiplied by the number of spots and the tuples will timeout (fail) if they are not acknowledged 30 seconds after being emitted.
And yes, your problem is having tuples timing out, this is exactly what is happening.
(EDIT) if you are running the dev environment (or just after deploy the topology) you might experience a spike in the traffic generated by messages that were not yet consumed by the spout; it's important you prevent this case to negatively affect your topology -- you never know when you need to restart the production topology, or perform some maintenance --, if this is the case you can handle it as a temporary spike in the traffic --the spout needs to consume all the messages produced while the topology was off-line -- and after a some (or many minutes) the frequency of incoming tuples stabilizes; you can handle this with max pout pending parameter (read item 2 again).
Considering you have 3 nodes in your cluster, and cpu usage of 0,1 you can add more executers to the bolts.
FWIW - it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and it appears (so far) that the problem has been alleviated. Possible 'thundering herd' issue?
After setting the TOPOLOGY_MAX_SPOUT_PENDING to 500:
It seems there is a limit on the number of jobs that Quartz scheduler can run per second. In our scenario we are having about 20 jobs per second firing up for 24x7 and quartz worked well upto 10 jobs per second (with 100 quartz threads and 100 database connection pool size for a JDBC backed JobStore), however, when we increased it to 20 jobs per second, quartz became very very slow and its triggered jobs are very late compared to their actual scheduled time causing many many Misfires and eventually slowing down the overall performance of the system significantly. One interesting fact is that JobExecutionContext.getScheduledFireTime().getTime() for such delayed triggers comes to be 10-20 and even more minutes after their schedule time.
How many jobs the quartz scheduler can run per second without affecting the scheduled time of the jobs and what should be the optimum number of quartz threads for such load?
Or am I missing something here?
Details about what we want to achieve:
We have almost 10k items (categorized among 2 or more categories, in current case we have 2 categories) on which we need to some processing at given frequency e.g. 15,30,60... minutes and these items should be processed within that frequency with a given throttle per minute. e.g. lets say for 60 minutes frequency 5k items for each category should be processed with a throttle of 500 items per minute. So, ideally these items should be processed within first 10 (5000/500) minutes of each hour of the day with each minute having 500 items to be processed which are distributed evenly across the each second of the minute so we would have around 8-9 items per second for one category.
Now for to achieve this we have used Quartz as scheduler which triggers jobs for processing these items. However, we don't process each item with in the Job.execute method because it would take 5-50 seconds (averaging to 30 seconds) per item processing which involves webservice call. We rather push a message for each item processing on JMS queue and separate server machines process those jobs. I have noticed the time being taken by the Job.execute method not to be more than 30 milliseconds.
Server Details:
Solaris Sparc 64 Bit server with 8/16 cores/threads cpu for scheduler with 16GB RAM and we have two such machines in the scheduler cluster.
In a previous project, I was confronted with the same problem. In our case, Quartz performed good up a granularity of a second. Sub-second scheduling was a stretch and as you are observing, misfires happened often and the system became unreliable.
Solved this issue by creating 2 levels of scheduling: Quartz would schedule a job 'set' of n consecutive jobs. With a clustered Quartz, this means that a given server in the system would get this job 'set' to execute. The n tasks in the set are then taken in by a "micro-scheduler": basically a timing facility that used the native JDK API to further time the jobs up to the 10ms granularity.
To handle the individual jobs, we used a master-worker design, where the master was taking care of the scheduled delivery (throttling) of the jobs to a multi-threaded pool of workers.
If I had to do this again today, I'd rely on a ScheduledThreadPoolExecutor to manage the 'micro-scheduling'. For your case, it would look something like this:
ScheduledThreadPoolExecutor scheduledExecutor;
...
scheduledExecutor = new ScheduledThreadPoolExecutor(THREAD_POOL_SIZE);
...
// Evenly spread the execution of a set of tasks over a period of time
public void schedule(Set<Task> taskSet, long timePeriod, TimeUnit timeUnit) {
if (taskSet.isEmpty()) return; // or indicate some failure ...
long period = TimeUnit.MILLISECOND.convert(timePeriod, timeUnit);
long delay = period/taskSet.size();
long accumulativeDelay = 0;
for (Task task:taskSet) {
scheduledExecutor.schedule(task, accumulativeDelay, TimeUnit.MILLISECOND);
accumulativeDelay += delay;
}
}
This gives you a general idea on how use the JDK facility to micro-schedule tasks. (Disclaimer: You need to make this robust for a prod environment, like check failing tasks, manage retries (if supported), etc...).
With some testing + tuning, we found an optimal balance between the Quartz jobs and the amount of jobs in one scheduled set.
We experienced a 100X throughput improvement in this way. Network bandwidth was our actual limit.
First of all check How do I improve the performance of JDBC-JobStore? in Quartz documentation.
As you can probably guess there is in absolute value and definite metric. It all depends on your setup. However here are few hints:
20 jobs per second means around 100 database queries per second, including updates and locking. That's quite a lot!
Consider distributing your Quartz setup to cluster. However if database is a bottleneck, it won't help you. Maybe TerracottaJobStore will come to the rescue?
Having K cores in the system everything less than K will underutilize your system. If your jobs are CPU intensive, K is fine. If they are calling external web services, blocking or sleeping, consider much bigger values. However more than 100-200 threads will significantly slow down your system due to context switching.
Have you tried profiling? What is your machine doing most of the time? Can you post thread dump? I suspect poor database performance rather than CPU, but it depends on your use case.
You should limit your number of threads to somewhere between n and n*3 where n is the number of processors available. Spinning up more threads is going to cause a lot of context switching, since most of them will be blocked most of the time.
As far as jobs per second, it really depends on how long the jobs run and how often they're blocked for operations like network and disk io.
Also, something to consider is that perhaps quartz isn't the tool you need. If you're sending off 1-2 million jobs a day, you might want to look into a custom solution. What are you even doing with 2 million jobs a day?!
Another option, which is a really bad way to approach the problem, but sometimes works... what is the server it's running on? Is it an older server? It might be bumping up the ram or other specs on it will give you some extra 'umph'. Not the best solution, for sure, because that delays the problem, not addresses, but if you're in a crunch it might help.
In situations with high amount of jobs per second make sure your sql server uses row lock and not table lock. In mysql this is done by using InnoDB storage engine, and not the default MyISAM storage engine which only supplies table lock.
Fundamentally the approach of doing 1 item at a time is doomed and inefficient when you're dealing with such a large number of things to do within such a short time. You need to group things - the suggested approach of using a job set that then micro-schedules each individual job is a first step, but that still means doing a whole lot of almost nothing per job. Better would be to improve your webservice so you can tell it to process N items at a time, and then invoke it with sets of items to process. And even better is to avoid doing this sort of thing via webservices and process them all inside a database, as sets, which is what databases are good for. Any sort of job that processes one item at a time is fundamentally an unscalable design.