Quartz Performance

There seems to be a limit on the number of jobs that the Quartz scheduler can run per second. In our scenario about 20 jobs per second fire around the clock. Quartz worked well up to 10 jobs per second (with 100 Quartz threads and a database connection pool of 100 for a JDBC-backed JobStore). However, when we increased the load to 20 jobs per second, Quartz became very slow: jobs are triggered much later than their scheduled time, causing many misfires and eventually slowing down the whole system significantly. One interesting fact is that JobExecutionContext.getScheduledFireTime().getTime() for such delayed triggers turns out to be 10-20 minutes or more after their scheduled time.
How many jobs can the Quartz scheduler run per second without affecting the scheduled time of the jobs, and what is the optimal number of Quartz threads for such a load?
Or am I missing something here?
Details about what we want to achieve:
We have almost 10k items (split among two or more categories; currently two) on which we need to do some processing at a given frequency, e.g. every 15, 30, 60... minutes, and these items should be processed within that period with a given throttle per minute. For example, for a 60-minute frequency, 5k items per category should be processed with a throttle of 500 items per minute. Ideally these items would be processed within the first 10 (5000/500) minutes of each hour, with each minute handling 500 items distributed evenly across its seconds, giving around 8-9 items per second for one category.
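As a sanity check on the figures above, the arithmetic can be written out as a small sketch. The class and method names are hypothetical and only the numbers come from the description.

```java
// Worked arithmetic for the throttle described above.
// ThrottleMath and its methods are illustrative names, not part of Quartz.
final class ThrottleMath {
    // Minutes needed to drain `items` at `perMinute` items per minute (rounded up).
    static long minutesToDrain(long items, long perMinute) {
        return (items + perMinute - 1) / perMinute;
    }

    // Average number of triggers per second for a per-minute throttle.
    static double perSecond(long perMinute) {
        return perMinute / 60.0;
    }

    public static void main(String[] args) {
        System.out.println(minutesToDrain(5000, 500)); // 10 minutes per category
        System.out.println(perSecond(500));            // roughly 8.33 per second per category
        System.out.println(2 * perSecond(500));        // roughly 16.67 per second for two categories
    }
}
```

With two categories this lands close to the 20 jobs per second mentioned at the top.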
To achieve this we use Quartz as the scheduler, which triggers jobs for processing these items. However, we don't process each item within the Job.execute method, because processing takes 5-50 seconds per item (30 seconds on average) and involves a web service call. Instead we push a message for each item onto a JMS queue, and separate server machines process those jobs. I have measured the time taken by the Job.execute method at no more than 30 milliseconds.
Server Details:
Solaris SPARC 64-bit servers with an 8-core/16-thread CPU and 16 GB RAM for the scheduler; we have two such machines in the scheduler cluster.

In a previous project, I was confronted with the same problem. In our case, Quartz performed well down to a granularity of one second. Sub-second scheduling was a stretch and, as you are observing, misfires happened often and the system became unreliable.
We solved this issue by creating two levels of scheduling: Quartz would schedule a job 'set' of n consecutive jobs. With a clustered Quartz, this means that a given server in the system would get this job 'set' to execute. The n tasks in the set are then taken in by a "micro-scheduler": basically a timing facility that used the native JDK API to further time the jobs down to a 10 ms granularity.
To handle the individual jobs, we used a master-worker design, where the master was taking care of the scheduled delivery (throttling) of the jobs to a multi-threaded pool of workers.
If I had to do this again today, I'd rely on a ScheduledThreadPoolExecutor to manage the 'micro-scheduling'. For your case, it would look something like this:
ScheduledThreadPoolExecutor scheduledExecutor;
...
scheduledExecutor = new ScheduledThreadPoolExecutor(THREAD_POOL_SIZE);
...
// Evenly spread the execution of a set of tasks over a period of time.
// Task is assumed to implement Runnable.
public void schedule(Set<Task> taskSet, long timePeriod, TimeUnit timeUnit) {
    if (taskSet.isEmpty()) return; // or indicate some failure ...
    long period = TimeUnit.MILLISECONDS.convert(timePeriod, timeUnit);
    long delay = period / taskSet.size(); // gap between consecutive tasks
    long accumulativeDelay = 0;
    for (Task task : taskSet) {
        scheduledExecutor.schedule(task, accumulativeDelay, TimeUnit.MILLISECONDS);
        accumulativeDelay += delay;
    }
}
This gives you a general idea of how to use the JDK facility to micro-schedule tasks. (Disclaimer: you need to make this robust for a production environment, e.g. check for failing tasks, manage retries (if supported), etc.)
With some testing and tuning, we found an optimal balance between the number of Quartz jobs and the number of jobs in one scheduled set.
We experienced a 100X throughput improvement in this way. Network bandwidth was our actual limit.

First of all, check "How do I improve the performance of JDBC-JobStore?" in the Quartz documentation.
As you can probably guess, there is no absolute value or definitive metric. It all depends on your setup. However, here are a few hints:
20 jobs per second means around 100 database queries per second, including updates and locking. That's quite a lot!
Consider distributing your Quartz setup across a cluster. However, if the database is the bottleneck, that won't help you. Maybe TerracottaJobStore will come to the rescue?
With K cores in the system, fewer than K threads will underutilize it. If your jobs are CPU intensive, K is fine. If they are calling external web services, blocking or sleeping, consider much bigger values. However, more than 100-200 threads will significantly slow down your system due to context switching.
Have you tried profiling? What is your machine doing most of the time? Can you post thread dump? I suspect poor database performance rather than CPU, but it depends on your use case.
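The thread-count hint above can be made concrete with a commonly cited sizing heuristic: threads ≈ cores × (1 + wait time / compute time). This is an assumption brought in for illustration, not something the answer states, and the class name is hypothetical. Treat the result as a starting point for benchmarking rather than a definitive value.

```java
// A minimal sketch of a thread-pool sizing heuristic.
// Assumption: threads ~= cores * (1 + waitTime / computeTime); PoolSizing is an illustrative name.
final class PoolSizing {
    static int suggestedThreads(int cores, double waitMillis, double computeMillis) {
        if (cores <= 0 || computeMillis <= 0 || waitMillis < 0) {
            throw new IllegalArgumentException("cores and computeMillis must be positive, waitMillis non-negative");
        }
        long estimate = Math.round(cores * (1.0 + waitMillis / computeMillis));
        // Never go below the number of cores, which would underutilize the machine.
        return (int) Math.max(cores, estimate);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Purely CPU-bound jobs: one thread per core.
        System.out.println(suggestedThreads(cores, 0, 10));
        // Jobs that wait twice as long as they compute: three threads per core.
        System.out.println(suggestedThreads(cores, 20, 10));
    }
}
```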

You should limit your number of threads to somewhere between n and n*3 where n is the number of processors available. Spinning up more threads is going to cause a lot of context switching, since most of them will be blocked most of the time.
As far as jobs per second, it really depends on how long the jobs run and how often they're blocked for operations like network and disk io.
Also, something to consider is that perhaps quartz isn't the tool you need. If you're sending off 1-2 million jobs a day, you might want to look into a custom solution. What are you even doing with 2 million jobs a day?!
Another option, which is a really bad way to approach the problem but sometimes works: what server is it running on? Is it an older server? It might be that bumping up the RAM or other specs will give you some extra 'oomph'. It's not the best solution, for sure, because it delays the problem rather than addressing it, but if you're in a crunch it might help.

In situations with a high number of jobs per second, make sure your SQL server uses row locks and not table locks. In MySQL this is done by using the InnoDB storage engine rather than the MyISAM storage engine (the default before MySQL 5.5), which only supports table locks.

Fundamentally the approach of doing 1 item at a time is doomed and inefficient when you're dealing with such a large number of things to do within such a short time. You need to group things - the suggested approach of using a job set that then micro-schedules each individual job is a first step, but that still means doing a whole lot of almost nothing per job. Better would be to improve your webservice so you can tell it to process N items at a time, and then invoke it with sets of items to process. And even better is to avoid doing this sort of thing via webservices and process them all inside a database, as sets, which is what databases are good for. Any sort of job that processes one item at a time is fundamentally an unscalable design.
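A minimal sketch of the grouping idea, assuming the web service can be changed to accept a list of items per call (the answer proposes this but the original system does not have it). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list of work items into fixed-size batches so that one call
// (or one JMS message) can carry N items instead of one.
final class Batching {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With a throttle of 500 items per minute and a batch size of 50, this would mean 10 batch invocations per minute instead of 500 single-item jobs.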

Related

Configuring akka dispatcher for large amount of concurrent graphs

My current system has around 100 thousand running graphs. Each is built like this:
Amqp Source ~> Processing Stage ~> Sink
Each AMQP source receives messages at a rate of 1 per second. Only around 10 thousand graphs receive messages at once, so I figured there is no need for more than 10 thousand threads running concurrently.
These are the settings I'm currently using:
my-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 16
    parallelism-factor = 2.0
    parallelism-max = 32
  }
  throughput = 20
}
Obviously these settings do not provide enough resources for the desired performance, so I wonder:
Am I correct to assume that 10 thousand threads are enough?
Is it possible to configure the dispatcher (by editing application.conf) for that number of threads? What would the configuration look like? Should I pick "fork-join-executor" or "thread-pool-executor" as the executor?
Thanks.

Akka and Akka Streams are asynchronous: an actor or stream only uses a thread for a chunk of processing and then hands the thread back to the thread pool. This is nice because you can size the thread pool according to the number of cores you have to actually execute the threads, rather than the number of things you want to execute. Having many threads has an overhead, both in scheduling/switching and in that the JVM allocates a stack of somewhere around 0.5-1 MB per thread.
So 10 thousand actors or running streams can still execute fine on a small thread pool. Increasing the number of threads may slow processing down rather than speed it up, as more time is spent switching between threads. Even the default settings may be fine, and you should always benchmark when tuning to see whether the changes had the effect you expected.
Generally the fork join pool gives good performance for actors and streams. The thread-pool based one is good for use cases where you cannot avoid blocking (see this section of the docs: https://doc.akka.io/docs/akka/current/dispatchers.html#blocking-needs-careful-management)

Schedulers.elastic() not using early generated threads

I've written a web server on spring-boot-2.0, which uses Netty under the hood.
This application uses Schedulers.elastic() threads.
When it started, about 100 elastic threads were created. These threads were rarely used and the load was light. But after a working day, the number of threads in the elastic pool had increased to 1300, and execution now happens on the elastic-1XXX and elastic-12XX threads (the numbers in the names are above 100 and even 900).
Elastic, as I understand it, uses a cachedThreadPool under the hood.
Why have new elastic threads been created, and why have tasks switched to the new threads?
What are the criteria for adding new threads?
And why haven't the old threads (elastic-XX, elastic-1xx) been shut down?

Without more information about the type of workload, the maximum concurrency, burst and average task duration, it’s really hard to tell if there’s a problem here.
By definition the elastic scheduler is creating an unbounded number of threads, as long as new tasks are added to the queue.
If the workload is bursty, with high concurrency at regular times, then it’s not unexpected to find a large number of threads. You could leverage the newElastic variants to reduce the TTL (default is 60s).
Again, without more information it’s hard to tell but your workload might not fit this scheduler. If your workload is CPU bound, the parallel scheduler is a better fit. The elastic one is tailored for IO/latency bound tasks.

The problem was that I was using Schedulers.elastic(), which is intended for blocking operations, even though my service had no such operations. Once I removed elastic(), the service started working correctly (without the elastic threads).

Executors.newFixedThreadPool() - how expensive is this operation

I have a requirement where I need to process some tasks for the current live shows.
This is a scheduled task that runs every minute.
At any given minute there can be any number of live shows (though the number cannot be that large, at most about 10). There are more than 20 functionalities that need to be performed for all the live shows; in other words, there are 20 worker classes, each doing its own job.
Let's say that for the first functionality there are 5 shows, then after a few minutes the number of shows drops to 2, then after a few more minutes it increases to 7.
Currently I am doing something like this:
int totalShowsCount = getCurrentShowsCount();
ExecutorService executor = Executors.newFixedThreadPool(showIds.size());
The above statements get executed every minute.
Problem Statement
1.) How expensive is the above operation, i.e. creating a fixedThreadPool every minute?
2.) What can I do to optimize my solution? Should I use a fixed thread pool of, say, 10, with maybe 3, 5, 6 or any number of threads being utilized at any given minute?
Can I create a fixed thread pool at worker level, maintain it and reuse it?
FYI, I am using Java 8, in case a better approach is available.

How expensive is the above operation, i.e. creating a fixedThreadPool every minute?
Creating a thread pool is a relatively expensive operation which can take milliseconds. You don't want to be doing this many times per second.
A second is an eternity for a computer: a 36-core machine can execute as many as 100 billion instructions in that amount of time. A minute is a very, very long time, and if you only do something once a minute you could even restart your JVM every minute and still get reasonable throughput.
What can I do to optimize my solution? Should I use a fixed thread pool of, say, 10, with maybe 3, 5, 6 or any number of threads being utilized at any given minute?
Possibly; it depends on what you are doing. Without more analysis you can't say for sure. Note: if you are using parallelStream() (and if not, you should see whether you can), you can use the built-in ForkJoinPool.commonPool() and not need to create another pool. But again, this depends on what you are doing.
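A minimal sketch of the "create once, reuse every minute" option, assuming the work per show can be expressed as a Callable. The class name, the placeholder task body and the pool size of 10 (taken from the question's stated maximum number of live shows) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// One long-lived pool shared by every scheduled run, instead of a new pool per minute.
final class ShowProcessor {
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    // Called once per scheduled run with whatever shows are live at that moment.
    List<String> processAll(List<Integer> showIds) {
        List<Callable<String>> tasks = new ArrayList<>();
        for (Integer id : showIds) {
            tasks.add(() -> "processed show " + id); // placeholder for the real work
        }
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and keeps the input order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing shows", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a show task failed", e.getCause());
        }
    }

    // Call once on application shutdown.
    void shutdown() {
        pool.shutdown();
    }
}
```

Idle threads in a fixed pool cost little, so having 10 threads available while only 2 shows are live is usually not a concern.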

Java concurrency based on available FREE cpu

QUESTION
How do I scale to use more threads if and only if there is free cpu?
Something like a ThreadPoolExecutor that uses more threads when cpu cores are idle, and less or just one if not.
USE CASE
Current situation:
My Java server app processes requests and serves results.
There is a ThreadPoolExecutor to serve the requests with a reasonable number of max threads following the principle: number of cpu cores = number of max threads.
The work performed is cpu heavy, and there's some disk IO (DBs).
The code is linear, single threaded.
A single request takes between 50 and 500 ms to process.
Sometimes there are just a few requests per minute, and other times there are 30 simultaneous.
A modern server with 12 cores handles the load nicely.
The throughput is good, the latency is ok.
Desired improvement:
When there is a low number of requests, as is the case most of the time, many cpu cores are idle.
Latency could be improved in this case by running some of the code for a single request multi-threaded.
Some prototyping shows improvements, but as soon as I test with a higher number of concurrent requests,
the server goes bananas. Throughput goes down, memory consumption goes overboard.
30 simultaneous requests sharing a queue of 10 meaning that 10 can run at most while 20 are waiting,
and each of the 10 uses up to 8 threads at once for parallelism, seems to be too much for a machine
with 12 cores (out of which 6 are virtual).
This seems to me like a common use case, yet I could not find information by searching.
IDEAS
1) request counting
One idea is to count the current number of processed requests. If 1 or low then do more parallelism,
if high then don't do any and continue single-threaded as before.
This sounds simple to implement. Drawbacks are: request counter resetting must not contain bugs,
think finally. And it does not actually check available cpu, maybe another process uses cpu also.
In my case the machine is dedicated to just this application, but still.
2) actual cpu querying
I'd think that the correct approach would be to just ask the cpu, and then decide.
Since Java7 there is OperatingSystemMXBean.getSystemCpuLoad() see http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
but I can't find any webpage that mentions getSystemCpuLoad and ThreadPoolExecutor, or a similar
combination of keywords, which tells me that's not a good path to go.
The JavaDoc says "Returns the "recent cpu usage" for the whole system", and I'm wondering what
"recent cpu usage" means, how recent that is, and how expensive that call is.
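Idea 1 (request counting) could be sketched roughly as follows. The class name and threshold parameter are hypothetical. The try/finally block is what keeps the counter from leaking when a request fails.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Counts in-flight requests and tells each handler whether it may parallelize internally.
final class RequestGate {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int parallelThreshold;

    RequestGate(int parallelThreshold) {
        this.parallelThreshold = parallelThreshold;
    }

    // The handler receives true if few enough requests are running to allow sub-task parallelism.
    <R> R handle(Function<Boolean, R> handler) {
        int current = inFlight.incrementAndGet();
        try {
            return handler.apply(current <= parallelThreshold);
        } finally {
            inFlight.decrementAndGet(); // always reset, even if the handler throws
        }
    }

    int inFlight() {
        return inFlight.get();
    }
}
```

As noted above, this only reflects load inside this one application and says nothing about CPU used by other processes.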
UPDATE
I had left this question open for a while to see if more input is coming. Nope. Although I don't like the "no-can-do" answer to technical questions, I'm going to accept Holger's answer now. He has good reputation, good arguments, and others have approved his answer.
Myself I had experimented with idea 2 a bit. I queried the getSystemCpuLoad() in tasks to decide how large their own ExecutorService could be. As Holger wrote, when there is a SINGLE ExecutorService, resources can be managed well. But as soon as tasks start their own tasks, they cannot - it didn't work out for me.

There is no way of limiting based on “free CPU” and it wouldn’t work anyway. The information about “free CPU” is outdated as soon as you get it. Suppose you have twelve threads running concurrently, all detecting at the same time that there is one free CPU core and all deciding to schedule a sub-task…
What you can do is limit the maximum resource consumption, which works quite well when using a single ExecutorService with a maximum number of threads for all tasks.
The tricky part is the dependency of the tasks on the results of the sub-tasks, which are enqueued at a later time and might still be pending due to the limited number of worker threads.
This can be adjusted by revoking the parallel execution if the task detects that its sub-task is still pending. For this to work, create a FutureTask for the sub-task manually and schedule it with execute rather than submit. Then proceed within the task as normally and at the place where you would perform the sub-task in a sequential implementation check whether you can remove the FutureTask from the ThreadPoolExecutor. Unlike cancel this works only if it has not started yet and hence is an indicator that there are no free threads. So if remove returns true you can perform the sub-task in-place letting all other threads perform tasks rather than sub-tasks. Otherwise, you can wait for the result.
At this place it’s worth noting that it is ok to have more threads than CPU cores if the tasks accommodate I/O operations (or may wait for sub-tasks). The important point here is to have a limit.
FutureTask<Integer> coWorker = new FutureTask<>(/* callable wrapping sub-task*/);
executor.execute(coWorker);
// proceed in the task’s sequence
if(executor.remove(coWorker)) coWorker.run();// do in-place if needed
subTaskResult=coWorker.get();
// proceed
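The fragment above can be expanded into a self-contained sketch. The class name, pool setup and the trivial sub-task are assumptions for illustration. The core of it (execute a FutureTask, then try remove before falling back to get) follows the answer.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ReclaimSubTaskDemo {
    // Returns "in-place:42" if the sub-task was still queued and was run by the parent task,
    // or "pooled:42" if another worker thread had already picked it up.
    static String runOnce(int poolSize) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        try {
            FutureTask<String> parent = new FutureTask<>(() -> {
                FutureTask<Integer> coWorker = new FutureTask<>(() -> 6 * 7);
                executor.execute(coWorker); // execute, not submit, so the queue holds coWorker itself
                // ... the parent's own sequential work would happen here ...
                boolean reclaimed = executor.remove(coWorker); // true only if it has not started
                if (reclaimed) {
                    coWorker.run(); // no free worker: do the sub-task in-place
                }
                int subTaskResult = coWorker.get();
                return (reclaimed ? "in-place" : "pooled") + ":" + subTaskResult;
            });
            executor.execute(parent);
            return parent.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```

With a pool of one thread the parent occupies the only worker, so the sub-task stays queued and is reclaimed. With two threads the executor starts a second worker for the sub-task immediately, so remove returns false and the parent waits for the result.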

It sounds like the ForkJoinPool introduced in Java 7 would be exactly what you need. The ForkJoinPool is specifically designed to keep all your CPUs busy, meaning that there are as many threads as there are CPUs and that all those threads are working and not blocking (for the latter, make sure that you use ManagedBlockers for DB queries).
ForkJoinTask has the method getSurplusQueuedTaskCount, for which the JavaDoc says "This value may be useful for heuristic decisions about whether to fork other tasks." As such it serves as a better replacement for your getSystemCpuLoad solution when deciding about task decomposition. This allows you to reduce the number of decompositions when system load is high and thus reduce the impact of the task decomposition overhead.
Also see my answer here for a more in-depth explanation of the principles of Fork/Join pools.
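A minimal sketch of using getSurplusQueuedTaskCount as a forking heuristic, here on a simple array sum. The class name and both thresholds are assumptions chosen for illustration and would need tuning for real work.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums an array, forking only while the current worker has little surplus queued work.
final class AdaptiveSum extends RecursiveTask<Long> {
    private static final int SEQUENTIAL_THRESHOLD = 1_000; // assumed: below this, splitting is not worth it
    private static final int MAX_SURPLUS = 3;              // assumed: stop forking when enough tasks are queued

    private final long[] data;
    private final int from;
    private final int to;

    AdaptiveSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        boolean small = to - from <= SEQUENTIAL_THRESHOLD;
        // Static ForkJoinTask method: how many more tasks this worker holds than others could steal.
        boolean saturated = getSurplusQueuedTaskCount() > MAX_SURPLUS;
        if (small || saturated) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        AdaptiveSum left = new AdaptiveSum(data, from, mid);
        left.fork();                                            // offer the left half to other workers
        long right = new AdaptiveSum(data, mid, to).compute();  // do the right half in this thread
        return right + left.join();
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new AdaptiveSum(data, 0, data.length));
    }
}
```

When many requests are running, each worker's queue fills up quickly, so tasks stop splitting and run sequentially, which is the behaviour the question asks for.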

Program execution slows down the more threads I have running (Java)

I'm experiencing some strange behaviour in a Java program. Basically, I have a list of items to process, which I can choose to process one at a time, or all at once (which means 3-4 at a time). Each item needs about 10 threads to be processed, so processing 1 item at a time = 10 threads, 2 at a time = 20 threads, 4 at a time = 40 threads, etc.
Here's the strange thing: if I process just one item, it's done in approx. 50-150 milliseconds. But if I process 2 at a time, it goes up to 200-300 ms per item. 3 at a time = 300-500 ms per item, 4 at a time = 400-700 ms per item, etc.
Why is this happening? I've done prior research which says that the JVM can handle up to 3000-4000 threads easily, so why does it slow down with just 30-40 threads for me? Is this normal behaviour? I thought that having 40 threads would mean each thread would work in parallel rather than in a queue, as seems to be the case.

How many CPU cores do you have?
If I have one CPU core and I max out a single-threaded application on it, the CPU is always busy. If I give it two threads, both doing this heavy task, I don't get double the CPU: each thread gets about 0.5 seconds of CPU time per second, minus the time the OS needs to switch threads.
So each thread takes about twice as long to do its work, but they might finish at about the same time (depending on the scheduler).
If you have two CPU cores, then (again theoretically) the two threads finish in the same time as one thread would, because one thread can't use two CPU cores at the same time.
Then there are hardware threads; some threads yield or sleep, and if they're reading or writing, the OS will run other threads while they are blocked, and so forth.
Does this help?
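The reasoning above can be written as a small idealised model. It assumes purely CPU-bound threads and ignores context-switch overhead, hardware threads and blocking, so it is only a rough estimate. The names are hypothetical.

```java
// Idealised CPU-sharing model: each CPU-bound thread gets at most one core,
// and threads share cores evenly once there are more threads than cores.
final class CpuShare {
    static double sharePerThread(int cores, int threads) {
        if (cores <= 0 || threads <= 0) {
            throw new IllegalArgumentException("cores and threads must be positive");
        }
        return Math.min(1.0, (double) cores / threads);
    }

    // Wall-clock time for each thread to complete cpuMillis of pure CPU work.
    static double wallMillis(double cpuMillis, int cores, int threads) {
        return cpuMillis / sharePerThread(cores, threads);
    }

    public static void main(String[] args) {
        System.out.println(wallMillis(100, 1, 1)); // 100.0: one thread, one core
        System.out.println(wallMillis(100, 1, 2)); // 200.0: two threads share one core
        System.out.println(wallMillis(100, 2, 2)); // 100.0: one core per thread
    }
}
```

Under this model, going from 10 to 40 CPU-bound threads on a machine with fewer than 10 cores would make each item take about four times as long, which is in line with the numbers in the question.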

It would be nice to see some source code.
Without it I can only make 4 guesses:
1) You haven't done any load balancing. You should think about the optimal number of threads.
2) The work executed by each thread does not justify the time needed to set up and start the thread (plus context switching time).
3) There are real problems with your code quality.
4) Weak hardware.
