Akka - ActorRef.tell() takes several minutes to deliver the message - java

I have two actors, each in a different ActorSystem. The first caches the ActorRef of the second. The first actor does:
actorRef.tell(msg, self())
and sends a message to the second actor, which does some processing and replies with
getSender().tell(reply, self())
Problem: the initial tell() from the first to the second actor sometimes takes 1-3 minutes(!) to deliver the message.
There are no other messages sent in Akka apart from this one, meaning that the mailboxes are empty and the system is serving a single request.
System details:
The application has 500 scheduled actors that poll Amazon SQS (which is empty) with a blocking request every 30 seconds. It has another 330 actors that do nothing in this scenario. All actors are configured with the default Akka dispatcher.
The box is an Amazon EC2 instance with 2 cores and 8 GB RAM. CPU and RAM utilization is <5%. The JVM has around 1000 threads.
My initial guess is CPU starvation and context switching from too many threads. BUT it is not reproducible locally on my 4-core i7 machine, even with 10x the number of actors, which uses 75% of the available RAM.
How can I actually find the cause of this problem? Is it possible to profile the Akka infrastructure to see why this message spends so much time in transit from one actor to another?

Context switching from too many threads was the probable source of this problem. To fix it, the following configuration was added:
actor {
  default-dispatcher {
    executor = "fork-join-executor"
    fork-join-executor {
      parallelism-min = 8
      parallelism-factor = 12.0
      parallelism-max = 64
      task-peeking-mode = "FIFO"
    }
  }
}
Thus, we increase the number of dispatcher threads per physical core from 3 to 12, i.e. from 6 to 24 threads on this 2-core box. 24 is enough for our application to run smoothly. No starvation was observed during regression tests.
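Not part of the original fix, but one way to confirm that the override actually took effect is to read the merged configuration back at runtime. A minimal sketch (the system name "app" is arbitrary):
import akka.actor.ActorSystem;
import com.typesafe.config.Config;

public class DispatcherCheck {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("app");
        // Read back the merged configuration the system actually uses.
        Config fj = system.settings().config()
                .getConfig("akka.actor.default-dispatcher.fork-join-executor");
        System.out.println("parallelism-factor = " + fj.getDouble("parallelism-factor"));
        System.out.println("parallelism-max    = " + fj.getInt("parallelism-max"));
        system.terminate();
    }
}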

Related

Configuring akka dispatcher for large amount of concurrent graphs

My current system has around 100 thousand running graphs. Each is built like this:
Amqp Source ~> Processing Stage ~> Sink
Each AMQP source receives messages at a rate of 1 per second. Only around 10 thousand graphs receive messages at any one time, so I figure there is no need for more than 10 thousand threads running concurrently.
These are the settings I'm currently using:
my-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 16
    parallelism-factor = 2.0
    parallelism-max = 32
  }
  throughput = 20
}
Obviously these settings do not provide enough resources for the desired performance, so I wonder:
Am I correct to assume that 10 thousand threads are enough?
Is it possible to configure the dispatcher (by editing application.conf) for that number of threads? What would the configuration look like? Should I pick "fork-join-executor" or "thread-pool-executor" as the executor?
Thanks.
Akka and Akka Streams are based on asynchronous execution: an actor or stream only uses a thread for a chunk of processing and then hands the thread back to the thread pool. This is nice because you can size the thread pool according to the number of cores you have to actually execute the threads, rather than the number of things you want to execute. Having many threads has an overhead, both in scheduling/switching and in that the JVM allocates a stack of somewhere around 0.5-1 MB per thread.
So 10 thousand actors or running streams can still execute fine on a small thread pool. Increasing the number of threads may slow the processing down rather than speed anything up, since more time is spent switching between threads. Even the default settings may be fine, and you should always benchmark when tuning to see whether the changes had the effect you expected.
Generally the fork-join pool gives good performance for actors and streams. The thread-pool-based one is good for use cases where you cannot avoid blocking (see this section of the docs: https://doc.akka.io/docs/akka/current/dispatchers.html#blocking-needs-careful-management).
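As a hedged illustration of that advice (the dispatcher name and the config entry it refers to are my own assumptions, not from the answer), blocking work can be pinned to a dedicated dispatcher so the default fork-join pool stays responsive:
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// Assumes application.conf defines a thread-pool-based dispatcher
// named "blocking-io-dispatcher" (the name is hypothetical).
public class BlockingWorker extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, msg -> {
                    Thread.sleep(2000); // stand-in for a blocking call (e.g. AMQP/HTTP)
                    getSender().tell("done: " + msg, getSelf());
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("app");
        // Pin this actor to the dedicated dispatcher; everything else
        // keeps running on the default fork-join dispatcher.
        ActorRef worker = system.actorOf(
                Props.create(BlockingWorker.class).withDispatcher("blocking-io-dispatcher"),
                "blockingWorker");
        worker.tell("job-1", ActorRef.noSender());
    }
}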

Optimal ThreadPoolSize for Spring Services

I am writing a Spring service which makes N calls to another service, stitches the responses together, and returns the result to the client. I am thinking that I should make the N calls in parallel, and once they all finish I can stitch the response.
I am creating a fixed thread pool to which I want to submit one task for each of the N calls. I have heard advice that the number of threads should be:
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1)
Would this be the right choice in this scenario? The majority of time in my app is spent waiting for the called service to respond for each of the items.
Right now I am using newCachedThreadPool, and clearly that was the wrong choice, since I am seeing my thread count climb into the tens of thousands. I am running on machines with this spec:
CPU Reserve (in mCores, where 1000 mCores = 1 Amazon vCPU = 1/2 physical core): 200
CPU Throttle Threshold (in Cores): 2
Memory Reserve (GB): 1
Memory Throttle Threshold (GB): 2
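For reference, here is a minimal sketch of the fan-out/stitch pattern described above; the pool size of 20 and the callService stand-in are illustrative assumptions, not a recommendation:
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class FanOutService {
    // Sized well above core count because the tasks mostly wait on I/O.
    private static final ExecutorService pool = Executors.newFixedThreadPool(20);

    // Stand-in for the real remote call, which mostly blocks.
    static String callService(String item) {
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response for " + item;
    }

    // Fire all N calls in parallel, then stitch the responses together.
    public static String stitch(List<String> items) {
        List<CompletableFuture<String>> futures = items.stream()
                .map(i -> CompletableFuture.supplyAsync(() -> callService(i), pool))
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.joining(", "));
    }
}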

Thread pool size should be much more than number of cores + 1

The recommended size for custom thread pools is number_of_cores + 1 (see here and here). So let's say there's a Spring app on a system with 2 cores, and the configuration is something like this:
<task:executor id="taskExecutor"
    pool-size="#{T(java.lang.Runtime).getRuntime().availableProcessors() + 1}" />
<task:annotation-driven executor="taskExecutor" />
In this case there will be one ExecutorService shared among several requests. So if 10 requests hit the server, only 3 of them can be executed simultaneously inside the ExecutorService. This can create a bottleneck, and the results will get worse with a larger number of requests (remember: by default Tomcat can handle up to 200 simultaneous requests = 200 threads). The app would perform much better without any pooling.
Usually one core can cope with more than one thread at a time.
e.g. I created a service which calls https://httpbin.org/delay/2 twice. Each call takes 2 seconds to execute. So if no thread pool is used, the service responds in 4.5 seconds on average (tested with 20 simultaneous requests).
If a thread pool is used, the response times vary depending on the pool size and hardware. I ran a test on a 4-core machine with different pool sizes. Below are the test results with min, max and average response times in milliseconds.
From the results, it can be concluded that the best average time was achieved with the largest pool size. The average time with 5 threads (4 cores + 1) in the pool is even worse than the result without pooling. So in my opinion, if a request does not take much CPU time, then it does not make sense to limit a thread pool to the number of cores + 1 in a web application.
Does anyone see anything wrong with setting a pool size of 20 (or even more) on a 2- or 4-core machine for web services that are not CPU demanding?
Both of the pages you link to explicitly say that this only applies to unblocked, CPU-bound tasks.
You have tasks that are blocked waiting on the remote machine, so this advice does not apply.
There is nothing wrong with ignoring advice that doesn't apply to you, so you can and should increase the thread pool size.
Good advice comes with a rationale so you can tell when it becomes bad advice. If you don't understand why something should be done, then you've fallen into the trap of cargo cult programming, and you'll keep doing it even when it's no longer necessary or even becomes deleterious.
Raymond Chen, on his blog "The Old New Thing"
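To put rough numbers on the "more threads for blocking work" intuition, here is a hedged sketch of a common sizing heuristic (popularized by Brian Goetz's Java Concurrency in Practice); the wait/compute times are made-up placeholders:
public class PoolSizing {
    public static void main(String[] args) {
        // Heuristic: threads ≈ cores * (1 + waitTime / computeTime).
        // The longer a task blocks relative to the CPU work it does,
        // the more threads each core can usefully keep in flight.
        int cores = Runtime.getRuntime().availableProcessors();
        double waitMillis = 2000;   // made-up: time spent blocked on the remote call
        double computeMillis = 50;  // made-up: actual CPU work per request
        int poolSize = (int) (cores * (1 + waitMillis / computeMillis));
        // With 4 cores: 4 * (1 + 40) = 164 threads, far more than cores + 1.
        System.out.println("suggested pool size = " + poolSize);
    }
}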

busy spin to reduce context switch latency (java)

In my application there are several services that process information on their own thread; when one is done, it posts a message to the next service, which then continues the work on its own thread. The handover of messages is done via a LinkedBlockingQueue. The handover normally takes 50-80 us (from putting a message on the queue until the consumer starts to process it).
To speed up the handover for the most important services, I wanted to use a busy spin instead of a blocking approach (I have 12 processor cores and want to dedicate 3 to these important services).
So.. I changed the LinkedBlockingQueue to a ConcurrentLinkedQueue
and did
for (;;) {
    Message m = queue.poll();    // non-blocking poll: busy spin
    if (m != null) {
        // ... process the message
    }
}
Now.. the result is that the first message pass takes 1 us, but then the latency increases over the next 25 handovers until it reaches 500 us; then the latency is suddenly back to 1 us and starts to increase again. So I have latency cycles of 25 iterations, where latency starts at 1 us and ends at 500 us. (Messages are passed approximately 100 times per second.)
With an average latency of 250 us, it is not exactly the performance gain I was looking for.
I also tried using the LMAX Disruptor ring buffer instead of the ConcurrentLinkedQueue. That framework has its own built-in busy-spin implementation and a quite different queue implementation, but the result was the same. So I'm quite certain that it's not the fault of the queue or of me misusing something..
Question is.. what the heck is going on here? Why am I seeing these strange latency cycles?
Cheers!!
As far as I know, the thread scheduler can deliberately pause a thread for a longer time if it detects that the thread is using the CPU quite intensively, to distribute CPU time between different threads more fairly. Try adding LockSupport.park() in the consumer after the queue is empty and LockSupport.unpark() in the producer after adding a message; it might make the latency less variable. Whether it will actually be better than the blocking queue is a big question though.
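A minimal sketch of what that suggestion could look like (the class and method names are mine; spurious wakeups from park() are harmless here because the loop simply re-polls):
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

public class ParkingHandover {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private volatile Thread consumer; // set once the consumer loop starts

    public void produce(String msg) {
        queue.offer(msg);
        Thread c = consumer;
        if (c != null) {
            LockSupport.unpark(c); // wake the consumer if it is parked
        }
    }

    public void consumeLoop() {
        consumer = Thread.currentThread();
        for (;;) {
            String m = queue.poll();
            if (m != null) {
                // ... process the message
            } else {
                LockSupport.park(); // yield the CPU instead of spinning
            }
        }
    }
}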
If you really need to do the job the way you described (and not the way Andrey Nudko suggested in his reply of Jan 5 at 13:22), then you definitely need to look at the problem from other viewpoints as well.
Just some hints:
Try checking your overall environment (outside the JVM). For example:
The OS CPU scheduler has a huge impact on this; currently the default is very likely
http://en.wikipedia.org/wiki/Completely_Fair_Scheduler
number of running processes, etc.
"problems" inside your JVM
garbage collector (try different one: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#1.1.%20Types%20of%20Collectors%7Coutline)
Try changing thread priorities: Setting priority to Java's threads
This is just wild speculation (since, as others have mentioned, you're not gathering any information on the queue length, failed polls, null polls, etc.):
I used the force and read the source of ConcurrentLinkedQueue, or rather, briefly leafed through it for a minute or two. The polling is not quite your trivial O(1) operation. It might be the case that you're traversing more than a few nodes which have become stale, holding null; and there might be additional transitory states involving nodes linking to themselves as the next node, as an indication of staleness/removal from the queue. It may be that the queue is starting to build up garbage due to thread scheduling. Try following the links to the abstract algorithm mentioned in the code:
Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms by Maged M. Michael and Michael L. Scott (the link has a PDF and pseudocode).
Here are my 2 cents. If you are running on Linux/Unix-based systems, there is a way to dedicate a certain CPU to a certain thread. In essence, you can make the OS ignore that CPU for any scheduling. Check out CPU isolation (e.g. the Linux isolcpus kernel parameter).

Quartz Performance

It seems there is a limit on the number of jobs that the Quartz scheduler can run per second. In our scenario we have about 20 jobs per second firing 24x7, and Quartz worked well up to 10 jobs per second (with 100 Quartz threads and a 100-connection database pool for a JDBC-backed JobStore). However, when we increased it to 20 jobs per second, Quartz became very, very slow and its triggered jobs fired very late compared to their actual scheduled time, causing many, many misfires and eventually slowing down the overall performance of the system significantly. One interesting fact is that JobExecutionContext.getScheduledFireTime().getTime() for such delayed triggers comes out 10-20 or even more minutes after their scheduled time.
How many jobs can the Quartz scheduler run per second without affecting the scheduled times of the jobs, and what would be the optimal number of Quartz threads for such a load?
Or am I missing something here?
Details about what we want to achieve:
We have almost 10k items (categorized among 2 or more categories; in the current case we have 2 categories) on which we need to do some processing at a given frequency, e.g. every 15, 30, or 60 minutes, and these items should be processed within that frequency with a given per-minute throttle. For example, at a 60-minute frequency, 5k items per category should be processed with a throttle of 500 items per minute. Ideally, these items should then be processed within the first 10 (5000/500) minutes of each hour, with 500 items per minute distributed evenly across the seconds of the minute, so we would have around 8-9 items per second for one category.
Now, to achieve this, we use Quartz as the scheduler, which triggers jobs for processing these items. However, we don't process each item within the Job.execute method, because that would take 5-50 seconds per item (averaging 30 seconds), as the processing involves a web service call. Instead we push a message for each item onto a JMS queue, and separate server machines process those jobs. I have noticed that the time taken by the Job.execute method is no more than 30 milliseconds.
Server Details:
A Solaris SPARC 64-bit server with an 8-core/16-thread CPU and 16 GB RAM for the scheduler; we have two such machines in the scheduler cluster.
In a previous project I was confronted with the same problem. In our case, Quartz performed well up to a granularity of a second; sub-second scheduling was a stretch and, as you are observing, misfires happened often and the system became unreliable.
We solved this issue by creating 2 levels of scheduling: Quartz would schedule a job 'set' of n consecutive jobs. With a clustered Quartz, this means that a given server in the system gets this job 'set' to execute. The n tasks in the set are then taken in by a "micro-scheduler": basically a timing facility that uses the native JDK API to further time the jobs down to a 10 ms granularity.
To handle the individual jobs, we used a master-worker design, where the master took care of the scheduled delivery (throttling) of the jobs to a multi-threaded pool of workers.
If I had to do this again today, I'd rely on a ScheduledThreadPoolExecutor to manage the 'micro-scheduling'. For your case it would look something like this:
import java.util.Set;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
...
ScheduledThreadPoolExecutor scheduledExecutor;
...
scheduledExecutor = new ScheduledThreadPoolExecutor(THREAD_POOL_SIZE);
...
// Evenly spread the execution of a set of tasks over a period of time.
// Task is assumed to implement Runnable.
public void schedule(Set<Task> taskSet, long timePeriod, TimeUnit timeUnit) {
    if (taskSet.isEmpty()) return; // or indicate some failure ...
    long period = TimeUnit.MILLISECONDS.convert(timePeriod, timeUnit);
    long delay = period / taskSet.size();
    long accumulativeDelay = 0;
    for (Task task : taskSet) {
        scheduledExecutor.schedule(task, accumulativeDelay, TimeUnit.MILLISECONDS);
        accumulativeDelay += delay;
    }
}
This gives you a general idea of how to use the JDK facility to micro-schedule tasks. (Disclaimer: you need to make this robust for a production environment, e.g. check failing tasks, manage retries (if supported), etc.)
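Hypothetical usage for the throughput described in the question (the loadTasksForCurrentMinute helper is made up): spreading one category's 500 items over a minute schedules a task roughly every 120 ms (60000 ms / 500).
Set<Task> taskSet = loadTasksForCurrentMinute(); // hypothetical helper
schedule(taskSet, 1, TimeUnit.MINUTES); // one task roughly every 120 ms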
With some testing and tuning, we found an optimal balance between the Quartz jobs and the number of jobs in one scheduled set.
We experienced a 100x throughput improvement this way. Network bandwidth was our actual limit.
First of all, check "How do I improve the performance of JDBC-JobStore?" in the Quartz documentation.
As you can probably guess, there is no absolute value or definitive metric here. It all depends on your setup. However, here are a few hints:
20 jobs per second means around 100 database queries per second, including updates and locking. That's quite a lot!
Consider distributing your Quartz setup into a cluster. However, if the database is the bottleneck, that won't help you. Maybe TerracottaJobStore will come to the rescue?
With K cores in the system, anything less than K threads will underutilize your system. If your jobs are CPU-intensive, K threads is fine. If they are calling external web services, blocking, or sleeping, consider much bigger values. However, more than 100-200 threads will significantly slow down your system due to context switching.
Have you tried profiling? What is your machine doing most of the time? Can you post a thread dump? I suspect poor database performance rather than CPU, but it depends on your use case.
You should limit your number of threads to somewhere between n and n*3, where n is the number of processors available. Spinning up more threads is going to cause a lot of context switching, since most of them will be blocked most of the time.
As far as jobs per second, it really depends on how long the jobs run and how often they're blocked by operations like network and disk I/O.
Also, something to consider is that perhaps Quartz isn't the tool you need. If you're sending off 1-2 million jobs a day, you might want to look into a custom solution. What are you even doing with 2 million jobs a day?!
Another option, which is a really bad way to approach the problem but sometimes works: what is the server it's running on? Is it an older server? Bumping up the RAM or other specs on it might give you some extra 'umph'. Not the best solution, for sure, because it delays the problem rather than addressing it, but if you're in a crunch it might help.
In situations with a high number of jobs per second, make sure your SQL database uses row locking and not table locking. In MySQL this is done by using the InnoDB storage engine rather than the default MyISAM storage engine, which only supports table locks.
Fundamentally, the approach of doing one item at a time is doomed and inefficient when you're dealing with such a large number of things to do within such a short time. You need to group things: the suggested approach of using a job set that then micro-schedules each individual job is a first step, but it still means doing a whole lot of almost nothing per job. Better would be to improve your web service so you can tell it to process N items at a time, and then invoke it with sets of items to process (see the sketch below). And even better is to avoid doing this sort of thing via web services at all and to process them inside the database, as sets, which is what databases are good for. Any sort of job that processes one item at a time is fundamentally an unscalable design.
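As a sketch of that batching idea (the Batcher class, the partition helper, and any particular batch size are hypothetical, not from the answer):
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split the items into fixed-size batches so each JMS message
    // (and each downstream web service call) carries N items at once.
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }
}
Each resulting batch would then be sent as a single JMS message instead of one message per item.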
