I am writing a Spring service which makes N calls to another service, stitches the responses together, and returns the result to the client. I am thinking that I should make the N calls in parallel and, once they all finish, stitch the responses.
I am creating a new fixed thread pool to which I want to submit one task for each of the N calls. I am hearing advice that the number of threads should be:
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1)
Would this be the right choice in this scenario? The majority of the time in my app is spent waiting for the downstream service to respond for each of the items.
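For illustration, a minimal sketch of the fan-out-and-stitch approach described above might look like the following; the pool size of 50, the class name, and the callDownstream() method are assumptions made for the example, not details from the original setup:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class StitchingService {

    // For IO-bound calls the pool can be much larger than the core count,
    // but it is still bounded (unlike a cached pool). 50 is an arbitrary example.
    private final ExecutorService pool = Executors.newFixedThreadPool(50);

    public List<String> fetchAll(List<String> ids) {
        List<CompletableFuture<String>> futures = ids.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> callDownstream(id), pool))
                .collect(Collectors.toList());

        // Wait for all N calls to finish, then stitch the responses together.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    private String callDownstream(String id) {
        // blocking HTTP call to the downstream service goes here
        return "response-for-" + id;
    }
}

The key point is that the pool is bounded, so the thread count no longer grows with the request volume the way it does with a cached pool.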
Right now I am using a cached thread pool (Executors.newCachedThreadPool()), and that was clearly the wrong choice, since I am seeing the thread count climb into the tens of thousands. I am running on machines with this spec:
CPU Reserve: 200 mCores (1000 mCores = 1 Amazon vCPU = 1/2 physical core)
CPU Throttle Threshold: 2 cores
Memory Reserve: 1 GB
Memory Throttle Threshold: 2 GB
I want to hit my application with 200,000 requests in an hour. I am using JMeter to perform this testing.
I am doing this test on a local spring boot application.
The result I am looking for is to configure the JMeter thread group for a throughput of 55 hits per second.
How should I configure the thread group to get 200,000 in an hour?
So far I've tried -
First Approach
i) Reduced my requirement to 10,000 requests in 180 seconds, which is a smaller proportion of my original target.
ii) Number of threads - 10,000
iii) Ramp-up period - 180 seconds
iv) This works fine and I get the desired throughput of 55/sec, but it breaks down as I scale up to 50,000 threads in 900 seconds (15 minutes): I still get the desired throughput, but JMeter and my system become extremely slow and I also see a 6% error rate in the responses. I believe this is not the correct approach when I want to reach 200k and more.
Second Approach
i) I found this solution to put in a Constant throughput timer, which I did.
ii) Number of threads - 10
iii) Ramp-up period - 900 seconds
iv) Infinite loop
v) Target throughput (per minute) - 3300 (i.e. 55/sec × 60)
vi) Calculate throughput based on - all active threads.
Although I had configured the ramp-up to be 15 minutes, it seems to take more than 40 minutes to reach a throughput of 55/sec due to the Constant Throughput Timer. I found no errors in the responses with this approach.
The easiest option is to go for the Precise Throughput Timer.
However, you need to understand two things:
Timers can only pause JMeter samplers in order to limit JMeter's throughput to the desired value, so you need to supply a sufficient number of threads in the Thread Group. For instance, 200k requests per hour is more or less 55 requests per second; if your application's response time is 1 second you will need 55 threads, if it's 2 seconds you will need 110 threads, and so on (see the sketch after these two points). The most convenient option is the combination of the Throughput Shaping Timer and the Concurrency Thread Group; they can be connected together via the Feedback Function so that JMeter is able to kick off more threads if the current number is not sufficient to produce the required load.
The system under test must be capable of processing that many requests per hour; if it isn't, you won't be able to achieve this whatever you do on the JMeter side.
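To make the first point concrete, here is a small sketch of the sizing arithmetic (Little's law: threads ≈ target throughput × response time). The response time value is an assumption for illustration; the numbers simply restate the example above:

public class ThreadEstimate {
    public static void main(String[] args) {
        double targetPerHour = 200_000;
        double targetPerSecond = targetPerHour / 3600;      // ~55.5 requests per second

        double responseTimeSeconds = 1.0;                   // assumed average response time
        long threadsNeeded = (long) Math.ceil(targetPerSecond * responseTimeSeconds);

        System.out.printf("%.1f req/s x %.1f s per request -> %d threads%n",
                targetPerSecond, responseTimeSeconds, threadsNeeded);
        // With a 2 second response time the same formula gives roughly 110 threads.
    }
}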
My current system has around 100 thousand running graphs. Each is built like this:
Amqp Source ~> Processing Stage ~> Sink
Each AMQP source receives messages at a rate of 1 per second. Only around 10 thousand graphs receive messages at any one time, so I've figured there is no need for more than 10 thousand threads running concurrently.
These are the settings I'm currently using:
my-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 16
    parallelism-factor = 2.0
    parallelism-max = 32
  }
  throughput = 20
}
Obviously these settings do not provide enough resources for the desired performance, so I wonder:
Am I correct to assume that 10 thousand threads are enough?
Is it possible to configure the dispatcher (by editing application.conf) for that number of threads? What would the configuration look like? Should I pick "fork-join-executor" or "thread-pool-executor" as the executor?
Thanks.
Akka and Akka Streams are based on asynchronous execution: an actor or stream only uses a thread for a chunk of processing and then hands the thread back to the thread pool. This is nice because you can size the thread pool according to the number of cores you have to actually execute the threads, rather than the number of things you want to execute. Having many threads carries an overhead, both in scheduling/context switching and in that the JVM allocates a stack of somewhere around 0.5-1 MB per thread.
So 10 thousand actors or running streams can still execute fine on a small thread pool. Increasing the number of threads may slow the processing down rather than make anything faster, as more time is spent switching between threads. Even the default settings may be fine, and you should always benchmark when tuning to see whether the changes had the effect you expected.
Generally the fork-join pool gives good performance for actors and streams. The thread-pool-based one is good for use cases where you cannot avoid blocking (see this section of the docs: https://doc.akka.io/docs/akka/current/dispatchers.html#blocking-needs-careful-management).
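As a rough illustration of that last point (the dispatcher name "blocking-io-dispatcher" and the blocking call are invented for the example, and such a dispatcher would have to be defined in application.conf with executor = "thread-pool-executor"), blocking work can be pushed onto a dedicated thread-pool dispatcher so the default fork-join pool stays free for actors and streams:

import akka.actor.ActorSystem;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

public class BlockingOffload {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("example");

        // An Akka MessageDispatcher implements java.util.concurrent.Executor,
        // so it can back a CompletableFuture directly.
        Executor blockingDispatcher = system.dispatchers().lookup("blocking-io-dispatcher");

        CompletableFuture
                .supplyAsync(BlockingOffload::blockingCall, blockingDispatcher)
                .thenAccept(result -> System.out.println("Got: " + result));
    }

    private static String blockingCall() {
        // e.g. a JDBC or AMQP call that cannot be made asynchronous
        return "result";
    }
}

This keeps the default dispatcher small and responsive while the unavoidable blocking calls are isolated on their own bounded pool.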
How does AWS Lambda serve multiple requests?
I want to know: is it a multi-threaded kind of model here as well?
If I am calling a Lambda from an API Gateway and there are 1000 requests to the API in 10 seconds, how many containers will be created and how many threads?
How does AWS Lambda serve multiple requests?
Independently.
I want to know: is it a multi-threaded kind of model here as well?
No, it is not a multi-threaded model in the sense that you are asking.
Your code can, of course, be written to use multiple threads and/or child processes to accomplish whatever purpose it is intended to accomplish for one invocation, but Lambda doesn't send more than one invocation at a time to the same container. The container is not used for a second invocation until the first one finishes. If a second request arrives while a first one is running, the second one will run in a different container.
If I am calling a Lambda from an API Gateway and there are 1000 requests to the API in 10 seconds, how many containers will be created and how many threads?
As many containers will be created as are needed to process each of the arriving requests in its own container.
The duration of each invocation will be the largest determinant of this.
1000 very quick requests in 10 seconds are roughly equivalent to 100 requests in 1 second. Assuming each request finishes in less than 1 second and arrival times are evenly-distributed, you could expect fewer than 100 containers to be created.
On the other hand, if 1000 requests arrived in 10 seconds and each request took 30 seconds to complete, you would have 1000 containers in existence during this event.
After a spike in traffic inflates the number of containers, they will all tend to linger for a few minutes, ready to handle the additional load if it arrives, and then Lambda will start terminating them.
AWS Lambda serves multiple requests by scaling horizontally across multiple containers. Lambda supports up to 1000 concurrent executions per account by default.
there are 1000 requests in 10 secs to the API. How many containers will be created and how many threads.
Requests per second = 1000/10 = 100
There will be around 100 parallel Lambda executions, assuming each execution takes about 1 second to complete.
Note: you can also spawn multiple threads, but it's difficult to predict the performance gain.
Also keep in mind that having multiple threads is not always efficient. The CPU available to your Lambda function is shared between all threads and processes your Lambda function creates. Generally you will not get more CPU in a Lambda function by running work in parallel among multiple threads. Your code in this case isn't actually running on two cores, but on two "hyperthreads" on a single core; depending on the workload, this may be better or worse than a single thread. The service team is looking at ways to better leverage multiple cores in the Lambda execution environment, and we will take your feedback as a +1 for that feature.
Reference: AWS Forum Post
For further details on concurrent executions of Lambda, refer to the AWS documentation.
There are a few angles to discuss.
AWS Lambda does support handling requests in parallel, but any single instance / container of a Lambda will only process one request at a time. If all existing instances are busy then new ones will be provisioned (depending on concurrency settings, discussed below).
Within a single Lambda instance multi-threading is supported, but still only one request will be handled per instance. In practice parallelization is rarely beneficial in Lambda: it adds significant overhead and is best used for processing very large data sets. Additionally, a Lambda needs more than one virtual core for it to have any benefit. Cores are allocated by raising the memory setting; many Lambdas run with a memory setting low enough to have just one core.
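To make the "multi-threading inside a single invocation" point concrete, a rough Java sketch might look like the following; the handler class and the two fetch methods are invented for illustration, and both tasks still serve one request:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutHandler implements RequestHandler<Map<String, String>, String> {

    @Override
    public String handleRequest(Map<String, String> input, Context context) {
        // These threads parallelize work *within* one invocation; Lambda still
        // never sends a second request to this container while handleRequest runs.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> a = CompletableFuture.supplyAsync(this::fetchPartA, pool);
            CompletableFuture<String> b = CompletableFuture.supplyAsync(this::fetchPartB, pool);
            return a.join() + " " + b.join();
        } finally {
            pool.shutdown();
        }
    }

    private String fetchPartA() { return "partA"; } // placeholder work
    private String fetchPartB() { return "partB"; } // placeholder work
}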
Determining exactly how many containers / instances are created isn't always possible due to there being many factors:
Lambda will reuse any existing, paused, instances
Existing instances are often very fast to handle requests; a small number of warm instances can process many, many requests in the time it takes to provision new instances (especially with runtimes like Java or .NET Core, which often have startup times of 1+ seconds)
The concurrency settings of your Lambda are a significant factor
If you have Reserved Concurrency of X, you will never have more than X instances
If you have unreserved concurrency, then the limit is based on available concurrency. This defaults to 1000 instances per account, so if 990 instances of any Lambdas already exist then only 10 could be created
If you have provisioned concurrency then you will always have a minimum number of instances, reducing cold-starts
But, to try to answer your story problem, let's assume you are sending your 1000 requests at a steady pace over the 10 minutes. That's one request every 600 milliseconds. Let's also assume your Java app is given a fairly high memory allocation, and its initialization is relatively quick -- let's say 1 second for a cold start. Once the cold start is complete invocation is fast -- let's say 10ms. And, let's assume there are no instances when the traffic begins.
The first request will see a response time of ~1,010ms -- 1 second for a cold start, and 10ms for handling the request. A second request will arrive while the first is still processing, so it's likely that Lambda will provision a second instance, and the second request will see a similar response time.
By the time the third request comes in (1800ms after the start) both instances are now idle and can be reused--so this request will not experience a cold start, and the response time will be 10ms. From this point forward it's likely that no additional instances are needed--but this all assumes a steady rate of requests.
But--changing any variable can have a big impact.
I have two actors, each in a different ActorSystem. The first caches the ActorRef of the second. The first actor does:
actorRef.tell(msg, self())
and sends a message to the second actor, which does some processing and replies with
getSender().tell(reply, self())
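For reference, a stripped-down sketch of the replying side described above could look roughly like this (the actor class, message type, and process() method are placeholders, not the original code):

import akka.actor.AbstractActor;

// Second actor: does some processing and replies to whoever sent the message.
public class SecondActor extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, msg -> {
                    String reply = process(msg);           // placeholder processing
                    getSender().tell(reply, getSelf());    // reply to the first actor
                })
                .build();
    }

    private String process(String msg) {
        return "processed:" + msg;
    }
}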
Problem: the initial tell() from the first to the second actor sometimes takes 1-3 minutes(!) to deliver the message.
There are no other messages being sent in Akka apart from this one, meaning that the mailboxes are empty - the system is serving a single request.
System details:
The application has 500 scheduled actors that poll Amazon SQS (which is empty) every 30 seconds with a blocking request. It has another 330 actors that do nothing in my scenario. All actors are configured with the default Akka dispatcher.
The box is an Amazon EC2 instance with 2 cores and 8 GB RAM. CPU and RAM utilization is <5%. The JVM has around 1000 threads.
My initial guess is CPU starvation and context switching from too many threads, BUT it is not reproducible locally on my i7 machine with 4 cores, even with 10x the number of actors, which uses 75% of the available RAM.
How can I actually find the cause of this problem? Is it possible to profile the Akka infrastructure to see why this message spends so much time in transit from one actor to another?
Context switching from too many threads was a probable source of this problem. To fix it, the following configuration was added:
actor {
  default-dispatcher {
    executor = "fork-join-executor"
    fork-join-executor {
      parallelism-min = 8
      parallelism-factor = 12.0
      parallelism-max = 64
      task-peeking-mode = "FIFO"
    }
  }
}
Thus, we increased the parallelism factor from the default to 12.0 threads per physical core, raising the dispatcher's pool from roughly 6 to 24 threads on our 2-core box. 24 is enough for our application to run smoothly; no starvation was observed during regression tests.
The recommended size for custom thread pools is number_of_cores + 1 (see here and here). So let's say there's a Spring app on a system with 2 cores and the configuration is something like this:
<task:executor id="taskExecutor"
pool-size="#{T(java.lang.Runtime).getRuntime().availableProcessors() + 1}" />
<task:annotation-driven executor="taskExecutor" />
In this case there will be one ExecutorService shared among several requests. So if 10 requests hit the server,
only 3 of them can be executed simultaneously inside the ExecutorService. This can create a bottleneck, and the results will get worse with a larger number of requests (remember: by default Tomcat can handle up to 200 simultaneous requests = 200 threads). The app would perform much better without any pooling.
Usually one core can cope with more than one thread at a time.
For example, I created a service which calls https://httpbin.org/delay/2 two times. Each call takes 2 seconds to execute. So if no thread pool is used, the service responds in 4.5 seconds on average (tested with 20 simultaneous requests).
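The pooled variant of that test service might look roughly like this sketch (the class names, the endpoint, and the use of @Async with a RestTemplate are my reconstruction, not the original code; both classes are shown together for brevity):

import java.util.concurrent.CompletableFuture;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@Service
class DelayClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // Runs on the "taskExecutor" pool configured above via <task:annotation-driven>.
    @Async
    public CompletableFuture<String> callDelay() {
        return CompletableFuture.completedFuture(
                restTemplate.getForObject("https://httpbin.org/delay/2", String.class));
    }
}

@RestController
public class DelayController {

    private final DelayClient client;

    public DelayController(DelayClient client) {
        this.client = client;
    }

    // Two ~2 second calls in parallel; total time depends on how many
    // pool threads are free when the request arrives.
    @GetMapping("/combined")
    public String combined() {
        CompletableFuture<String> first = client.callDelay();
        CompletableFuture<String> second = client.callDelay();
        return first.join() + second.join();
    }
}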
If a thread pool is used, the response times vary depending on the pool size and hardware. I ran a test on a 4-core machine with different pool sizes. Below are the test results with min, max and average response times in milliseconds.
From the results, it can be concluded that the best average time was achieved with the largest pool size. The average time with 5 (4 cores + 1) threads in the pool is even worse than the result without pooling. So in my opinion, if a request does not take much CPU time, then it does not make sense to limit a thread pool to the number of cores + 1 in a web application.
Does anyone see anything wrong with setting the pool size to 20 (or even more) on a 2- or 4-core machine for web services that are not CPU demanding?
Both of the pages you link to explicitly say that this rule only applies to unblocked, CPU-bound tasks.
You have tasks that are blocked waiting on the remote machine, so this advice does not apply.
There is nothing wrong with ignoring advice that doesn't apply to you, so you can and should increase the thread pool size.
Good advice comes with a rationale so you can tell when it becomes bad advice. If you don't understand why something should be done, then you've fallen into the trap of cargo cult programming, and you'll keep doing it even when it's no longer necessary or even becomes deleterious.
Raymond Chen, on his blog "The Old New Thing"