How does AWS Lambda serve multiple requests? - java

How does AWS Lambda serve multiple requests?
I want to know is it a multi-thread kind of a model here as well?
If I am calling a Lambda from an API gateway. And there are 1000 requests in 10 secs to the API. How many containers will be created and how many threads.

How does AWS Lambda serve multiple requests?
I want to know is it a multi-thread kind of a model here as well?
No, it is not a multi-threaded model in the sense that you are asking.
Your code can, of course, be written to use multiple threads and/or child processes to accomplish whatever purpose it is intended to accomplish for one invocation, but Lambda doesn't send more than one invocation at a time to the same container. The container is not used for a second invocation until the first one finishes. If a second request arrives while a first one is running, the second one will run in a different container.
If I am calling a Lambda from an API gateway. And there are 1000 requests in 10 secs to the API. How many containers will be created and how many threads?
As many containers will be created as are needed to process each of the arriving requests in its own container.
The duration of each invocation will be the largest determinant of this.
1000 very quick requests in 10 seconds are roughly equivalent to 100 requests in 1 second. Assuming each request finishes in less than 1 second and arrival times are evenly-distributed, you could expect fewer than 100 containers to be created.
On the other hand, if 1000 requests arrived in 10 seconds and each request took 30 seconds to complete, you would have 1000 containers in existence during this event.
After a spike in traffic inflates the number of containers, they will all tend to linger for a few minutes, ready to handle the additional load if it arrives, and then Lambda will start terminating them.
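The reasoning above (arrival rate multiplied by invocation duration) can be sketched as a small estimate. This is a rough model only: it assumes evenly spaced arrivals and one request per container at a time, and the class and method names are made up for illustration.

```java
// Rough estimate of peak concurrent containers, assuming evenly spaced arrivals
// and one request per container at a time (as described above).
final class LambdaConcurrencyEstimate {

    /**
     * @param requests        total number of requests in the burst
     * @param windowSeconds   time span over which they arrive
     * @param durationSeconds how long one invocation runs
     */
    static long peakContainers(long requests, double windowSeconds, double durationSeconds) {
        double arrivalsPerSecond = requests / windowSeconds;
        // Little's law: average in-flight work = arrival rate * time in system.
        long inFlight = (long) Math.ceil(arrivalsPerSecond * durationSeconds);
        // Never fewer than one container, never more than there are requests.
        return Math.max(1, Math.min(requests, inFlight));
    }

    public static void main(String[] args) {
        System.out.println(peakContainers(1000, 10, 0.5)); // 50
        System.out.println(peakContainers(1000, 10, 30));  // 1000
    }
}
```

Real traffic is rarely evenly distributed, so treat the result as an order-of-magnitude figure rather than a prediction.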

AWS Lambda serves multiple requests by scaling horizontally across multiple containers. Lambda supports up to 1000 parallel container executions by default.
there are 1000 requests in 10 secs to the API. How many containers will be created and how many threads.
Requests per second = 1000/10 = 100
There will be about 100 parallel Lambda executions, assuming each execution takes about 1 second to complete; longer executions need proportionally more.
Note: You can also spawn multiple threads, but it's difficult to predict the performance gain.
Also keep in mind that, having multiple threads is not always
efficient. The CPU available to your Lambda function is shared between
all threads and processes your Lambda function creates. Generally you
will not get more CPU in a Lambda function by running work in parallel
among multiple threads. Your code in this case isn’t actually running
on two cores, but on two “hyperthreads” on a single core; depending on
the workload, this may be better or worse than a single thread. The
service team is looking at ways to better leverage multiple cores in
the Lambda execution environment, and we will take your feedback as a
+1 for that feature.
Reference: AWS Forum Post
For further details on concurrent executions of Lambda, refer to the AWS documentation.

There are a few angles to discuss.
AWS Lambda does support handling requests in parallel, but any single instance / container of a Lambda will only process one request at a time. If all existing instances are busy then new ones will be provisioned (depending on concurrency settings, discussed below).
Within a single Lambda instance multi-threading is supported, but still only one request will be handled per instance. In practice parallelization is rarely beneficial in Lambda: it adds significant overhead and is best reserved for processing very large data sets. Additionally, a Lambda needs more than one virtual core for it to have any benefit. Cores are allocated by raising the memory setting, and many Lambdas run with a memory setting low enough to have just one core.
Determining exactly how many containers / instances are created isn't always possible due to there being many factors:
Lambda will reuse any existing, paused, instances
Existing instances are often very fast to handle requests, a small number of warm instances can process many, many requests in the time it takes to provision new instances (especially with runtimes like Java or .NET Core, which often have startup times of 1+ seconds)
The concurrency settings of your Lambda are a significant factor
If you have Reserved Concurrency of X, you will never have more than X instances
If you have unreserved concurrency, then the limit is based on available concurrency. This defaults to 1000 instances per account, so if 990 instances of any Lambdas already exist then only 10 could be created
If you have provisioned concurrency then you will always have a minimum number of instances, reducing cold-starts
But, to try to answer your story problem, let's assume you are sending your 1000 requests at a steady pace over the 10 minutes. That's one request every 600 milliseconds. Let's also assume your Java app is given a fairly high memory allocation, and its initialization is relatively quick -- let's say 1 second for a cold start. Once the cold start is complete invocation is fast -- let's say 10ms. And, let's assume there are no instances when the traffic begins.
The first request will see a response time of ~1,010ms -- 1 second for a cold start, and 10ms for handling the request. A second request will arrive while the first is still processing, so it's likely that Lambda will provision a second instance, and the second request will see a similar response time.
By the time the third request comes in (1,200ms after the first), the first instance has finished its work (a cold start plus one invocation takes ~1,010ms) and is idle, so it can be reused: this request will not experience a cold start, and the response time will be 10ms. From this point forward it's likely that no additional instances are needed, but this all assumes a steady rate of requests.
But--changing any variable can have a big impact.
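To see how sensitive the outcome is to those variables, the walk-through can be turned into a tiny simulation. This is a simplified model, not how Lambda actually schedules: it assumes the first request arrives at time 0, arrivals are perfectly evenly spaced, and an idle instance is always reused. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class WarmInstanceSimulation {

    /** Returns how many instances get provisioned for evenly spaced requests. */
    static int instancesNeeded(int requests, long intervalMs, long coldStartMs, long handleMs) {
        List<Long> busyUntil = new ArrayList<>(); // per instance: time at which it becomes idle
        for (int i = 0; i < requests; i++) {
            long arrival = i * intervalMs;
            int idle = -1;
            for (int j = 0; j < busyUntil.size(); j++) {
                if (busyUntil.get(j) <= arrival) { idle = j; break; }
            }
            if (idle >= 0) {
                busyUntil.set(idle, arrival + handleMs);         // warm instance: no cold start
            } else {
                busyUntil.add(arrival + coldStartMs + handleMs); // cold start on a new instance
            }
        }
        return busyUntil.size();
    }

    public static void main(String[] args) {
        // 1000 requests, one every 600ms, 1s cold start, 10ms per invocation.
        System.out.println(instancesNeeded(1000, 600, 1000, 10)); // 2
    }
}
```

Shrinking the interval or lengthening the cold start in this model quickly increases the instance count, which is the point of the caveat above.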


Is there an upper limit on the number of Publishers or Subscribers you can have in a Reactor Java Application?

In a Reactor Java based application we free up threads by creating Publishers to be subscribed on later.
Can I create unlimited number of Publishers to scale my Application? What are the constraints for creating Publishers.
You can create as many publishers as you like - nothing will happen until you subscribe (for a standard cold publisher at least), so until that point they're just like any other POJO. Obviously you have to have enough heap space, but that's about it.
If you're asking about the upper limit of active publishers (i.e. once you've subscribed) - that's a little more complex. In a traditional thread-per-request model the upper limit is generally the number of threads you can context switch between (often a few thousand). Reactive frameworks essentially remove that limit by multiplexing all subscriptions over a small, fixed number of event-loop threads, which means the practical limit is often given by one of:
I/O bandwidth. If you have so many requests you don't have the network bandwidth to respond to them all in a timely fashion (or at all), then this is your limit.
CPU cycles. If you have plenty of network bandwidth, but your publishers are performing CPU intensive operations, then this is likely to be what you run into first - while you no longer have the overhead of context switching, it still takes a certain amount of CPU cycles to process each request, and you only have a finite resource here.
In reality though, both of the above differ per setup - so the only real way to know what limit you'll run into, and when you'll run into it, is to benchmark your application & hardware to find out.

Schedulers.elastic() not using early generated threads

I've written a web server on Spring Boot 2.0, which uses Netty under the hood.
This application uses Schedulers.elastic() threads.
When it started, about 100 elastic threads were created. These threads were rarely used and we had little load. But after a working day, the number of threads in the elastic pool had increased to 1300. Execution now happens on threads such as elastic-1XXX and elastic-12XX (the numbers in the names are above 100 and even above 900).
Elastic, as I understand it, uses cachedThreadPool under the hood.
Why have new elastic threads been created, and why have tasks switched to the new threads?
What are the criteria for adding new threads?
And why haven't old threads (elastic-XX, elastic-1xx) been shutdown?
Without more information about the type of workload, the maximum concurrency, burst and average task duration, it’s really hard to tell if there’s a problem here.
By definition the elastic scheduler is creating an unbounded number of threads, as long as new tasks are added to the queue.
If the workload is bursty, with high concurrency at regular times, then it’s not unexpected to find a large number of threads. You could leverage the newElastic variants to reduce the TTL (default is 60s).
Again, without more information it’s hard to tell but your workload might not fit this scheduler. If your workload is CPU bound, the parallel scheduler is a better fit. The elastic one is tailored for IO/latency bound tasks.
The problem was that I was using Schedulers.elastic() even though there were no blocking operations to offload. When I removed elastic(), my service started working correctly (without the elastic threads).

Java concurrency based on available FREE cpu

How do I scale to use more threads if and only if there is free cpu?
Something like a ThreadPoolExecutor that uses more threads when cpu cores are idle, and less or just one if not.
Current situation:
My Java server app processes requests and serves results.
There is a ThreadPoolExecutor to serve the requests with a reasonable number of max threads following the principle: number of cpu cores = number of max threads.
The work performed is cpu heavy, and there's some disk IO (DBs).
The code is linear, single threaded.
A single request takes between 50 and 500 ms to process.
Sometimes there are just a few requests per minute, and other times there are 30 simultaneous.
A modern server with 12 cores handles the load nicely.
The throughput is good, the latency is ok.
Desired improvement:
When there is a low number of requests, as is the case most of the time, many cpu cores are idle.
Latency could be improved in this case by running some of the code for a single request multi-threaded.
Some prototyping shows improvements, but as soon as I test with a higher number of concurrent requests,
the server goes bananas. Throughput goes down, memory consumption goes overboard.
30 simultaneous requests sharing a queue of 10 meaning that 10 can run at most while 20 are waiting,
and each of the 10 uses up to 8 threads at once for parallelism, seems to be too much for a machine
with 12 cores (out of which 6 are virtual).
This seems to me like a common use case, yet I could not find information by searching.
1) request counting
One idea is to count the current number of processed requests. If 1 or low then do more parallelism,
if high then don't do any and continue single-threaded as before.
This sounds simple to implement. Drawbacks are: request counter resetting must not contain bugs,
think finally. And it does not actually check available cpu, maybe another process uses cpu also.
In my case the machine is dedicated to just this application, but still.
2) actual cpu querying
I'd think that the correct approach would be to just ask the cpu, and then decide.
Since Java 7 there is OperatingSystemMXBean.getSystemCpuLoad(),
but I can't find any webpage that mentions getSystemCpuLoad and ThreadPoolExecutor, or a similar
combination of keywords, which tells me that's not a good path to go.
The JavaDoc says "Returns the "recent cpu usage" for the whole system", and I'm wondering what
"recent cpu usage" means, how recent that is, and how expensive that call is.
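For reference, idea 1 (request counting) could be sketched roughly as follows. The class name, the `cores / active` heuristic and the handler signature are illustrative assumptions, not an established recipe; the `finally` block is what keeps the counter from drifting when a request throws. As noted above, this only counts requests and does not measure actual CPU usage.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

final class InFlightGate {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int cores;

    InFlightGate(int cores) { this.cores = cores; }

    /** Runs the handler with a suggested parallelism derived from the current request count. */
    <T> T handle(IntFunction<T> handler) {
        int active = inFlight.incrementAndGet();
        try {
            // Share the cores between the requests active right now; never below 1.
            int parallelism = Math.max(1, cores / active);
            return handler.apply(parallelism);
        } finally {
            inFlight.decrementAndGet(); // always undo the increment, even on exceptions
        }
    }

    int inFlight() { return inFlight.get(); }
}
```

The parallelism is decided once, when the request enters, so a burst that arrives right afterwards can still oversubscribe the machine.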
I had left this question open for a while to see if more input is coming. Nope. Although I don't like the "no-can-do" answer to technical questions, I'm going to accept Holger's answer now. He has good reputation, good arguments, and others have approved his answer.
Myself I had experimented with idea 2 a bit. I queried the getSystemCpuLoad() in tasks to decide how large their own ExecutorService could be. As Holger wrote, when there is a SINGLE ExecutorService, resources can be managed well. But as soon as tasks start their own tasks, they cannot - it didn't work out for me.
There is no way of limiting based on “free CPU” and it wouldn’t work anyway. The information about “free CPU” is outdated as soon as you get it. Suppose you have twelve threads running concurrently and detecting at the same time that there is one free CPU core and decide to schedule a sub-task…
What you can do is limit the maximum resource consumption, which works quite well when using a single ExecutorService with a maximum number of threads for all tasks.
The tricky part is the dependency of the tasks on the result of the sub-tasks, which are enqueued at a later time and might still be pending due to the limited number of worker threads.
This can be adjusted by revoking the parallel execution if the task detects that its sub-task is still pending. For this to work, create a FutureTask for the sub-task manually and schedule it with execute rather than submit. Then proceed within the task as normal and, at the place where you would perform the sub-task in a sequential implementation, check whether you can remove the FutureTask from the ThreadPoolExecutor. Unlike cancel, this works only if it has not started yet and hence is an indicator that there are no free threads. So if remove returns true, you can perform the sub-task in-place, letting all other threads perform tasks rather than sub-tasks. Otherwise, you can wait for the result.
At this place it’s worth noting that it is ok to have more threads than CPU cores if the tasks accommodate I/O operations (or may wait for sub-tasks). The important point here is to have a limit.
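FutureTask<Integer> coWorker = new FutureTask<>(/* callable wrapping sub-task*/);
executor.execute(coWorker); // schedule with execute, not submit, so remove can find it
// proceed in the task’s sequence
if (executor.remove(coWorker)) coWorker.run(); // not started yet: do in-place
Integer subTaskResult = coWorker.get(); // otherwise wait for the worker's result
// proceed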
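A self-contained sketch of the remove-or-wait pattern described above may make the control flow easier to follow. The helper name `runWithOptionalHelp` and the pool size are assumptions for illustration; only `FutureTask`, `ThreadPoolExecutor.execute` and `ThreadPoolExecutor.remove` come from the answer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class StealBackSubTask {

    /** Runs subTask on the pool if a worker picks it up in time, otherwise on the calling thread. */
    static <T> T runWithOptionalHelp(ThreadPoolExecutor executor,
                                     Runnable ownWork,
                                     Callable<T> subTask) throws Exception {
        FutureTask<T> coWorker = new FutureTask<>(subTask);
        executor.execute(coWorker);      // execute, not submit: the queue must hold this exact object
        ownWork.run();                   // the part of the task that runs in sequence
        if (executor.remove(coWorker)) { // still queued, so no worker was free
            coWorker.run();              // do the sub-task in place
        }
        return coWorker.get();           // already done, or wait for the worker running it
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        try {
            Integer result = runWithOptionalHelp(executor, () -> { }, () -> 6 * 7);
            System.out.println(result); // 42
        } finally {
            executor.shutdown();
        }
    }
}
```

Which thread ends up running the sub-task depends on pool load at that moment; the result is the same either way.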
It sounds like the ForkJoinPool introduced in Java 7 would be exactly what you need. The ForkJoinPool is specifically designed to keep all your CPUs busy, meaning that there are as many threads as there are CPUs and that all those threads are working rather than blocking (for the latter, make sure that you use a ManagedBlocker for DB queries).
In a ForkJoinTask there is the method getSurplusQueuedTaskCount for which the JavaDoc says "This value may be useful for heuristic decisions about whether to fork other tasks." and as such serves as a better replacement for your getSystemCpuLoad solution to make decisions about task decompositions. This allows you to reduce the number of decompositions when system load is high and thus reduce the impact of the task decomposition overhead.
Also see my answer here for a more in-depth explanation of the principles of Fork/Join pools.
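As an illustration of that heuristic, here is a minimal RecursiveTask that stops splitting once the current worker has a surplus of queued tasks. The summing workload and both thresholds (MIN_CHUNK, MAX_SURPLUS) are arbitrary assumptions chosen for the sketch; suitable values would have to be found by measurement.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class AdaptiveSum extends RecursiveTask<Long> {
    private static final int MIN_CHUNK = 1_000; // assumed smallest range worth splitting
    private static final int MAX_SURPLUS = 3;   // assumed limit for "enough work already queued"

    private final long[] data;
    private final int from;
    private final int to;

    AdaptiveSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        int length = to - from;
        // Stop forking when the range is small, or when this worker already holds
        // more queued tasks than other workers are likely to steal.
        if (length <= MIN_CHUNK || getSurplusQueuedTaskCount() > MAX_SURPLUS) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = from + length / 2;
        AdaptiveSum left = new AdaptiveSum(data, from, mid);
        left.fork();                                    // offer the left half to other workers
        long right = new AdaptiveSum(data, mid, to).compute();
        return right + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(ForkJoinPool.commonPool().invoke(new AdaptiveSum(data, 0, data.length)));
        // 4999950000
    }
}
```

Under heavy load the surplus check trips early and most of the work runs sequentially inside each task, which is the behaviour the question asks for.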

How much more efficient is Java Google App Engine in threadsafe mode?

In Java Google App Engine you can turn on Concurrent Requests / Threadsafe mode:
The only reason to do this is that the Google servers will need to spin up fewer instances of your app to serve a given number of requests and therefore potentially save you money. Of course doing this will also mean you will have to write threadsafe code.
So the interesting question is: how much money does this tend to save? Has anyone attempted to measure it under some benchmark configuration / application functionality / load ?
This really depends on your code:
In single request mode, you can easily calculate requests per second: if a request on average takes 100ms to finish, then one instance will be able to perform 10 requests per second.
In concurrent request mode this depends on two factors:
A. The type of instance you are using - AFAIK they are all the same; you just get a different number of cores. More cores means higher concurrent performance.
B. The ratio of CPU-bound code versus IO-bound code a request is performing. If your code is more IO-bound (= waiting for Datastore or other external service) then CPU will be able to run more of it in parallel.
In my app I see 15-20 rps at 200ms per request on the basic instance, so I could say that the factor between single-request and multi-request mode is about 3-4.
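The arithmetic behind those figures can be written out as below. This is just the calculation from the answer, with hypothetical method names; it says nothing about what another application would measure.

```java
final class InstanceThroughput {

    /** Requests per second one instance can serve when it handles one request at a time. */
    static double singleRequestRps(double latencyMs) {
        return 1000.0 / latencyMs;
    }

    /** How many times more throughput was observed than single-request mode would allow. */
    static double concurrencyFactor(double observedRps, double latencyMs) {
        return observedRps / singleRequestRps(latencyMs);
    }

    public static void main(String[] args) {
        System.out.println(singleRequestRps(100));      // 10.0
        System.out.println(concurrencyFactor(15, 200)); // 3.0
        System.out.println(concurrencyFactor(20, 200)); // 4.0
    }
}
```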

Minimum size for a piece of work to be beneficially executed on another thread?

I have a low latency system that receives UDP messages. Depending on the message, the system responds by sending out 0 to 5 messages. Figuring out each possible response takes 50 us (microseconds), so if we have to send 5 responses, it takes 250 us.
I'm considering splitting the system up so that each possible response is calculated by a different thread, but I'm curious about the minimum "work time" needed to make that better. While I know I need to benchmark this to be sure, I'm interested in opinions about the minimum piece of work that should be done on a separate thread.
If I have 5 threads waiting on a signal to do 50 us of work, and they don't contend much, will the total time before all 5 are done be more or less than 250 us?
Passing data from one thread to another is very fast (1-4 us), provided the thread is already running on a core (and not sleeping, waiting or yielding). If your thread has to wake, it can take 15 us, and the task itself will also take longer because the cache is likely to have lots of misses. This means the task can take 2-3x longer.
Is that 50us compute-bound, or IO-bound ? If compute-bound, do you have multiple cores available to run these in parallel ?
Sorry - lots of questions, but your particular environment will affect the answer to this. You need to profile and determine what makes a difference in your particular scenario (perhaps run tests with differently sized thread pools?).
Don't forget, too, that threads take up a significant amount of memory for their stack (512k by default, IIRC), and that could affect performance as well (through paging etc.).
If you have more cores than threads, and if the threads are truly independent, then I would not be surprised if the multi-threaded approach took less than 250 us. Whether it does or not will depend on the overhead of creating and destroying threads. Your situation seems ideal, however.
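Since the answers all come down to "measure it", here is a minimal timing sketch for that comparison. It is not a rigorous benchmark: the busy-wait in `busyWork` is a stand-in for the real 50 us calculation, the pool's idle threads are parked and have to be woken, and results will vary by machine. A harness such as JMH would be the better tool for real numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FanOutTiming {

    /** Stand-in for one response calculation: spins for roughly the given time. */
    static long busyWork(long nanos) {
        long end = System.nanoTime() + nanos;
        long spins = 0;
        while (System.nanoTime() < end) spins++;
        return spins;
    }

    /** Runs the calculations one after another and returns the elapsed nanoseconds. */
    static long timeSequential(int tasks, long nanosPerTask) {
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) busyWork(nanosPerTask);
        return System.nanoTime() - start;
    }

    /** Hands the calculations to the pool, waits for all, and returns the elapsed nanoseconds. */
    static long timeParallel(ExecutorService pool, int tasks, long nanosPerTask) throws Exception {
        List<Callable<Long>> work = new ArrayList<>();
        for (int i = 0; i < tasks; i++) work.add(() -> busyWork(nanosPerTask));
        long start = System.nanoTime();
        for (Future<Long> f : pool.invokeAll(work)) f.get();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(5); // threads created up front
        try {
            for (int i = 0; i < 10_000; i++) {                  // warm up JIT and pool threads
                timeSequential(5, 1_000);
                timeParallel(pool, 5, 1_000);
            }
            System.out.println("sequential ns: " + timeSequential(5, 50_000));
            System.out.println("parallel ns:   " + timeParallel(pool, 5, 50_000));
        } finally {
            pool.shutdown();
        }
    }
}
```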

