Java concurrency based on available FREE cpu


QUESTION
How do I scale to use more threads if and only if there is free cpu?
Something like a ThreadPoolExecutor that uses more threads when cpu cores are idle, and less or just one if not.
USE CASE
Current situation:
My Java server app processes requests and serves results.
There is a ThreadPoolExecutor to serve the requests with a reasonable number of max threads following the principle: number of cpu cores = number of max threads.
The work performed is cpu heavy, and there's some disk IO (DBs).
The code is linear, single threaded.
A single request takes between 50 and 500 ms to process.
Sometimes there are just a few requests per minute, and other times there are 30 simultaneous.
A modern server with 12 cores handles the load nicely.
The throughput is good, the latency is ok.
Desired improvement:
When there is a low number of requests, as is the case most of the time, many cpu cores are idle.
Latency could be improved in this case by running some of the code for a single request multi-threaded.
Some prototyping shows improvements, but as soon as I test with a higher number of concurrent requests,
the server goes bananas. Throughput goes down, memory consumption goes overboard.
30 simultaneous requests sharing a queue of 10 meaning that 10 can run at most while 20 are waiting,
and each of the 10 uses up to 8 threads at once for parallelism, seems to be too much for a machine
with 12 cores (out of which 6 are virtual).
This seems to me like a common use case, yet I could not find information by searching.
IDEAS
1) request counting
One idea is to count the current number of processed requests. If 1 or low then do more parallelism,
if high then don't do any and continue single-threaded as before.
This sounds simple to implement. Drawbacks are: request counter resetting must not contain bugs,
think finally. And it does not actually check available cpu, maybe another process uses cpu also.
In my case the machine is dedicated to just this application, but still.
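A minimal sketch of idea 1, assuming a hypothetical helper class `RequestLoadGate` (not part of the application described above). It counts in-flight requests with an `AtomicInteger`, resets the count in `finally`, and hands the handler a suggested degree of parallelism. The `cores / active` rule is only an illustrative heuristic.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

// Hypothetical helper: tracks in-flight requests and suggests a per-request parallelism.
final class RequestLoadGate {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int cores;

    RequestLoadGate(int cores) {
        this.cores = cores;
    }

    // Runs the handler with a suggested thread count. The counter is always
    // decremented in finally, even if the handler throws.
    <T> T handle(IntFunction<T> handler) {
        int active = inFlight.incrementAndGet();
        try {
            // Assumed heuristic: share the cores evenly, never go below 1 thread.
            int parallelism = Math.max(1, cores / active);
            return handler.apply(parallelism);
        } finally {
            inFlight.decrementAndGet();
        }
    }

    int inFlight() {
        return inFlight.get();
    }
}
```

The suggestion is computed once when the request starts, so a request that started alone keeps its high parallelism even if many others arrive afterwards. As noted above, it also knows nothing about CPU used by other processes.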
2) actual cpu querying
I'd think that the correct approach would be to just ask the cpu, and then decide.
Since Java7 there is OperatingSystemMXBean.getSystemCpuLoad() see http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
but I can't find any webpage that mentions getSystemCpuLoad and ThreadPoolExecutor, or a similar
combination of keywords, which tells me that's not a good path to go.
The JavaDoc says "Returns the "recent cpu usage" for the whole system", and I'm wondering what
"recent cpu usage" means, how recent that is, and how expensive that call is.
UPDATE
I had left this question open for a while to see if more input is coming. Nope. Although I don't like the "no-can-do" answer to technical questions, I'm going to accept Holger's answer now. He has good reputation, good arguments, and others have approved his answer.
Myself I had experimented with idea 2 a bit. I queried the getSystemCpuLoad() in tasks to decide how large their own ExecutorService could be. As Holger wrote, when there is a SINGLE ExecutorService, resources can be managed well. But as soon as tasks start their own tasks, they cannot - it didn't work out for me.
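For reference, idea 2 could be sketched roughly as below. The class `CpuLoadProbe` and the mapping from load to thread count are assumptions for illustration. `getSystemCpuLoad()` lives on the `com.sun.management` extension of `OperatingSystemMXBean`, may return a negative value when the load is not available, and is deprecated in newer JDKs in favor of `getCpuLoad()`.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper for idea 2: sample the system CPU load and derive a thread count.
final class CpuLoadProbe {

    // Returns the recent system CPU load in [0.0, 1.0], or a negative value if unavailable.
    static double systemCpuLoad() {
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getSystemCpuLoad();
        }
        return -1.0; // not a HotSpot-style JVM: no load information
    }

    // Assumed heuristic: use roughly as many threads as there are idle cores, at least 1.
    static int suggestedParallelism(double load, int cores) {
        if (load < 0) {
            return 1; // unknown load: stay single-threaded
        }
        int idleCores = (int) Math.floor((1.0 - load) * cores);
        return Math.max(1, idleCores);
    }
}
```

The value is only a snapshot. If several requests sample it at the same moment, they all see the same idle cores and all decide to fan out, which matches the experience described in the update.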

There is no way of limiting based on “free CPU” and it wouldn’t work anyway. The information about “free CPU” is outdated as soon as you get it. Suppose you have twelve threads running concurrently and detecting at the same time that there is one free CPU core and decide to schedule a sub-task…
What you can do is limiting the maximum resource consumption which works quite well when using a single ExecutorService with a maximum number of threads for all tasks.
The tricky part is the dependency of the tasks on the result of the sub-tasks which are enqueued at a later time and might still be pending due to the limited number of worker threads.
This can be adjusted by revoking the parallel execution if the task detects that its sub-task is still pending. For this to work, create a FutureTask for the sub-task manually and schedule it with execute rather than submit. Then proceed within the task as normally and at the place where you would perform the sub-task in a sequential implementation check whether you can remove the FutureTask from the ThreadPoolExecutor. Unlike cancel this works only if it has not started yet and hence is an indicator that there are no free threads. So if remove returns true you can perform the sub-task in-place letting all other threads perform tasks rather than sub-tasks. Otherwise, you can wait for the result.
At this place it’s worth noting that it is ok to have more threads than CPU cores if the tasks accommodate I/O operations (or may wait for sub-tasks). The important point here is to have a limit.
FutureTask<Integer> coWorker = new FutureTask<>(/* callable wrapping sub-task*/);
executor.execute(coWorker);
// proceed in the task’s sequence
if(executor.remove(coWorker)) coWorker.run();// do in-place if needed
subTaskResult=coWorker.get();
// proceed
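The fragment above can be wrapped into a self-contained sketch. The class and method names (`InPlaceFallback`, `runWithFallback`) and the `ownWork` parameter are made up for illustration. The calls themselves (`FutureTask`, `ThreadPoolExecutor.execute`, `ThreadPoolExecutor.remove`) are the ones the answer describes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical wrapper around the execute/remove/run pattern described above.
final class InPlaceFallback {

    // Offers the sub-task to the pool, does the caller's own work, then reclaims
    // the sub-task if no worker has picked it up in the meantime.
    static <T> T runWithFallback(ThreadPoolExecutor executor,
                                 Callable<T> subTask,
                                 Runnable ownWork) {
        FutureTask<T> coWorker = new FutureTask<>(subTask);
        executor.execute(coWorker);   // execute, not submit, so remove() sees this exact object
        ownWork.run();                // the part of the task that does not need the sub-result
        if (executor.remove(coWorker)) {
            coWorker.run();           // still queued, so no worker was free: do it in place
        }
        try {
            return coWorker.get();    // immediate if run in place, otherwise waits for the worker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

With a saturated pool the sub-task runs on the calling thread, so a busy server degrades to the original sequential behavior instead of piling up threads.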

It sounds like the ForkJoinPool introduced in Java 7 would be exactly what you need. The ForkJoinPool is specifically designed to keep all your CPUs busy, meaning that there are as many threads as there are CPUs and that all those threads are working rather than blocking (for the latter, make sure that you use a ManagedBlocker for DB queries).
In a ForkJoinTask there is the method getSurplusQueuedTaskCount for which the JavaDoc says "This value may be useful for heuristic decisions about whether to fork other tasks." and as such serves as a better replacement for your getSystemCpuLoad solution to make decisions about task decompositions. This allows you to reduce the number of decompositions when system load is high and thus reduce the impact of the task decomposition overhead.
Also see my answer here for a more in-depth explanation of the principles of Fork/Join pools.
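A small sketch of how `getSurplusQueuedTaskCount` might drive the fork decision. The task (`AdaptiveSum`, summing an array) and both thresholds are assumptions chosen for illustration, not values from the answer.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums part of an array, forking only while the current worker has few surplus queued tasks.
final class AdaptiveSum extends RecursiveTask<Long> {
    private static final int SURPLUS_LIMIT = 3;  // assumed threshold, tune by measurement
    private static final int MIN_CHUNK = 1_000;  // assumed smallest range worth splitting

    private final long[] data;
    private final int from;
    private final int to;

    AdaptiveSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        int length = to - from;
        // Stop decomposing when the range is small or enough work is already queued.
        if (length <= MIN_CHUNK || getSurplusQueuedTaskCount() > SURPLUS_LIMIT) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = from + length / 2;
        AdaptiveSum left = new AdaptiveSum(data, from, mid);
        left.fork();                                           // may be stolen by an idle worker
        long right = new AdaptiveSum(data, mid, to).compute(); // keep working on this thread
        return right + left.join();
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new AdaptiveSum(data, 0, data.length));
    }
}
```

When every worker is busy, nobody steals the queued halves, the surplus grows, and tasks fall back to sequential loops.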

Related

Java. Difference between Thread.sleep() and ScheduledExecutorService methods [duplicate]

Goal: Execute certain code every once in a while.
Question: In terms of performance, is there a significant difference between:
while(true) {
execute();
Thread.sleep(10 * 1000);
}
and
executor.scheduleWithFixedDelay(runnableWithoutSleep, 0, 10, TimeUnit.SECONDS);
?
Of course, the latter option is more kosher. Yet, I would like to know whether I should embark on an adventure called "Spend a few days refactoring legacy code to say goodbye to Thread.sleep()".
Update:
This code runs in super/mega/hyper high-load environment.

You're dealing with sleep times termed in tens of seconds. The possible savings by changing your sleep option here is likely nanoseconds or microseconds.
I'd prefer the latter style every time, but if you have the former and it's going to cost you a lot to change it, "improving performance" isn't a particularly good justification.
EDIT re: 8000 threads
8000 threads is an awful lot; I might move to the scheduled executor just so that you can control the amount of load put on your system. Your point about varying wakeup times is something to be aware of, although I would argue that the bigger risk is a stampede of threads all sleeping and then waking in close succession and competing for all the system resources.
I would spend the time to throw these all in a fixed thread pool scheduled executor. Only have as many running concurrently as you have available of the most limited resource (for example, # cores, or # IO paths) plus a few to pick up any slop. This will give you good throughput at the expense of latency.
With the Thread.sleep() method it will be very hard to control what is going on, and you will likely lose out on both throughput and latency.
If you need more detailed advice, you'll probably have to describe what you're trying to do in more detail.

Since you haven't mentioned the Java version, details might differ.
As I recall from the Java source code, the main difference is in how the two are implemented internally.
For Sun Java 1.6, if you use the second approach, the native code also brings in wait and notify calls to the system, which is, in a way, more thread-efficient and CPU-friendly.
But then again you lose control, and it becomes less predictable for your code - consider that you want to sleep for exactly 10 seconds.
So, if you want more predictability, you can surely go with option 1.
Also, on a side note, when you encounter things like this in legacy systems, there is an 80% chance that there are now better ways of doing it, but in the remaining 20% the magic numbers are there for a reason, so change it at your own risk :)

There are different scenarios,
The Timer creates a queue of tasks that is continually updated. When the Timer is done, it may not be garbage collected immediately. So creating more Timers only adds more objects onto the heap. Thread.sleep() only pauses the thread, so memory overhead would be extremely low
Timer/TimerTask also takes into account the execution time of your task, so it will be a bit more accurate. And it deals better with multithreading issues (such as avoiding deadlocks etc.).
If your thread gets an exception and is killed, that is a problem. But TimerTask will take care of it. It will run irrespective of a failure in the previous run.
The advantage of TimerTask is that it expresses your intention much better (i.e. code readability), and it already has the cancel() feature implemented.
Reference is taken from here

You said you are running in a "mega... high-load environment" so if I understand you correctly you have many such threads simultaneously sleeping like your code example. It takes less CPU time to reuse a thread than to kill and create a new one, and the refactoring may allow you to reuse threads.
You can create a thread pool by using a ScheduledThreadPoolExecutor with a corePoolSize greater than 1. Then when you call scheduleWithFixedDelay on that thread pool, if a thread is available it will be reused.
This change may reduce CPU utilization as threads are being reused rather than destroyed and created, but the degree of reduction will depend on the tasks they're doing, the number of threads in the pool, etc. Memory usage will also go down if some of the tasks overlap since there will be fewer threads sitting idle at once.
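A rough sketch of moving several sleep loops onto one shared `ScheduledThreadPoolExecutor`. The class `PeriodicJobs` and its `guarded` wrapper are illustrative names. One detail worth knowing when refactoring: if a run of a periodic task throws, the executor suppresses all later runs of that task, so the sketch catches and logs instead.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs many periodic jobs on a small shared pool.
final class PeriodicJobs {

    // Schedules every job with a fixed delay between the end of one run and the next.
    static ScheduledExecutorService start(int poolSize, long delay, TimeUnit unit,
                                          Runnable... jobs) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(poolSize);
        for (Runnable job : jobs) {
            pool.scheduleWithFixedDelay(guarded(job), 0, delay, unit);
        }
        return pool;
    }

    // Keeps a failing run from cancelling the schedule: log and wait for the next run.
    private static Runnable guarded(Runnable job) {
        return () -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                System.err.println("periodic job failed: " + e);
            }
        };
    }
}
```

For the example in the question this would be `PeriodicJobs.start(poolSize, 10, TimeUnit.SECONDS, job1, job2, ...)`, with `poolSize` bounded by the most limited resource rather than by the number of jobs.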

How to determine optimal number of threads for high latency network requests?

I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.
The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.
This is similar to, but not the same as There is a way to determine the ideal number of threads? and Java best way to determine the optimal number of threads [duplicate]... the primary difference is that I am only bound by latency.
I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:
ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();
for (int quad3 = 0 ; quad3 < 256 ; ++quad3) {
for (int quad4 = 0 ; quad4 < 256 ; ++quad4) {
byte[] quads = { quad1, quad2, (byte)quad3, (byte)quad4 };
futures.add(executorService.submit(new RetrieverCallable(quads)));
}
}
... I then dequeue all the elements in the queue and put the results in the required data structure:
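int[] results = new int[65536];
int i = 0;
while (!futures.isEmpty()) {
    try {
        // Futures are dequeued in submission order; get() blocks until that one is done.
        results[i] = futures.remove().get();
    } catch (Exception e) {
        results[i] = -1; // mark failed requests
    }
    ++i;
}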
My first question is: Is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed are moved out of available slots so that other threads may begin?
My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?

It will shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have learned that multithreading does not multiply your network bandwidth. Now, it is time to know that threads do computation. They do not do the (high-latency) communication. The communication is performed by the network adapter, which is another process, running truly in parallel with the CPU. It is stupid to allocate a thread (see which allocated resources are listed by this gentleman who claims that you need 1 thread) just to sleep until the network adapter finishes its job. You need no threads for I/O = you need 0 threads.
It makes sense to allocate the threads for computation to make in parallel with I/O request(s). The amount of threads will depend on the computation-to-communication ratio and limited by the number of cores in your CPU.
Sorry, I had to say this even though you have clearly implied a commitment to blocking I/O; so many people do not understand this basic thing. Take the advice: use asynchronous I/O and you'll see that the issue does not exist.

As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.
He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.
Tuning the pool size
Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
My emphasis.
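The quoted rule of thumb, N*(1+WT/ST), is easy to turn into a helper once WT and ST have been measured. The class name `PoolSizing` and the rounding up to a whole thread are assumptions for illustration.

```java
// Hypothetical helper for the N * (1 + WT/ST) rule of thumb quoted above.
final class PoolSizing {

    // processors: N; waitTime: WT; serviceTime: ST (WT and ST in the same unit).
    static int threadsFor(int processors, double waitTime, double serviceTime) {
        if (processors < 1 || waitTime < 0 || serviceTime <= 0) {
            throw new IllegalArgumentException("need processors >= 1, WT >= 0, ST > 0");
        }
        // Round up so that a fractional result still gets a whole thread.
        return (int) Math.ceil(processors * (1.0 + waitTime / serviceTime));
    }
}
```

For example, 4 processors with 90 ms of waiting per 10 ms of service time give 4 * (1 + 9) = 40 threads. With waits of seconds and service times of microseconds, as in the question, the formula yields a very large number, so it is better read as an upper bound to test against than as a target.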

Have you considered using Actors?
Best practises.
Actors should be like nice co-workers: do their job efficiently
without bothering everyone else needlessly and avoid hogging
resources. Translated to programming this means to process events and
generate responses (or more requests) in an event-driven manner.
Actors should not block (i.e. passively wait while occupying a Thread)
on some external entity—which might be a lock, a network socket,
etc.—unless it is unavoidable; in the latter case see below.
Sorry, I can't elaborate, because haven't much used this.
UPDATE
Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?

Pretty sure that in the described circumstances, the optimal number of threads is 1. In fact, that is surprisingly often the answer to any question of the form 'how many threads should I use?'
Each additional thread adds extra overhead in terms of stack (and associated GC roots), context switching and locking. This may or may not be measurable: the effort to meaningfully measure it in all target environments is non-trivial. In return, there is little scope to provide any benefit, as processing is neither CPU-bound nor IO-bound.
So fewer is always better, if only for reasons of risk reduction. And you can't have fewer than 1.

I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections can network layer bear, and make this number a parameter for your program.
Then, spending a thread for each request require a lot of memory. You can avoid this using non-blocking sockets. In Java, there are 2 options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so better find a ready-made library and reuse it. NIO2 is simple but available only since JDK1.7.
Processing the responses should be done on a thread pool. I don't think the number of threads in the thread pool greatly affects the overall performance in your case. Just make tuning for thread pool size from 1 to the number of available processors.

In our high-performance systems, we use the actor model as described by @Andrey Chaschev.
The optimal number of threads in your actor model differs with your CPU structure and how many processes (JVMs) you run per box. Our findings are:
If you have only 1 process, use total CPU cores - 2.
If you have multiple processes, check your CPU structure. We found it's good to have number of threads = number of cores in a single CPU - e.g. if you have a 4-CPU server with each CPU having 4 cores, then using 4 threads per JVM gives you the best performance. After that, always leave at least 1 core to your OS.

A partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.
You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.
You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.

busy spin to reduce context switch latency (java)

In my application there are several services that process information on their own thread. When they are done, they post a message to the next service, which then continues to do its work on its own thread. The handover of messages is done via a LinkedBlockingQueue. The handover normally takes 50-80 us (from putting a message on the queue until the consumer starts to process the message).
To speed up the handover on the most important services I wanted to use a busy spin instead of a blocking approach (I have 12 processor cores and want to dedicate 3 to these important services).
So.. I changed LinkedBlockingQueue to ConcurrentLinkedQueue
and did
for(;;)
{
Message m = queue.poll();
if( m != null )
....
}
Now, the result is that the first message pass takes 1 us, but then the latency increases over the next 25 handovers until it reaches 500 us; then the latency is suddenly back to 1 us and starts to increase again. So I have latency cycles of 25 iterations where latency starts at 1 us and ends at 500 us. (Messages are passed approximately 100 times per second.)
With an average latency of 250 us, it is not exactly the performance gain I was looking for.
I also tried to use the LMAX Disruptor ring buffer instead of the ConcurrentLinkedQueue. That framework has its own built-in busy-spin implementation and a quite different queue implementation, but the result was the same. So I'm quite certain that it's not the fault of the queue or of me abusing something.
Question is.. What the Heck is going on here? Why am I seeing this strange latency cycles?
Cheers!!

As far as I know, the thread scheduler can deliberately pause a thread for a longer time if it detects that this thread is using the CPU quite intensively, in order to distribute CPU time between different threads more fairly. Try calling LockSupport.park() in the consumer when the queue is empty and LockSupport.unpark(consumerThread) in the producer after adding a message - it might make the latency less variable; whether it will actually be better compared to a blocking queue is a big question though.
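A sketch of the park/unpark handover suggested here, assuming a single consumer thread. The class `ParkingMailbox` is a made-up name. `LockSupport.park` may return spuriously, so the consumer re-checks the queue in a loop, and `LockSupport.unpark(null)` is documented as a no-op, which covers a producer that runs before the consumer has registered itself.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical single-consumer mailbox: parks when empty instead of spinning.
final class ParkingMailbox<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private volatile Thread consumer; // set by the consumer on its first take()

    // Called by producers: enqueue first, then wake the consumer if there is one.
    void put(T message) {
        queue.add(message);
        LockSupport.unpark(consumer); // no effect if consumer is still null
    }

    // Called by the single consumer thread; blocks until a message is available.
    T take() throws InterruptedException {
        consumer = Thread.currentThread();
        T message;
        while ((message = queue.poll()) == null) {
            LockSupport.park(this); // returns at once if a permit is already pending
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
        }
        return message;
    }
}
```

A pending permit from an earlier `unpark` makes the next `park` return immediately, so a message added between `poll()` and `park()` is not lost. Whether this beats `LinkedBlockingQueue` on latency would have to be measured.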

If you really need to do the job the way you described (and not the way Andrey Nudko replied on Jan 5 at 13:22), then you definitely need to look at the problem from other viewpoints as well.
Just some hints:
Try checking your overall environment (outside the JVM). For example:
OS CPU scheduler has a huge impact on this..currently the default is very likely
http://en.wikipedia.org/wiki/Completely_Fair_Scheduler
number of running processes, etc.
"problems" inside your JVM
garbage collector (try different one: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#1.1.%20Types%20of%20Collectors%7Coutline)
Try changing thread priorities: Setting priority to Java's threads

This is just wild speculation (since as others have mentioned, you're not gathering any information on the queue length, failed polls, null polls etc.):
I used the force and read the source of ConcurrentLinkedQueue, or rather, briefly leafed through it for a minute or two. The polling is not quite your trivial O(1) operation. It might be the case that you're traversing more than a few nodes which have become stale, holding null; and there might be additional transitory states involving nodes linking to themselves as the next node as indication of staleness/removal from the queue. It may be that the queue is starting to build up garbage due to thread scheduling. Try following the links to the abstract algorithm mentioned in the code:
Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms by Maged M. Michael and Michael L. Scott (link has a PDF and pseudocode).

Here is my 2 cents. If you are running on linux/unix based systems, there is a way to dedicate a certain cpu to a certain thread. In essence, you can make the OS ignore that cpu for any scheduling. Checkout the isolation levels for cpu

Is concurrent programming more grided or clustered?

I'm trying to wrap my brain around parallel/concurrent programming (in Java) and am getting hung up on some fundamentals that don't seem to be covered in any of the tutorials I've been reading.
When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
For example, let's say we have EndWorldHungerTask implements Runnable, and task accomplishes some enormous problem. In order to complete its objective, it has to do some really heavy lifting, say, a hundred million times:
public class EndWorldHungerTask implements Runnable {
public void run() {
for(int i = 0; i < 100000000; i++)
someReallyExpensiveOperation();
}
}
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
But, under the first paradigm, Java is telling each Thread when to execute. Under the second, the developer needs to manually (in the code) partition the problem ahead of time, and assign each sub-problem to a new Thread.
I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.

I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.
This is highly dependent on the task at hand.
The standard paradigm in Java is that you have to split the work into chunks yourself. Distributing those chunks across multiple threads/cores is a separate problem, and there exist a variety of patterns for that (queues, thread pools, etc).
It is interesting to note that there exist frameworks that can automatically make use of multiple cores to execute things like for loops in parallel (for example, OpenMP). However, I am not aware of any such frameworks for Java.
Finally, it could be the case that the low-level library that does the bulk of the work can make use of multiple cores. In such a case the higher-level code may be able to remain single-threaded and still benefit from multicore hardware. One example might be numerical code using MKL under the covers.
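The "split the work into chunks yourself" approach from this answer might look like the sketch below. The class `ChunkedLoop`, the one-chunk-per-worker split, and the idea of summing a per-iteration result are assumptions made to keep the example checkable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntToLongFunction;

// Hypothetical helper: splits a loop over [0, iterations) into one chunk per worker.
final class ChunkedLoop {

    // Applies op to every index and returns the sum of the results.
    static long sum(int iterations, int workers, IntToLongFunction op) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (iterations + workers - 1) / workers; // ceiling division
            for (int start = 0; start < iterations; start += chunk) {
                final int from = start;
                final int to = Math.min(iterations, start + chunk);
                Callable<Long> part = () -> {
                    long local = 0; // each chunk accumulates privately, no shared state
                    for (int i = from; i < to; i++) {
                        local += op.applyAsLong(i);
                    }
                    return local;
                };
                parts.add(pool.submit(part));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get(); // waits for each chunk in turn
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The developer decides the partitioning (`from`/`to` per chunk). The thread pool only decides which thread runs which chunk and when.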

When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
I think this depends highly on the problem. There are times where you have the same task that you call thousands or millions of times using the same code. This is the ExecutorService.submit() type of pattern. You have millions of lines from a file and you are running some processing methods on each line. I guess this is your "spreading it over many threads" type of problem. This works for simple thread models.
But there are other cases where the problem space is made up of a large number of non-homogenous tasks. Sometimes you might spawn a single thread to handle some background keep-alive, and other times a thread pool here and there to process some queue of work. Typically the larger the scope of the problem, the more complicated the concurrency model and the more different types of pools and threads are used. I guess this is your "decomposing it into smaller sub-problems" type.
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
In your case, I don't see how you can solve world hunger (to use your analogy) with one set of thread code. I think that you have to "decompose it into smaller sub-problems" which corresponds to the latter case that I explain above: a whole series of threads running different code. Some of the sub-solutions can be done in thread-pools and some will be done with individual threads, each running separate code.
I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.
"Normally" depends highly on the problem and its complexity. In my experience, I normally use the ExecutorService constructs as much as possible. But with any decent sized problem you will find yourself with a number of different thread-pools, Spring timer threads, custom one-off thread tasks, producer/consumer models, etc., etc..

Normally you would want each thread to execute one task from start to finish. You would gain nothing from leaving the task half done, then halting execution on that thread and "calling" another thread to finish the job. Java of course offers tools for this kind of thread synchronization, but they are really used when a task depends on another task to complete - not so that another thread may complete the task.
Most of the time you will have a big problem that consists of several tasks. If these tasks can be executed concurrently, then it would make sense to spawn threads to execute them. There is an overhead associated with creating threads, so if all the tasks are sequential and must wait for each other to finish, then it would not be beneficial at all to spawn multiple threads; use just one thread so you don't block the main thread.

"multi-threading" <> "parallel/concurrent programming".
Multithreaded apps are often written to take advantage of the high I/O performance of a preemptive multitasker. An example might be a web crawler/downloader. A multithreaded crawler would typically outperform a single-threaded version by a huge factor, even when running on a box with only one CPU core. The actions of a DNS query to get a site address, connecting to the site, downloading a page, writing it to a disk file are all operations that require little CPU but a lot of IO waiting. So, a lot of these unavoidable waits can be performed in parallel by many threads. When a DNS query comes in, an HTTP client connects or a disk operation is complete, the thread that requested it is made ready/running and can move on to the next operation.
The vast majority of apps are, primarily, written as multithreaded for this reason. That's why the box I'm writing this on has 98 processes, (of which 94 have more than one thread), 1360 threads and 3% CPU use - it's got little to do with splitting CPU work up across cores - it's mostly about IO performance.
Parallel/concurrent programming can actually take place with multiple CPU cores. For those apps that have CPU-intensive work that can be decomposed into largish packages for distribution across cores, a speedup factor approaching the number of cores is possible with care.
Naturally there is some bleedover - the I/O-bound web crawler will tend to perform better on a box with more cores, if only because the interrupt/driver overhead has a smaller impact on overall performance, but it won't be better by much.
It doesn't matter how many workers you have available for the EndWorldHunger Task if they are all waiting for the crops to grow.

Minimum size for a piece of work to be beneficially executed on another thread?

I have a low latency system that receives UDP messages. Depending on the message, the system responds by sending out 0 to 5 messages. Figuring out each possible response takes 50 us (microseconds), so if we have to send 5 responses, it takes 250 us.
I'm considering splitting the system up so that each possible response is calculated by a different thread, but I'm curious about the minimum "work time" needed to make that better. While I know I need to benchmark this to be sure, I'm interested in opinions about the minimum piece of work that should be done on a separate thread.
If I have 5 threads waiting on a signal to do 50 us of work, and they don't contend much, will the total time before all 5 are done be more or less than 250 us?

Passing data from one thread to another is very fast 1-4 us provided the thread is already running on the core. (and not sleep/wait/yielding) If your thread has to wake it can take 15 us but the task will also take longer as the cache is likely to have loads of misses. This means the task can take 2-3x longer.

Is that 50us compute-bound, or IO-bound ? If compute-bound, do you have multiple cores available to run these in parallel ?
Sorry - lots of questions, but your particular environment will affect the answer to this. You need to profile and determine what makes a difference in your particular scenario (perhaps run tests with differently size Threadpools ?).
Don't forget (also) that threads take up a significant amount of memory by default for their stack (by default, 512k, IIRC), and that could affect performance too (through paging requests etc.)

If you have more cores than threads, and if the threads are truly independent, then I would not be surprised if the multi-threaded approach took less than 250 us. Whether it does or not will depend on the overhead of creating and destroying threads. Your situation seems ideal, however.
