I was reading about threading and learned about the fork/join API.
I found that you can either run tasks on the commonPool, which is the default pool managing the worker threads, or submit them to a newly created ForkJoinPool.
The difference between the two is as follows, to my understanding:
The commonPool is the main pool, created statically (some pool methods, such as shutting it down, don't work on it the way they do on other pools), and it is the pool most of the application runs on.
The parallelism of the default commonPool is the number of cores - 1, whereas the default parallelism of a newly created pool is the number of cores (or the number specified by the parallelism system property - I'm ignoring the fully qualified system property key name).
Based on the documentation, the commonPool is fine for most uses.
This all boils down to my question:
When should I use the common pool? And why so? When should I create a new pool? And why so?
Short Story
The answer, like most things in software engineering, is: "It depends".
Pros of using the common pool
If you look at this wonderful article:
According to Oracle’s documentation, using the predefined common pool reduces resource consumption, since this discourages the creation of a separate thread pool per task.
and
Using the fork/join framework can speed up processing of large tasks, but to achieve this outcome, some guidelines should be followed:
Use as few thread pools as possible – in most cases, the best decision is to use one thread pool per application or system
Use the default common thread pool, if no specific tuning is needed
Use a reasonable threshold for splitting ForkJoinTask into subtasks
Avoid any blocking in your ForkJoinTasks
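To make those guidelines concrete, here is a minimal sketch (the array, threshold and task are all illustrative) of a RecursiveTask that splits work only above a threshold and runs on the common pool:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative only: sums a long[] by splitting above a threshold,
// executed on the common pool rather than a dedicated one.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {                 // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        left.fork();                                       // hand the left half to the pool
        long right = new SumTask(data, mid, to).compute(); // work on the right half here
        return right + left.join();
    }
}

// Usage: long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
```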
Pros of using dedicated pools
However, there are also some arguments AGAINST following this approach:
Dedicated Pool for Complex Applications
Having a dedicated pool per logical working unit in a complex application is sometimes the preferred approach. Imagine an application that:
Takes in a lot of events and groups them (that can be done in parallel)
Then workers do the work (that can be done in parallel as well)
Finally, some cleanup workers do some cleanup (that can be done in parallel as well).
So your application has 3 logical work groups, each of which might have its own demands for parallelism. (Keep in mind that the common pool's parallelism is set to something fairly low on most machines.)
Better not to step on each other's toes, right? Note that this scales only up to a certain point, beyond which it's recommended to have a separate microservice for each of these work units; but if for one reason or another you are not there yet, then a dedicated ForkJoinPool per logical work unit is not a bad idea.
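For illustration, a sketch of what that could look like (the names and parallelism levels are made up):

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical: one dedicated pool per logical work group, each sized to
// that group's needs, instead of everything competing for the common pool.
class WorkUnitPools {
    static final ForkJoinPool GROUPING = new ForkJoinPool(2);  // event grouping
    static final ForkJoinPool WORKERS  = new ForkJoinPool(4);  // payload handling
    static final ForkJoinPool CLEANUP  = new ForkJoinPool(1);  // background cleanup
}
```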
Other libraries
If your app's code has only one place where you want parallelism, you have no guarantee that some developer won't pull in a third-party dependency which also relies on the common ForkJoinPool, and then you have two places where this pool is in demand. That might be okay for your use case, and it might not be, especially if your default pool's parallelism is 4 or below.
Imagine the situation where your app's critical code (e.g. event handling or saving data to a database) has to compete for the common pool with some library which exports logs in parallel to some log sink.
Dedicated ForkJoinPool Makes Logging Neater
Additionally, the common ForkJoinPool has rather non-descriptive thread names, so if you are debugging or reading logs, chances are you will have to sift through a ton of
ForkJoinPool.commonPool-worker-xx
In the situation described above, compare that with:
ForkJoinPool.grouping-worker-xx
ForkJoinPool.payload-handler-worker-xx
ForkJoinPool.cleanup-worker
So there is some benefit in logging cleanliness when using a dedicated ForkJoinPool per logical work group.
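Note that by default a new ForkJoinPool names its workers "ForkJoinPool-N-worker-M"; to get names like the ones above you would pass a custom ForkJoinWorkerThreadFactory when building the dedicated pool. A minimal sketch (the prefix and parallelism are placeholders):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

// Sketch: a dedicated pool whose workers get a descriptive name, so log lines
// read "grouping-worker-N" instead of "ForkJoinPool.commonPool-worker-N".
class NamedPools {
    static ForkJoinPool named(String prefix, int parallelism) {
        return new ForkJoinPool(
                parallelism,
                pool -> {
                    ForkJoinWorkerThread t =
                            ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
                    t.setName(prefix + "-worker-" + t.getPoolIndex());
                    return t;
                },
                null,    // default UncaughtExceptionHandler
                false);  // asyncMode: false = LIFO processing of forked, unjoined tasks
    }

    static final ForkJoinPool GROUPING_POOL = named("grouping", 2);
}
```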
TL;DR
Using the common ForkJoinPool has a lower memory impact, uses fewer resources, creates fewer threads, and puts less pressure on the garbage collector. However, this approach might be insufficient for some use cases, as pointed out above.
Using a dedicated ForkJoinPool per logical work unit in your application provides neater logging, is not a bad idea when you have a low parallelism level (i.e. not many cores), and helps you avoid thread contention between logically different parts of your application. This, however, comes at the price of higher CPU utilization, higher memory overhead, and more thread creation.
Related
I have a (limited) thread pool which executes CPU-bound tasks. I'd like to aggregate some numerical statistics from each of these threads in a single place. Basically: each thread will update some shared stats (e.g. how long its job took) at a very high frequency and, at some much slower interval, a 'stat reader' would query those stats.
My first thought was to use some shared atomics and update them from each thread. This works ok, but in my testing the overhead of the atomics can get pretty high with a lot of contention so I was trying to think of some other alternatives.
My second thought was a sort of 'sharding' scheme, where each thread had its own stats object that it could update without requiring any synchronization. The 'stat reader' could then aggregate the stats from each thread into an overall stat value.
My first question is: does the thread sharding scheme make sense? Does something like that exist that I'm reinventing?
My second question is: if the sharding scheme does make sense, I'm trying to think of the best way to map threads to their shard:
1) Use the thread's ID mod some shard value to get a shard index, but I don't think that's reliable as I think the thread id value is shared, so I could get a collision.
2) Adding a thread-local index to the thread, but I don't think that will play nicely with the ExecutorService.
3) I could subclass Thread, but then I'd have to cast it when I wanted to access this which I'd rather avoid, if possible.
4) When the thread is created, create a mapping of its name to its shard. This would work, but there would be a race when creating the threads: one could be looking up its shard while we're adding a new shard to the map, causing concurrency issues.
Wondering if I'm way off-base here and overthinking it (seems like it would be a common problem?) or if one of these schemes does make sense for the use case.
One way to solve this is to use the LongAdder class that avoids the contention that plain old atomics suffer from.
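For example, a stats holder built on LongAdder might look like this (a sketch; the field names are illustrative):

```java
import java.util.concurrent.atomic.LongAdder;

// Writers call record() at high frequency; LongAdder stripes its value across
// cells internally, so contended updates don't fight over a single AtomicLong.
class JobStats {
    private final LongAdder totalNanos = new LongAdder();
    private final LongAdder jobCount   = new LongAdder();

    void record(long elapsedNanos) {        // hot path, called from worker threads
        totalNanos.add(elapsedNanos);
        jobCount.increment();
    }

    double averageNanos() {                 // slow path, called by the stat reader
        long count = jobCount.sum();        // sum() is an approximate snapshot
        return count == 0 ? 0.0 : (double) totalNanos.sum() / count;
    }
}
```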
A more hand-written approach would be to create some class that holds the statistics you want to gather for each thread, and then have an array of these objects such that each thread's stats object is in array[thread.getId() % NUM_THREADS]. The reader thread can then traverse the array and gather the stats as it pleases.
The trick to getting this to work efficiently is to avoid false sharing. That is, threads on different cores perform updates on their respective objects but those objects happen to reside on the same cacheline, causing massive amounts of unnecessary cache coherence traffic.
In Java 8, there is the @Contended annotation that you might want to look into. The old way of padding your class with a bunch of long fields doesn't work anymore, since unused fields will be optimized away.
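If you prefer the hand-written route, a variation that sidesteps the thread-id collision you mention is to let each thread lazily register its own stats object via a ThreadLocal (a sketch; class and field names are illustrative, and false sharing between shards is still something you would want to check):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Each worker thread gets its own Shard (registered on first use), so the hot
// path needs no synchronization; only the owning thread ever writes its shard.
// The reader walks all registered shards at its slower interval.
class ShardedStats {
    static final class Shard {
        volatile long totalNanos;   // volatile for visibility; single writer per shard
        volatile long count;
    }

    private final Queue<Shard> allShards = new ConcurrentLinkedQueue<>();
    private final ThreadLocal<Shard> myShard = ThreadLocal.withInitial(() -> {
        Shard s = new Shard();
        allShards.add(s);           // register so the reader can find it later
        return s;
    });

    void record(long elapsedNanos) {         // hot path, worker threads only
        Shard s = myShard.get();
        s.totalNanos += elapsedNanos;
        s.count++;
    }

    long totalCount() {                      // slow path, stat-reader thread
        long sum = 0;
        for (Shard s : allShards) sum += s.count;
        return sum;
    }
}
```

This plays fine with an ExecutorService as long as the pool's threads are long-lived (e.g. a fixed pool), since each pooled thread keeps its shard for its lifetime.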
I would suggest you use a different approach: the actor model.
The actor model provides a relatively simple but powerful model for designing and implementing applications that can distribute and share work across all system resources—from threads and cores to clusters of servers and data centers. It provides an effective framework for building applications with high levels of concurrency and for increasing levels of resource efficiency. Importantly, the actor model also has well-defined ways for handling errors and failures gracefully, ensuring a level of resilience that isolates issues and prevents cascading failures and massive downtime.
You can look at Akka, I think.
I'm designing a class that provides statistical information about groups of Collatz sequences. One of my goals is to be able to process a large number of sequences containing enormous terms (on the scale of hundreds or even thousands of digits) simultaneously, with maximum efficiency.
To this end, I plan on using the best data collection technique for each individual statistic, which means some tasks may be more efficiently dealt with by a ForkJoinPool, others by the standard cached and fixed thread pools provided in Executors. Would the overhead of creating multiple thread pools, or shutting one down and creating another, if I went that route, cost me more than I would save?
Would the overhead of creating multiple thread pools, or shutting one down and creating another, if I went that route, cost me more than I would save?
How could we possibly tell you that?
There is definitely an overhead in shutting down and restarting a thread pool of any kind. Creating threads is not cheap.
However, we have no way of quantifying how much you save by using different kinds of thread pool. If we can't quantify that it is impossible to advise you on whether your strategy will work ... or not.
(But I think that repeatedly shutting down and recreating thread pools would be a bad idea. The performance impact of an idle pool is minimal.)
This "smells" of premature optimization. (It is like trying to tune the engine of a racing car before you have manufactured the engine block!)
My advice would be to (largely1) forget about performance to start with. For now, focus on getting something that works. Here's what I would do:
1. Implement the code using the easiest strategy, write test cases, and test / debug until it works.
2. Choose a sample problem or set of problems that is typical of the kind you will be trying to solve.
3. Implement a test harness that allows you to measure the code's performance for the sample problems. (Beware of the standard problems with Java benchmarking ...)
4. Benchmark your code.
   Is it fast enough? Stop NOW.
   If not, continue.
5. Implement one of the alternative strategies, and test / debug.
6. Benchmark the modified code.
   Is it fast enough? Stop NOW.
   Is it clear that it doesn't help? Abandon it, and try another strategy.
   Can you tweak it? If so, try that.
7. Go to 5.
Also, it may be worthwhile implementing the different strategies in such a way that you can tune them or switch between them using command line or config file settings.
As a general rule, it is hard to determine a priori how well any complicated algorithm or strategy is going to perform. Generally speaking, there are too many factors to take into account for a theoretical ... or intuitive ... approach to give a reliable prediction. Benchmarking and tuning is the way to go.
1 - Obviously, if you know that some technique or algorithm will perform badly, and you have a better alternative that is about the same effort to implement ... do the sensible thing.
Since you are only talking about two different types of pools (fork-join and Executor-based pools), and you claim that at least some of your tasks are more suited to one type of pool or the other, it is entirely likely that the overhead of using two types of pools is worth it.
After all, you can just keep both types of pools alive, so there is only a one-time cost to setting up the pools and creating the threads, while the (apparent) benefit of the two pool types applies across the entirety of your processing. Since you are doing an "enormous" amount of work, even small benefits will eventually add up and overwhelm the one-time costs (which are probably measured in microseconds per thread).
Key to this observation is that there is no real ongoing overhead for existing but inactive threads in the pool you aren't using.
Of course, that said, the short answer is "just try both approaches and measure it!".
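For what it's worth, keeping both pool types alive for the whole run might look like this (a sketch; the names and sizes are arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

// Both pools are created once and reused, so the thread-creation cost is paid
// a single time while each kind of task goes to the pool that suits it.
class SequencePools {
    // divide-and-conquer statistics that split naturally into subtasks
    static final ForkJoinPool FORK_JOIN = new ForkJoinPool();

    // coarse-grained, independent tasks that don't split
    static final ExecutorService PLAIN =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    static void shutdown() {
        FORK_JOIN.shutdown();
        PLAIN.shutdown();
    }
}
```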
I need to migrate a Java program to Apache Spark. The current Java program heavily utilizes the functionality provided by java.util.concurrent and runs on a single machine. Since the initialization of a worker (Callable) is expensive, the workers are reused again and again - i.e. a worker reinserts itself into the pool once it terminates and has returned its result.
More precisely:
The current implementation works on small data sets in the range of 10E06 entries/few GBs.
The data contains entries that can be processed independently. That is, one could fire up one worker per task and submit it to the java thread pool.
However, setting up a worker for processing an entry involves loading more data and building graphs... altogether some GB of data AND CPU time in the range of minutes.
Some data can indeed be shared among the workers, e.g. some look-up tables, but it does not need to be. Some data is private to the worker and thus not shared. The worker may change the data while processing the entry and only later reset it in a fast manner, e.g. caches specific to the entry currently being processed. Thus, the worker can reinsert itself into the pool and start working on the next entry without going through the expensive initialization.
Runtime per worker and entry is in the range of seconds.
The workers hand back their results via an ExecutorCompletionService, i.e. the results are later retrieved by calling pool.take().get() in a central part of the program.
Getting to know Apache Spark I find most examples just use standard transformations and actions. I also find examples that add their own functions to the DAG by extending the API. Still, those examples all stick to simple lightweight calculations and come without initialization cost.
I now wonder what the best approach is to design a Spark application that reuses some kind of "heavy worker". The executors seem to be the only persistent entities that could possibly hold a pool of such workers. However, being new to the world of Spark, I am most likely missing some point...
Edit (2016-10-07):
I found an answer that points to a (possible) solution using Functions. So the question is, can I:
Split my data into partitions according to the number of executors,
Each executor gets exactly one partition to work on
My Function (called setup in the linked solution) creates a thread pool and reuses the workers
A separate combine function later merges the results
Your current architecture is a monolithic, multi-threaded architecture with shared state between the threads. Given that the size of your dataset is relatively modest for modern hardware you can parallelize it quite easily with Spark, where you will replace the threads with executors in the cluster's nodes.
From your question I understand that your two main concerns are whether Spark can handle complex parallel computations and how to share the necessary bits of state in a distributed environment.
Complicated business logic: Regarding the first part, you can run arbitrarily complicated business logic in the Spark Executors, which are the equivalent of the worker threads in your current architecture.
This blog post from cloudera explains well the concept along with other important concepts of the execution model:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
One aspect you will need to pay attention to, though, is the configuration of your Spark job, in order to avoid timeouts due to Executors taking too long to finish, which may be expected for an application with complicated business logic like yours.
Refer to the excellent page from DataBricks for more details, and more specifically to the execution behavior:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
Shared state: You can share complicated data structures like graphs and application configuration in Spark among the nodes. One approach which works well is Broadcast Variables, where a copy of the state is distributed to every node. Below are some very nice explanations of the concept:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
http://g-chi.github.io/2015/10/21/Spark-why-use-broadcast-variables/
This will shave latency off your application, while ensuring data locality.
The processing of your data can be performed on a per-partition basis (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html), with the results aggregated on the driver or with the use of Accumulators (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html). If the resulting data are complicated, the partition approach may work better and also gives you more fine-grained control over your application's execution.
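To make that concrete, here is a rough sketch with the Spark 2.x Java API (HeavyWorker, Entry and Result stand in for your existing worker and data types; they are not real classes here): the expensive initialization happens once per partition and the worker is then reused for every entry in that partition, while shared look-up tables travel as a broadcast variable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Hypothetical sketch: one heavy worker per partition, reused across entries.
class HeavyWorkerJob {
    static JavaRDD<Result> run(JavaSparkContext sc, JavaRDD<Entry> entries,
                               Map<String, String> lookupTables) {
        Broadcast<Map<String, String>> lookup = sc.broadcast(lookupTables);

        return entries
                .repartition(sc.defaultParallelism())              // e.g. one partition per core
                .mapPartitions((Iterator<Entry> it) -> {
                    HeavyWorker worker = new HeavyWorker(lookup.value()); // minutes of setup, once per partition
                    List<Result> out = new ArrayList<>();
                    while (it.hasNext()) {
                        out.add(worker.process(it.next()));        // seconds per entry
                        worker.reset();                            // cheap per-entry cleanup
                    }
                    return out.iterator();
                });
    }
}
```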
Regarding the hardware resource requirements, it seems that your application needs a few gigabytes for the shared state, which will need to stay in memory, and additionally a few more gigabytes for the data on every node. You can set the persistence model to MEMORY_AND_DISK in order to ensure that you won't run out of memory; more details at
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
Subject:
I’m trying to implement basic job scheduling in Java to handle recurring, persisted scheduled tasks (for a personal learning project). I don’t want to use any (ready-to-use) libraries like Quartz/Obsidian/Cron4J/etc.
Objective:
Jobs have to be persistent (to handle server shutdown)
Job execution time can take up to ~2-5 min.
Manage a large number of jobs
Multithread
Light and fast ;)
All my jobs are in a MySQL database.
JOB_TABLE (id, name, nextExecution,lastExecution, status(IDLE,PENDING,RUNNING))
Step by step:
Retrieve each job from “JOB_TABLE” where “nextExecution > now” AND “status = IDLE“. This step is executed every 10 min by a single thread.
For each job retrieved, I submit a new task to a ThreadPoolExecutor, then I update the job status to “PENDING” in my “JOB_TABLE”.
When the job thread is running, I update the job status to “RUNNING”.
When the job is finished, I update the lastExecution with current time, I set a new nextExecution time and I change the job status to “IDLE”.
When the server is starting, I put each PENDING/RUNNING job into the ThreadPoolExecutor.
Question/Observation:
Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20,000)?
Should I use a NoSQL solution instead of MySQL ?
Is this the best solution for such a use case?
This is a draft, there is no code behind. I’m open to suggestion, comments and criticism!
I have done similar to your task on a real project, but in .NET. Here is what I can recall regarding your questions:
Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20,000)?
We discovered that .NET's built-in thread pool was the worst approach, as the project was a web application. Reason: the web application relies on the built-in thread pool (which is static and thus shared for all uses within the running process) to run each request in a separate thread, while maintaining effective recycling of threads. Employing the same thread pool for our internal processing was going to exhaust it and leave no free threads for the user requests, or spoil their performance, which was unacceptable.
As you seem to be running quite a lot of jobs (20k is a lot for a single machine), you definitely should look for a custom thread pool. No need to write your own, though; I bet there are ready-made solutions, and writing one is far beyond what your study project would require (if I understand correctly, you are doing a school or university project).
Should I use a NoSQL solution instead of MySQL?
Depends. You obviously need to update the job status concurrently, thus you will have simultaneous access to one single table from multiple threads. Databases can scale pretty well to that, assuming you did your thing right. Here is what I mean by doing this right:
Design your code in a way that each job will affect only its own subset of rows in the database (this includes other tables). If you are able to do so, you will not need any explicit locks on the database level (in the form of transaction serialization levels). You can even use a liberal serialization level that may allow dirty or phantom reads - that will perform faster. But beware: you must carefully ensure no jobs will contend over the same rows. This is hard to achieve in real-life projects, so you should probably look for alternative approaches to DB locking.
Use appropriate transaction serialization mode. The transaction serialization mode defines the lock behavior on database level. You can set it to lock the entire table, only the rows you affect, or nothing at all. Use it wisely, as any misuse could affect the data consistency, integrity and the stability of the entire application or db server.
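For example, with plain JDBC the isolation level is set per connection, and an atomic conditional UPDATE lets competing scheduler threads claim a job without a table-level lock (a sketch; the URL, credentials and DAO shape are placeholders, while the table and columns follow the question's schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: READ_COMMITTED plus a conditional UPDATE means only one thread can
// move a given job from IDLE to PENDING; the losers simply update 0 rows.
class JobDao {
    boolean claim(long jobId) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/scheduler", "user", "password")) {
            con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE JOB_TABLE SET status = 'PENDING' WHERE id = ? AND status = 'IDLE'")) {
                ps.setLong(1, jobId);
                boolean claimed = ps.executeUpdate() == 1;  // 0 means another thread won the race
                con.commit();
                return claimed;
            }
        }
    }
}
```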
I am not familiar with NoSQL databases, so I can only advise you to research their concurrency capabilities and map them to your scenario. You could end up with a really suitable solution, but you have to check against your needs. From your description, you will have to support simultaneous data operations over the same type of objects (whatever the analog of a table is).
Is this the best solution for such a use case?
Yes and No.
Yes, as you will encounter one of the difficult tasks developers are facing in real world. I have worked with colleagues having more than 3 times my own experience and they were more reluctant to do multi-threading tasks than me, they really hated that. If you feel this area is interesting to you, play with it, learn and improve as much as you have to.
No, because if you are working on a real-life project, you need something reliable. If you have so many questions, you will obviously need time to mature and be able to produce a stable solution for such a task. Multi-threading is a difficult topic for many reasons:
It is hard to debug
It introduces many points of failure, you need to be aware of all of them
It could be a pain for other developers to assist or work with your code, unless you stick to commonly accepted rules.
Error handling can be tricky
Behavior is unpredictable / non-deterministic.
There are existing solutions with high level of maturity and reliability that are the preferred approach for real projects. Drawback is that you will have to learn them and examine how customizable they are for your needs.
Anyway, if you need to do it your way, and then port your achievement to a real project, or a project of your own, I can advise you to do this in a pluggable way. Use abstraction, programming to interfaces and other practices to decouple your own specific implementation from the logic that will set up the scheduled jobs. That way, you can adapt your API to an existing solution if this becomes a problem.
And last, but not least, I did not see any error-handling provisions on your side. Think about and research what to do if a job fails. At least add a 'FAILED' status or something to persist in such a case. Error handling is tricky when it comes to threads, so be thorough in your research and practices.
Good luck
You can declare the maximum pool size with ThreadPoolExecutor#setMaximumPoolSize(int). Since Integer.MAX_VALUE is larger than 20,000, then technically, yes, it can.
The other question is whether your machine will support running so many threads. You will have to provide enough RAM, since each thread allocates its own stack.
There should be no problem addressing ~20,000 threads on a modern desktop or laptop, but on a mobile device it could be an issue.
From doc:
Core and maximum pool sizes
A ThreadPoolExecutor will automatically adjust the pool size (see getPoolSize()) according to the bounds set by corePoolSize (see getCorePoolSize()) and maximumPoolSize (see getMaximumPoolSize()). When a new task is submitted in method execute(java.lang.Runnable), and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If there are more than corePoolSize but less than maximumPoolSize threads running, a new thread will be created only if the queue is full. By setting corePoolSize and maximumPoolSize the same, you create a fixed-size thread pool. By setting maximumPoolSize to an essentially unbounded value such as Integer.MAX_VALUE, you allow the pool to accommodate an arbitrary number of concurrent tasks. Most typically, core and maximum pool sizes are set only upon construction, but they may also be changed dynamically using setCorePoolSize(int) and setMaximumPoolSize(int).
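For illustration, a fixed-size pool that queues the rest instead of running all ~20,000 threads at once (a sketch; the sizes are arbitrary):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// corePoolSize == maximumPoolSize gives a fixed-size pool; the unbounded queue
// holds the (potentially ~20,000) submitted jobs while only POOL_SIZE run.
class JobPool {
    static final int POOL_SIZE = Runtime.getRuntime().availableProcessors() * 2; // arbitrary
    static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            POOL_SIZE, POOL_SIZE,
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>());
}
```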
About the DB: create a solution that does not depend on the DB structure. Then you can set up two environments and measure it. Start with the technology that you know, but stay open to other solutions. At the beginning, a relational DB should keep up with the performance, and if you manage it properly it should not be an issue later. NoSQL databases are used to work with really big data. But the best option for you is to build both and run some performance tests.
In a life without Java Executors, new threads would have to be created for each Runnable task. Making new threads incurs thread overhead (creation and teardown) that adds complexity and wasted time to a non-Executor program.
Referring to code:
no Java Executor -
new Thread(aRunnableObject).start();
with Java Executor -
Executor executor = some Executor factory method;
executor.execute(aRunnable);
Bottom line is that Executors abstract the low-level details of how to manage threads.
Is that true?
Thanks.
Bottom line is that Executors abstract the low-level details of how to manage threads. Is that true?
Yes.
They deal with issues such as creating the thread objects, maintaining a pool of threads, controlling the number of threads that are running, and graceful / less than graceful shutdown. Doing these things by hand is non-trivial.
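A minimal sketch of that division of labour: the ExecutorService owns thread creation, reuse and shutdown, while the calling code only deals with tasks and results.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);  // pool of 4 reusable threads

        Future<Integer> answer = executor.submit(() -> 6 * 7);       // Callable runs on a pooled thread
        executor.execute(() -> System.out.println("fire-and-forget Runnable"));

        System.out.println("answer = " + answer.get());              // blocks until the Callable finishes

        executor.shutdown();  // graceful: lets queued tasks finish, then stops the threads
    }
}
```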
EDIT
There may or may not be a performance hit in doing this ... compared with a custom implementation perfectly tuned to the precise needs of your application. But the chances are that:
your custom implementation wouldn't be perfectly tuned, and
the performance difference wouldn't be significant anyway.
Besides, the Executor support classes allow you to simply tune various parameters (e.g. thread pool sizes) if there is an issue that needs to be addressed. I don't see how garbage collection overheads would be significantly impacted by using Executors, one way or the other.
As a general rule, you should focus on writing your applications simply and robustly (e.g. using the high level concurrency support classes), and only worry about performance if:
your application is running "too slow", and
the profiling tools tell you that you've got a problem in a particular area.
A couple of benefits of Executors compared to plain threads:
Throttling can be achieved easily by varying the size of thread pools. This helps keep the number of threads flowing through your application in check, and is particularly helpful when benchmarking your application for load bearing.
Better management of Runnable tasks can be achieved using a RejectedExecutionHandler.
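Both points together in a small sketch (the sizes and the policy choice are just examples):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThrottledPool {
    public static void main(String[] args) {
        // A bounded pool plus a bounded queue throttles how much work flows through
        // the application; the RejectedExecutionHandler decides what happens when
        // even the queue is full. CallerRunsPolicy makes the submitting thread run
        // the task itself, which naturally slows the producers down.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 8,                                        // core and maximum pool size
                60, TimeUnit.SECONDS,                        // idle time before extra threads die
                new ArrayBlockingQueue<>(100),               // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy());  // rejection policy

        for (int i = 0; i < 1_000; i++) {
            final int n = i;
            pool.execute(() -> System.out.println("task " + n + " on "
                    + Thread.currentThread().getName()));
        }
        pool.shutdown();
    }
}
```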
I think all that Executors do is take care of the low-level tasks for you, but you still have to judiciously decide which thread pool you want. I mean, if your use case needs a maximum of 5 threads and you go and use a thread pool with 100 threads, then it is certainly going to have an impact on performance. Other than that, there is nothing extra being done at the low level which is going to halt the system. And last of all, it is always better to get an idea of what is being done at the low level, so that we have a fair idea of what is going on underneath.