Shard resources by thread?

Shard resources by thread? - java

I have a (limited) thread pool which executes CPU-bound tasks. I'd like to aggregate some numerical statistics from each of these threads in a single place. Basically: each thread will update some shared stats (e.g. how long its job took) at a very high frequency and, at some much slower interval, a 'stat reader' would query those stats.
My first thought was to use some shared atomics and update them from each thread. This works ok, but in my testing the overhead of the atomics can get pretty high with a lot of contention so I was trying to think of some other alternatives.
My second though was a sort of 'sharding' scheme, where each thread had its own stats object that it could update without requiring any synchronization. The 'stat reader' could then aggregate the stats from each thread into an overall stat value.
My first question is: does the thread sharding scheme make sense? Does something like that exist that I'm reinventing?
My second question is: if the sharding scheme does make sense, I'm trying to think of the best way to map threads to their shard:
1) Use the thread's ID mod some shard value to get a shard index, but I don't think that's reliable as I think the thread id value is shared, so I could get a collision.
2) Adding a thread-local index to the thread, but I don't think that will play nicely with the ExecutorService.
3) I could subclass Thread, but then I'd have to cast it when I wanted to access this which I'd rather avoid, if possible.
4) When the thread is created, create a mapping of its name to its shard. This would work, but there would be a race when creating the threads: one could be looking up its shard while we're adding a new shard to the map, causing concurrency issues.
Wondering if I'm way off-base here and overthinking it (seems like it would be a common problem?) or if one of these schemes does make sense for the use case.

One way to solve this is to use the LongAdder class that avoids the contention that plain old atomics suffer from.
A more hand-written approach would be to create some class that holds the statistics you want to gather for each thread, and then have an array of these objects such that each thread's stats object is in array[thread.getId() % NUM_THREADS]. The reader thread can then traverse the array and gather the stats as it pleases.
The trick to getting this to work efficiently is to avoid false sharing. That is, threads on different cores perform updates on their respective objects but those objects happen to reside on the same cacheline, causing massive amounts of unnecessary cache coherence traffic.
In Java 8, there is the #Contended annotation that you might want to look into. The old way of padding your class with a bunch of long fields doesn't work anymore since unused fields will be optimized away.

I would suggest you use different way: Actor.
The actor model provides a relatively simple but powerful model for designing and implementing applications that can distribute and share work across all system resources—from threads and cores to clusters of servers and data centers. It provides an effective framework for building applications with high levels of concurrency and for increasing levels of resource efficiency. Importantly, the actor model also has well-defined ways for handling errors and failures gracefully, ensuring a level of resilience that isolates issues and prevents cascading failures and massive downtime.
You can turn to Akka i think.

Related

java mechanical sympathy trough thread pinning

Given we have an application that is heavily polluted with concurrency constructs
multiple techniques are used (different people worked without clear architecture in mind),
multiple questionable locks that are there "just in case", thread safe queues. CPU usage is around 20%.
Now my goal is to optimize it such that it is making better use of caches and generally improve its performance and service time.
I'm considering to pin the parent process to a single core, remove all things that cause membars,
replace all thread safe data structures and replace all locks with some UnsafeReentrantLock
which would simply use normal reference field but take care of exclusive execution
needs...
I expect that we would end up with much more cache friendly application,
since we don't have rapid cache flushes all the time (no membars).
We would have less overhead since we dont need thread safe data structures,
volaties, atomics and replace all sorts of locks with I would assume that service time would improve also,
since we no longer synchronize on multiple thread safe queues...
Is there something that I'm overlooking here?
Maybe blocking operations would have to be paid attention to since they would not show up in that 20% usage?

Migrating expensive to initialize java.util.concurrent.Callables to Apache Spark

I need to migrate a Java program to Apache Spark. The current Java heavily utilizes the functionality provided by java.util.concurrent and runs on a single machine. Since the initialization of a worker (Callable) is expensive, the workers are reused again and again - i.e. a worker reinserts itself into the pool once it terminates and has returned its result.
More precise:
The current implementation works on small data sets in the range of 10E06 entries/few GBs.
The data contains entries that can be processed independently. That is, one could fire up one worker per task and submit it to the java thread pool.
However, setting up a worker for processing an entry involves loading more data in, building graphs... all together some GB AND cpu time in the range of minutes.
Some data can indeed be shared among the workers e.g. some look-up tables but does not need to. Some data is private to the worker and thus not shared. The worker may change the data while processing the entry and only later reset it in a fast manner, e.g. caches specific to the entry currently processed. Thus, the worker can reinsert itself in the pool and start working on the next entry without going though the expensive initialization.
Runtime per worker and entry is in the range of seconds.
The workers hand back their results via an ExecutorCompletionService, i.e. the results are later retrieved by calling pool.take().get() in a central part of the program.
Getting to know Apache Spark I find most examples just use standard transformations and actions. I also find examples that add their own functions to the DAG by extending the API. Still, those examples all stick to simple lightweight calculations and come without initialization cost.
I now wonder what is the best approach to design a Spark application that reuses some kind of "heavy worker". The executors seem to be the only persistent entities that could possibly hold a pool of such workers. However, being new to the world of Spark I most likely miss some point...
edited 20161007
Found an answer that points to a (possible) solution using Functions. So the question is, can I
Split my partition according to the number of executors,
Each executor gets exactly one partition to work on
My Function (called setup in the linked solution) creates a thread pool and reuses the workers
A separate combine function later merges the results

Your current architecture is a monolithic, multi-threaded architecture with shared state between the threads. Given that the size of your dataset is relatively modest for modern hardware you can parallelize it quite easily with Spark, where you will replace the threads with executors in the cluster's nodes.
From your question I understand that your two main concerns is whether Spark can handle complex parallel computations and how to share the necessary bits of state in a distributed environment.
Complicated business logic: Regarding the first part, you can run arbitrarily complicated business logic in the Spark Executors, which are the equivalent of the worker threads in your current architecture.
This blog post from cloudera explains well the concept along with other important concepts of the execution model:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
One aspect you will need to pay attention to it though, is the configuration of your Spark job, in order to avoid timeouts due to Executors taking too long to finish, which may be expected for an application with complicated business logic like yours.
Refer to the excellent page from DataBricks for more details, and more specifically to the execution behavior:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
Shared state: You can share complicated data structures like graphs and application configuration in Spark among the nodes. One approach which works well is Broadcast Variables, where a copy of the state to be distributed is distributed to every node. Below are some very nice explanations of the concept:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
http://g-chi.github.io/2015/10/21/Spark-why-use-broadcast-variables/
This will shave the latency from your application, while ensuring data locality.
The processing of your data can be performed on a partition based (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html), with the results aggregated on the driver or with the use of Accumulators (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html). In case the resulted data are complicated, the partition approach may work better and also gives you more fine grained control over your applications execution.
Regarding the hardware resource requirements, it seems that your application needs a few Gigabytes for the shared state, which will need to stay in memory and additionally a few more Gigabytes for the data in every node. You can set the persistence model to MEMORY_AND_DISK in order to ensure that you wont run out of memory, more details at
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

Scheduled Job Task

Subject:
I’m trying to implement a basic job scheduling in Java to handle recurrent persisted scheduled task (for a personal learn project). I don’t want to use any (ready-to-use) libraries like Quartz/Obsidian/Cron4J/etc.
Objective:
Job have to be persistent (to handle server shutdown)
Job execution time can take up to ~2-5 mn.
Manage a large amount of job
Multithread
Light and fast ;)
All my job are in a MySQL Database.
JOB_TABLE (id, name, nextExecution,lastExecution, status(IDLE,PENDING,RUNNING))
Step by step:
Retrieve each job from “JOB_TABLE” where “nextExecution > now” AND “status = IDLE“. This step is executed every 10mn by a single thread.
For each job retrieved, I put a new thread in a ThreadPoolExecutor then I update the job status to “PENDING” in my “JOB_TABLE”.
When the job thread is running, I update the job status to “RUNNING”.
When the job is finished, I update the lastExecution with current time, I set a new nextExecution time and I change the job status to “IDLE”.
When server is starting, I put each PENDING/RUNNING job in the ThreadPoolExecutor.
Question/Observation:
Step 2 : Will the ThreadPoolExecutor handle a large amount of thread (~20000) ?
Should I use a NoSQL solution instead of MySQL ?
Is it the best solution to deal with such use case ?
This is a draft, there is no code behind. I’m open to suggestion, comments and criticism!

I have done similar to your task on a real project, but in .NET. Here is what I can recall regarding your questions:
Step 2 : Will the ThreadPoolExecutor handle a large amount of thread (~20000)?
We discovered that .NET's built-in thread pool was the worst approach, as the project was a web application. Reason: the web application relies on the built-in thread pool (which is static and thus shared for all uses within the running process) to run each request in separate thread, while maintain effective recycling of threads. Employing the same thread pool for our internal processing was going to exhaust it and leave no free threads for the user requests, or spoil their performance, which was unacceptable.
As you seem to be running quite a lot of jobs (20k is a lot for a single machine) then you definitely should look for a custom thread pool. No need to write your own though, I bet there are ready solutions and writing one is far beyond what your study project would require* see the comments (if I understand correctly you are doing a school or university project).
Should I use a NoSQL solution instead of MySQL?
Depends. You obviously need to update the job status concurrently, thus, you will have simultaneous access to one single table from multiple threads. Databases can scale pretty well to that, assuming you did your thing right. Here is what I refer to doing this right:
Design your code in a way that each job will affect only its own subset of rows in the database (this includes other tables). If you are able to do so, you will not need any explicit locks on database level (in the form of transaction serialization levels). You can even enforce a liberal serialization level that may allow dirty or phantom reads - that will perform faster. But beware, you must carefully ensure no jobs will concur over the same rows. This is hard to achieve in real-life projects, so you should probably look for alternative approaches in db locking.
Use appropriate transaction serialization mode. The transaction serialization mode defines the lock behavior on database level. You can set it to lock the entire table, only the rows you affect, or nothing at all. Use it wisely, as any misuse could affect the data consistency, integrity and the stability of the entire application or db server.
I am not familiar with NoSQL database, so I can only advice you to research on the concurrency capabilities and map them to your scenario. You could end up with a really suitable solution, but you have to check according to your needs. From your description, you will have to support simultaneous data operations over the same type of objects (what is the analog for a table).
Is it the best solution to deal with such use case ?
Yes and No.
Yes, as you will encounter one of the difficult tasks developers are facing in real world. I have worked with colleagues having more than 3 times my own experience and they were more reluctant to do multi-threading tasks than me, they really hated that. If you feel this area is interesting to you, play with it, learn and improve as much as you have to.
No, because if you are working on a real-life project, you need something reliable. If you have so many questions, you will obviously need time to mature and be able to produce a stable solution for such a task. Multi-threading is a difficult topic for many reasons:
It is hard to debug
It introduces many points of failure, you need to be aware of all of them
It could be a pain for other developers to assist or work with your code, unless you sticked to commonly accepted rules.
Error handling can be tricky
Behavior is unpredictable / undeterministic.
There are existing solutions with high level of maturity and reliability that are the preferred approach for real projects. Drawback is that you will have to learn them and examine how customizable they are for your needs.
Anyway, if you need to do it your way, and then port your achievement to a real project, or a project of your own, I can advice you to do this in a pluggable way. Use abstraction, programming to interfaces and other practices to decouple your own specific implementation from the logic that will set the scheduled jobs. That way, you can adapt your api to an existing solution if this becomes a problem.
And last, but not least, I did not see any error-handling predictions on your side. Think and research on what to do if a job fails. At least add a 'FAILED' status or something to persist in such case. Error handling is tricky when it comes to threads, so be thorough on your research and practices.
Good luck

You can declare the maximum pool size with ThreadPoolExecutor#setMaximumPoolSize(int). As Integer.MAX is larger 20000 then technically yes it can.
The other question is that does your machine wold support so many thread to run. You will have provide enough RAM so each tread will allocate on stack.
Thee should not be problem to address ~20,000 threads on modern desktop or laptop but on mobile device it could be an issue.
From doc:
Core and maximum pool sizes
A ThreadPoolExecutor will automatically
adjust the pool size (see getPoolSize()) according to the bounds set
by corePoolSize (see getCorePoolSize()) and maximumPoolSize (see
getMaximumPoolSize()). When a new task is submitted in method
execute(java.lang.Runnable), and fewer than corePoolSize threads are
running, a new thread is created to handle the request, even if other
worker threads are idle. If there are more than corePoolSize but less
than maximumPoolSize threads running, a new thread will be created
only if the queue is full. By setting corePoolSize and maximumPoolSize
the same, you create a fixed-size thread pool. By setting
maximumPoolSize to an essentially unbounded value such as
Integer.MAX_VALUE, you allow the pool to accommodate an arbitrary
number of concurrent tasks. Most typically, core and maximum pool
sizes are set only upon construction, but they may also be changed
dynamically using setCorePoolSize(int) and setMaximumPoolSize(int).
More
About the DB. Create a solution that is not depend to DB structure. Then you can set up two enviorements and measure it. Start with the technology that you know. But keep open to other solutions. At the begin the relations DB should keep up with the performance. And if you mange it properly the it should not be an issue later. The NoSQL are used to work with really big data. But the best for you is to create both and run some performace tests.

Thread Pool vs Many Individual Threads

I'm in the middle of a problem where I am unable decide which solution to take.
The problem is a bit unique. Lets put it this way, i am receiving data from the network continuously (2 to 4 times per second). Now each data belongs to a different, lets say, group.
Now, lets call these groups, group1, group2 and so on.
Each group has a dedicated job queue where data from the network is filtered and added to its corresponding group for processing.
At first I created a dedicated thread per group which would take data from the job queue, process it and then goes to blocking state (using Linked Blocking Queue).
But my senior suggested that i should use thread pools because this way threads wont get blocked and will be usable by other groups for processing.
But here is the thing, the data im getting is fast enough and the time a thread takes to process it is long enough for the thread to, possibly, not go into blocking mode. And this will also guarantee that data gets processed sequentially (job 1 gets done before job 2), which in pooling, very little chances are, might not happen.
My senior is also bent on the fact that pooling will also save us lots of memory because threads are POOLED (im thinking he really went for the word ;) ). While i dont agree to this because, i personally think, pooled or not each thread gets its own stack memory. Unless there is something in thread pools which i am not aware of.
One last thing, I always thought that pooling helps where jobs appear in a big number for short time. This makes sense because thread spawning would be a performance kill because of the time taken to init a thread is lot more than time spent on doing the job. So pooling helps a lot here.
But in my case group1, group2,...,groupN always remain alive. So if there is data or not they will still be there. So thread spawning is not the issue here.
My senior is not convinced and wants me to go with the pooling solution because its memory footprint is great.
So, which path to take?
Thank you.

Good question.
Pooling indeed saves you initialization time, as you said. But it has another aspect: resource management. And here I am asking you this- just how many groups (read- dedicated threads) do you have?
do they grow dynamically during the execution span of the application?
For example, consider a situation where the answer to this question is yes. new Groups types are added dynamically. In this case, you might not want to dedicate a a thread to each one since there is technically no restrictions on the amount of groups that will be created, you will create a lot of threads and the system will be context switching instead of doing real work.
Threadpooling to the rescue- thread pool allows you to specify a restriction on the maxumal number of threads that could be possibly created, with no regard to load. So the application may deny service from certain requests, but the ones that get through are handled properly, without critically depleting the system resources.
Considering the above, I is very possible that in your case, it is very much OK to have a dedicated
thread for each group!
The same goes for your senior's conviction that it will save memory.. Indeed, a thread takes up memory on the heap, but is it really so much, if it is a predefined amount, say 5. Even 10- it is probably OK. Anyway, you should not use pooling unless you are a-priory and absolutely convinced that you actually have a problem!
Pooling is a design decision, not an architectural one. You can not-pool at the beggining and proceed with optimizations in case you find pooling to be beneficial after you encountered a performance issue.
Considering the serialization of requests (in order execution) it is no matter whether you are using a threadpool or a dedicated thread. The sequential execution is a property of the queue coupled with a single handler thread.

Creating a thread will consume resources, including the default stack per thread (IIR 512Kb, but configurable). So the advantage to pooling is that you incur a limited resource hit. Of course you need to size your pool according to the work that you have to perform.
For your particular problem, I think the key is to actually measure performance/thread usage etc. in each scenario. Unless your running into constraints I perhaps wouldn't worry either way, other than to make sure that you can swap one implementation for another without a major impact on your application. Remember that premature optimisation is the root of all evil. Note that:
"Premature optimization" is a phrase used to describe a situation
where a programmer lets performance considerations affect the design
of a piece of code. This can result in a design that is not as clean
as it could have been or code that is incorrect, because the code is
complicated by the optimization and the programmer is distracted by
optimizing.

Performance ConcurrentHashmap vs HashMap

How is the performance of ConcurrentHashMap compared to HashMap, especially .get() operation (I'm especially interested for the case of only few items, in the range between maybe 0-5000)?
Is there any reason not to use ConcurrentHashMap instead of HashMap?
(I know that null values aren't allowed)
Update
just to clarify, obviously the performance in case of actual concurrent access will suffer, but how compares the performance in case of no concurrent access?

I was really surprised to find this topic to be so old and yet no one has yet provided any tests regarding the case. Using ScalaMeter I have created tests of add, get and remove for both HashMap and ConcurrentHashMap in two scenarios:
using single thread
using as many threads as I have cores available. Note that because HashMap is not thread-safe, I simply created separate HashMap for each thread, but used one, shared ConcurrentHashMap.
Code is available on my repo.
The results are as follows:
X axis (size) presents number of elements written to the map(s)
Y axis (value) presents time in milliseconds
The summary
If you want to operate on your data as fast as possible, use all the threads available. That seems obvious, each thread has 1/nth of the full work to do.
If you choose a single thread access use HashMap, it is simply faster. For add method it is even as much as 3x more efficient. Only get is faster on ConcurrentHashMap, but not much.
When operating on ConcurrentHashMap with many threads it is similarly effective to operating on separate HashMaps for each thread. So there is no need to partition your data in different structures.
To sum up, the performance for ConcurrentHashMap is worse when you use with single thread, but adding more threads to do the work will definitely speed-up the process.
Testing platform
AMD FX6100, 16GB Ram
Xubuntu 16.04, Oracle JDK 8 update 91, Scala 2.11.8

Thread safety is a complex question. If you want to make an object thread safe, do it consciously, and document that choice. People who use your class will thank you if it is thread safe when it simplifies their usage, but they will curse you if an object that once was thread safe becomes not so in a future version. Thread safety, while really nice, is not just for Christmas!
So now to your question:
ConcurrentHashMap (at least in Sun's current implementation) works by dividing the underlying map into a number of separate buckets. Getting an element does not require any locking per se, but it does use atomic/volatile operations, which implies a memory barrier (potentially very costly, and interfering with other possible optimisations).
Even if all the overhead of atomic operations can be eliminated by the JIT compiler in a single-threaded case, there is still the overhead of deciding which of the buckets to look in - admittedly this is a relatively quick calculation, but nevertheless, it is impossible to eliminate.
As for deciding which implementation to use, the choice is probably simple.
If this is a static field, you almost certainly want to use ConcurrentHashMap, unless testing shows this is a real performance killer. Your class has different thread safety expectations from the instances of that class.
If this is a local variable, then chances are a HashMap is sufficient - unless you know that references to the object can leak out to another thread. By coding to the Map interface, you allow yourself to change it easily later if you discover a problem.
If this is an instance field, and the class hasn't been designed to be thread safe, then document it as not thread safe, and use a HashMap.
If you know that this instance field is the only reason the class isn't thread safe, and are willing to live with the restrictions that promising thread safety implies, then use ConcurrentHashMap, unless testing shows significant performance implications. In that case, you might consider allowing a user of the class to choose a thread safe version of the object somehow, perhaps by using a different factory method.
In either case, document the class as being thread safe (or conditionally thread safe) so people who use your class know they can use objects across multiple threads, and people who edit your class know that they must maintain thread safety in future.

I would recommend you measure it, since (for one reason) there may be some dependence on the hashing distribution of the particular objects you're storing.

The standard hashmap provides no concurrency protection whereas the concurrent hashmap does. Before it was available, you could wrap the hashmap to get thread safe access but this was coarse grain locking and meant all concurrent access got serialised which could really impact performance.
The concurrent hashmap uses lock stripping and only locks items that affected by a particular lock. If you're running on a modern vm such as hotspot, the vm will try and use lock biasing, coarsaning and ellision if possible so you'll only pay the penalty for the locks when you actually need it.
In summary, if your map is going to be accesaed by concurrent threads and you need to guarantee a consistent view of it's state, use the concurrent hashmap.

In the case of a 1000 element hash table using 10 locks for whole table saves close to half the time when 10000 threads are inserting and 10000 threads are deleting from it.
The interesting run time difference is here
Always use Concurrent data structure. except when the downside of striping (mentioned below) becomes a frequent operation. In that case you will have to acquire all the locks? I read that the best ways to do this is by recursion.
Lock striping is useful when there is a way of breaking a high contention lock into multiple locks without compromising data integrity. If this is possible or not should take some thought and is not always the case. The data structure is also the contributing factor to the decision. So if we use a large array for implementing a hash table, using a single lock for the entire hash table for synchronizing it will lead to threads sequentially accessing the data structure. If this is the same location on the hash table then it is necessary but, what if they are accessing the two extremes of the table.
The down side of lock striping is it is difficult to get the state of the data structure that is affected by striping. In the example the size of the table, or trying to list/enumerate the whole table may be cumbersome since we need to acquire all of the striped locks.

What answer are you expecting here?
It is obviously going to depend on the number of reads happening at the same time as writes and how long a normal map must be "locked" on a write operation in your app (and whether you would make use of the putIfAbsent method on ConcurrentMap). Any benchmark is going to be largely meaningless.

It's not clear what your mean. If you need thread safeness, you have almost no choice - only ConcurrentHashMap. And it's definitely have performance/memory penalties in get() call - access to volatile variables and lock if you're unlucky.

Of course a Map without any lock system wins against one with thread-safe behavior which needs more work.
The point of the Concurrent one is to be thread safe without using synchronized so to be faster than HashTable.
Same graphics would would be very interesting for ConcurrentHashMap vs Hashtable (which is synchronized).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.