Quick background: As I go back and redesign some critical parts of an application, I keep wondering about locking and its impact on performance. The app has a large tree-style data structure which caches data/DTOs from the database. Updates to the large tree can come about in two main ways: 1. user-triggered commands, 2. automatic updates from jobs running in the background.
When either operation type occurs (user/auto), I am locking down (explicitly locking) the data structure. I was running into consistency issues, so locking down everything seemed to make the most sense to protect the integrity of the data in the cache.
Question: Since many auto updates can occur at once, I was thinking of implementing some kind of queue (JMS maybe) to handle instructions to the data structure, where any user-driven updates get pushed to the top and handled first. When it comes to handling a bulk set of auto "tasks" of unknown size, I am trying to figure out whether I should let them run and lock individually, or batch them together by time and take the lock once. The real crux of the problem is that any one of the update tasks could affect the entire tree.
In terms of overall performance (general, nothing specific), is it more efficient to have many transactions, each taking the lock and potentially doing large updates, or to combine them into one massive bulk update and lock only once, but for a lot longer? I know a lot of this probably hinges on the data, the type of updates, the frequency, etc. I just didn't know if there was a general rule of thumb of "smaller, more frequent locks" versus "one large, potentially longer lock".
I think the answer depends on whether your program spends any significant time with the data structure unlocked. If it does not, I recommend locking once for all pending updates.
The reason is that other threads waiting for the lock may get woken up and then uselessly sent back to sleep when the update thread quickly locks the resource again. Or the update is interrupted by another thread, which is likely bad for cache utilization. Also, there is a cost to locking, which may be small compared to your update: pipelines may have to be flushed, memory accesses may not be freely reordered, etc.
If the thread spends some time between updates without having to lock the data structure, I would consider relocking for every update, if it is expected that other threads can complete their transactions in between and contention is thereby reduced.
Note that when there are different priorities for different updates like I presume for your user updates versus the background updates, it may be a bad idea to lock down the data structure for a long time for lower priority updates if this could in any way prevent higher priority tasks from running.
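To make the two options concrete, here is a minimal sketch under a plain ReentrantLock; BatchingUpdater, TreeCache, and Update are made-up names for illustration, not from the question:

    import java.util.Queue;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical wrapper around the cached tree; TreeCache and Update are
    // illustrative stand-ins, not types from the original application.
    class BatchingUpdater {
        private final ReentrantLock lock = new ReentrantLock();
        private final TreeCache cache;

        BatchingUpdater(TreeCache cache) { this.cache = cache; }

        // Option A: lock once per update (more chances for other threads to
        // interleave, but more lock handoffs and wakeups).
        void applyOne(Update u) {
            lock.lock();
            try {
                cache.apply(u);
            } finally {
                lock.unlock();
            }
        }

        // Option B: drain everything already pending under a single lock
        // acquisition (fewer wakeups, but the lock is held longer).
        void applyBatch(Queue<Update> pending) {
            lock.lock();
            try {
                Update u;
                while ((u = pending.poll()) != null) {
                    cache.apply(u);
                }
            } finally {
                lock.unlock();
            }
        }
    }

    // Minimal stand-ins so the sketch compiles.
    interface Update { }
    interface TreeCache { void apply(Update u); }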
If you end up implementing some kind of queue, then you lose all concurrency: every update gets serialized through it. If you get 1000 requests at once, think of how inefficient that is.
Try taking a look at this code for concurrent trees.
https://github.com/npgall/concurrent-trees
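For reference, a small usage sketch of that library, going from memory of its README (ConcurrentRadixTree with a DefaultCharArrayNodeFactory), so double-check the exact classes there:

    import com.googlecode.concurrenttrees.radix.ConcurrentRadixTree;
    import com.googlecode.concurrenttrees.radix.RadixTree;
    import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory;

    public class ConcurrentTreeExample {
        public static void main(String[] args) {
            // Reads are designed to proceed without blocking writers;
            // writes are synchronized internally by the library.
            RadixTree<Integer> tree =
                    new ConcurrentRadixTree<>(new DefaultCharArrayNodeFactory());

            tree.put("group1", 1);
            tree.put("group2", 2);

            System.out.println(tree.getValueForExactKey("group1")); // 1
            System.out.println(tree.getKeysStartingWith("group"));  // group1, group2
        }
    }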
I understand that if this were about a HashMap or some other complex object, I would still need to add synchronized. But is this also the case for primitives? My intuitive feeling is that I don't need it, but I'm not certain.
If you do not add a 'happens-before' relation between a read and a write, you could end up with a data race. If there is a data race, all bets are off. The compiler could optimize the code in such a way that you will never see the new value.
If you want to have very cheap access, you could do an acquire load and a store release.
E.g. https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html
AtomicLong.getAcquire
AtomicLong.setRelease
On the x86 platform, every load is an acquire load and every store is a release store, so you get this for free at the hardware level. However, it will still prevent certain compiler optimizations.
If you care a little less about extreme performance, then a volatile would be sufficient. This gives you a sequentially consistent load and store. The main issue at the hardware level is that a volatile write blocks the CPU from executing any subsequent loads until the store buffer is drained to the L1 cache. A volatile load is equally expensive at the hardware level as an acquire load; the price for sequential consistency is paid at the write.
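A minimal sketch of the acquire/release route, assuming Java 9+ (where getAcquire/setRelease exist on AtomicLong); JobTimer is just an illustrative name:

    import java.util.concurrent.atomic.AtomicLong;

    // A value written by one thread and read by others with acquire/release
    // semantics instead of full sequential consistency.
    class JobTimer {
        private final AtomicLong lastDurationNanos = new AtomicLong();

        // Writer thread: the release store makes prior writes visible to
        // readers that subsequently perform an acquire load.
        void recordDuration(long nanos) {
            lastDurationNanos.setRelease(nanos);
        }

        // Reader thread: the acquire load pairs with the release store above.
        long readDuration() {
            return lastDurationNanos.getAcquire();
        }
    }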
I have a (limited) thread pool which executes CPU-bound tasks. I'd like to aggregate some numerical statistics from each of these threads in a single place. Basically: each thread will update some shared stats (e.g. how long its job took) at a very high frequency and, at some much slower interval, a 'stat reader' would query those stats.
My first thought was to use some shared atomics and update them from each thread. This works OK, but in my testing the overhead of the atomics can get pretty high under a lot of contention, so I was trying to think of some alternatives.
My second thought was a sort of 'sharding' scheme, where each thread has its own stats object that it can update without requiring any synchronization. The 'stat reader' can then aggregate the stats from each thread into an overall value.
My first question is: does the thread sharding scheme make sense? Does something like that exist that I'm reinventing?
My second question is: if the sharding scheme does make sense, I'm trying to think of the best way to map threads to their shard:
1) Use the thread's ID mod some shard count to get a shard index, but I don't think that's reliable, as I think thread ID values can be reused, so I could get a collision.
2) Adding a thread-local index to the thread, but I don't think that will play nicely with the ExecutorService.
3) I could subclass Thread, but then I'd have to cast it whenever I wanted to access it, which I'd rather avoid if possible.
4) When the thread is created, create a mapping of its name to its shard. This would work, but there would be a race when creating the threads: one could be looking up its shard while we're adding a new shard to the map, causing concurrency issues.
Wondering if I'm way off-base here and overthinking it (seems like it would be a common problem?) or if one of these schemes does make sense for the use case.
One way to solve this is to use the LongAdder class, which avoids the contention that plain old atomics suffer from.
A more hand-written approach would be to create some class that holds the statistics you want to gather for each thread, and then have an array of these objects such that each thread's stats object is in array[thread.getId() % NUM_THREADS]. The reader thread can then traverse the array and gather the stats as it pleases.
The trick to getting this to work efficiently is to avoid false sharing. That is, threads on different cores perform updates on their respective objects but those objects happen to reside on the same cacheline, causing massive amounts of unnecessary cache coherence traffic.
In Java 8, there is the @Contended annotation that you might want to look into. The old way of padding your class with a bunch of long fields doesn't work anymore, since unused fields will be optimized away.
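For instance, a LongAdder-based version of the stats object might look roughly like this (JobStats and the field names are illustrative, not from the question):

    import java.util.concurrent.atomic.LongAdder;

    // Workers add with low contention; the stat reader sums at its slower interval.
    class JobStats {
        private final LongAdder totalNanos = new LongAdder();
        private final LongAdder jobCount  = new LongAdder();

        // Called by worker threads at high frequency.
        void record(long elapsedNanos) {
            totalNanos.add(elapsedNanos);
            jobCount.increment();
        }

        // Called by the stat reader. The two sums are not an atomic snapshot,
        // which is usually acceptable for monitoring-style statistics.
        double averageNanos() {
            long count = jobCount.sum();
            return count == 0 ? 0.0 : (double) totalNanos.sum() / count;
        }
    }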
I would suggest you use a different approach: actors.
The actor model provides a relatively simple but powerful model for designing and implementing applications that can distribute and share work across all system resources—from threads and cores to clusters of servers and data centers. It provides an effective framework for building applications with high levels of concurrency and for increasing levels of resource efficiency. Importantly, the actor model also has well-defined ways for handling errors and failures gracefully, ensuring a level of resilience that isolates issues and prevents cascading failures and massive downtime.
You could turn to Akka, I think.
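A minimal sketch using Akka's classic Java API (assuming the akka-actor dependency; StatsActor and the message protocol are made up for illustration):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    // Each worker sends its timing as a message; the actor owns the mutable
    // state, so no explicit locking is needed.
    public class StatsActor extends AbstractActor {
        private long totalNanos;
        private long samples;

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Long.class, elapsed -> { totalNanos += elapsed; samples++; })
                    .matchEquals("average", msg -> getSender().tell(
                            samples == 0 ? 0L : totalNanos / samples, getSelf()))
                    .build();
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("stats");
            ActorRef stats = system.actorOf(Props.create(StatsActor.class), "statsActor");
            stats.tell(42_000L, ActorRef.noSender()); // a worker reporting one job's duration
        }
    }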
Given that we have an application that is heavily polluted with concurrency constructs: multiple techniques are used (different people worked on it without a clear architecture in mind), there are multiple questionable locks that are there "just in case", and thread-safe queues. CPU usage is around 20%.
Now my goal is to optimize it such that it is making better use of caches and generally improve its performance and service time.
I'm considering pinning the parent process to a single core, removing everything that causes memory barriers, replacing all thread-safe data structures, and replacing all locks with some UnsafeReentrantLock that would simply use a normal reference field but take care of exclusive-execution needs...
I expect that we would end up with a much more cache-friendly application, since we would not have rapid cache flushes all the time (no memory barriers). We would have less overhead, since we wouldn't need thread-safe data structures, volatiles, or atomics, and we would replace all sorts of locks. I would assume that service time would improve as well, since we would no longer synchronize on multiple thread-safe queues...
Is there something that I'm overlooking here?
Maybe blocking operations would need attention, since they would not show up in that 20% usage?
I have this statement, which came from Goetz's Java Concurrency In Practice:
Runtime overhead of threads due to context switching includes saving and restoring execution context, loss of locality, and CPU time spent scheduling threads instead of running them.
What is meant by "loss of locality"?
When a thread works, it often reads data from memory and from disk. The data is often stored in contiguous or close locations in memory/on the disk (for example, when iterating over an array, or when reading the fields of an object). The hardware bets on that by loading blocks of memory into fast caches so that access to contiguous/close memory locations is faster.
When you have a high number of threads and you switch between them, those caches often need to be flushed and reloaded, which makes the code of a thread take more time than if it was executed all at once, without having to switch to other threads and come back later.
A bit like we humans need some time to get back to a task after being interrupted, find where we were, what we were doing, etc.
Just to elaborate on the point about cache misses made by JB Nizet.
As a thread runs on a core, it keeps recently used data in the L1/L2 caches, which are local to the core. Modern processors typically read data from the L1/L2 caches in about 5-7 ns.
When a thread runs again after a pause (from being interrupted, put on a wait queue, etc.), it will most likely run on a different core. This means that the L1/L2 caches of this new core have no data related to the work the thread was doing. It now needs to go to main memory (which takes about 100 ns) to load data before proceeding with its work.
There are ways to mitigate this issue, such as pinning threads to a specific core using a thread-affinity library.
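For example, with the OpenHFT Java-Thread-Affinity library; a rough sketch, assuming AffinityLock.acquireLock() reserves a core for the calling thread as I recall it does:

    import net.openhft.affinity.AffinityLock;

    // Keep a hot worker on one core so its L1/L2 contents stay relevant
    // between runs instead of being rebuilt on another core.
    public class PinnedWorker implements Runnable {
        @Override
        public void run() {
            AffinityLock lock = AffinityLock.acquireLock(); // reserve a core for this thread
            try {
                // ... hot loop that benefits from staying on the same core ...
            } finally {
                lock.release();
            }
        }

        public static void main(String[] args) {
            new Thread(new PinnedWorker(), "pinned-worker").start();
        }
    }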
I'm in the middle of a problem where I am unable to decide which solution to take.
The problem is a bit unusual. Let's put it this way: I am receiving data from the network continuously (2 to 4 times per second). Each piece of data belongs to, let's say, a different group.
Let's call these groups group1, group2, and so on.
Each group has a dedicated job queue where data from the network is filtered and added to its corresponding group for processing.
At first I created a dedicated thread per group which would take data from the job queue, process it, and then block (using a LinkedBlockingQueue).
But my senior suggested that I should use thread pools, because that way threads won't sit blocked and will be usable by other groups for processing.
But here is the thing: the data arrives fast enough, and the time a thread takes to process it is long enough, that the thread may well never block at all. A dedicated thread also guarantees that data gets processed sequentially (job 1 gets done before job 2), which with pooling, however small the chances, might not happen.
My senior is also bent on the idea that pooling will save us lots of memory because threads are POOLED (I'm thinking he really went for the word ;) ). I don't agree with this because, I personally think, pooled or not, each thread gets its own stack memory. Unless there is something in thread pools which I am not aware of.
One last thing: I always thought that pooling helps where a large number of jobs appear for a short time. This makes sense because thread spawning would be a performance killer, since the time taken to initialize a thread is a lot more than the time spent doing the job. So pooling helps a lot there.
But in my case group1, group2, ..., groupN always remain alive. Whether there is data or not, they will still be there. So thread spawning is not the issue here.
My senior is not convinced and wants me to go with the pooling solution because its memory footprint is so much better.
So, which path to take?
Thank you.
Good question.
Pooling indeed saves you initialization time, as you said. But it has another aspect: resource management. And here I am asking you this: just how many groups (read: dedicated threads) do you have?
Do they grow dynamically during the lifetime of the application?
For example, consider a situation where the answer to this question is yes: new group types are added dynamically. In this case, you might not want to dedicate a thread to each one: since there is technically no restriction on the number of groups that will be created, you will create a lot of threads and the system will spend its time context switching instead of doing real work.
Thread pooling to the rescue: a thread pool allows you to specify a restriction on the maximum number of threads that can possibly be created, regardless of load. So the application may deny service to certain requests, but the ones that get through are handled properly, without critically depleting system resources.
Considering the above, it is very possible that in your case it is very much OK to have a dedicated thread for each group!
The same goes for your senior's conviction that it will save memory. Indeed, a thread takes up memory (mostly for its stack), but is it really so much if it is a predefined number of threads, say 5? Even 10 is probably OK. Anyway, you should not use pooling unless you are a priori and absolutely convinced that you actually have a problem!
Pooling is a design decision, not an architectural one. You can skip pooling at the beginning and move to it later as an optimization if, after encountering a performance issue, you find pooling to be beneficial.
As for the serialization of requests (in-order execution), it does not matter whether you are using a thread pool or a dedicated thread. Sequential execution is a property of the queue coupled with a single handler thread.
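To make that last point concrete, here is a rough sketch of one single-threaded executor per group, which preserves per-group ordering while still letting different groups run in parallel (GroupDispatcher and the group keys are made-up names):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Jobs within a group run in submission order on that group's single
    // thread; different groups still run concurrently.
    class GroupDispatcher {
        private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

        void submit(String group, Runnable job) {
            executors
                .computeIfAbsent(group, g -> Executors.newSingleThreadExecutor())
                .submit(job);
        }

        void shutdown() {
            executors.values().forEach(ExecutorService::shutdown);
        }
    }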
Creating a thread will consume resources, including the default stack per thread (IIRC 512 KB, but configurable). So the advantage of pooling is that you incur a limited resource hit. Of course, you need to size your pool according to the work that you have to perform.
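If you do go the pooled route, a bounded ThreadPoolExecutor is what gives you that limited resource hit; a rough sketch (all sizes here are placeholders, not recommendations):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedPool {
        public static void main(String[] args) {
            // At most 10 threads and 1000 queued jobs; anything beyond that is
            // rejected rather than silently exhausting memory.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    4, 10,                               // core and maximum pool size
                    60, TimeUnit.SECONDS,                // idle threads above core size time out
                    new ArrayBlockingQueue<>(1000),      // bounded work queue
                    new ThreadPoolExecutor.AbortPolicy() // reject when pool and queue are full
            );
            pool.submit(() -> System.out.println("handled"));
            pool.shutdown();
        }
    }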
For your particular problem, I think the key is to actually measure performance, thread usage, etc. in each scenario. Unless you're running into constraints, I perhaps wouldn't worry either way, other than to make sure that you can swap one implementation for another without a major impact on your application. Remember that premature optimisation is the root of all evil. Note that:
"Premature optimization" is a phrase used to describe a situation
where a programmer lets performance considerations affect the design
of a piece of code. This can result in a design that is not as clean
as it could have been or code that is incorrect, because the code is
complicated by the optimization and the programmer is distracted by
optimizing.