Java mechanical sympathy through thread pinning

Given we have an application that is heavily polluted with concurrency constructs:
multiple techniques are used (different people worked on it without a clear architecture in mind),
there are multiple questionable locks that are there "just in case", and thread-safe queues. CPU usage is around 20%.
My goal is to optimize it so that it makes better use of the caches and generally to improve its performance and service time.
I'm considering pinning the parent process to a single core, removing everything that causes membars,
replacing all thread-safe data structures with plain ones, and replacing all locks with some UnsafeReentrantLock
which would simply use a normal reference field but still take care of the exclusive-execution
needs...
I expect that we would end up with a much more cache-friendly application,
since we would no longer have rapid cache flushes all the time (no membars).
We would have less overhead since we wouldn't need thread-safe data structures,
volatiles, or atomics, and all sorts of locks would be replaced. I would assume that service time would improve as well,
since we no longer synchronize on multiple thread-safe queues...
Is there something that I'm overlooking here?
Maybe blocking operations would need some attention, since they would not show up in that 20% CPU usage?

Related

Shard resources by thread?

I have a (limited) thread pool which executes CPU-bound tasks. I'd like to aggregate some numerical statistics from each of these threads in a single place. Basically: each thread will update some shared stats (e.g. how long its job took) at a very high frequency and, at some much slower interval, a 'stat reader' would query those stats.
My first thought was to use some shared atomics and update them from each thread. This works ok, but in my testing the overhead of the atomics can get pretty high with a lot of contention so I was trying to think of some other alternatives.
My second thought was a sort of 'sharding' scheme, where each thread has its own stats object that it can update without requiring any synchronization. The 'stat reader' could then aggregate the stats from each thread into an overall stat value.
My first question is: does the thread sharding scheme make sense? Does something like that exist that I'm reinventing?
My second question is: if the sharding scheme does make sense, I'm trying to think of the best way to map threads to their shard:
1) Use the thread's ID mod some shard value to get a shard index, but I don't think that's reliable as I think the thread id value is shared, so I could get a collision.
2) Adding a thread-local index to the thread, but I don't think that will play nicely with the ExecutorService.
3) I could subclass Thread, but then I'd have to cast it when I wanted to access this which I'd rather avoid, if possible.
4) When the thread is created, create a mapping of its name to its shard. This would work, but there would be a race when creating the threads: one could be looking up its shard while we're adding a new shard to the map, causing concurrency issues.
Wondering if I'm way off-base here and overthinking it (seems like it would be a common problem?) or if one of these schemes does make sense for the use case.
One way to solve this is to use the LongAdder class that avoids the contention that plain old atomics suffer from.
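For illustration, a minimal sketch of what that could look like (the class and method names here are placeholders, not anything from your code):

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of the LongAdder approach: worker threads call recordJob() at high
// frequency; the (rare) reader calls averageNanos(). LongAdder stripes its state
// across cells internally, so contended updates don't all hit the same location.
public class JobStats {
    private final LongAdder totalNanos = new LongAdder();
    private final LongAdder jobCount = new LongAdder();

    // called by worker threads, very frequently
    public void recordJob(long elapsedNanos) {
        totalNanos.add(elapsedNanos);
        jobCount.increment();
    }

    // called by the stat reader, infrequently; sum() is not an atomic snapshot,
    // which is usually fine for monitoring-style statistics
    public double averageNanos() {
        long count = jobCount.sum();
        return count == 0 ? 0.0 : (double) totalNanos.sum() / count;
    }
}
```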
A more hand-written approach would be to create some class that holds the statistics you want to gather for each thread, and then have an array of these objects such that each thread's stats object is in array[thread.getId() % NUM_THREADS]. The reader thread can then traverse the array and gather the stats as it pleases.
The trick to getting this to work efficiently is to avoid false sharing. That is, threads on different cores perform updates on their respective objects but those objects happen to reside on the same cacheline, causing massive amounts of unnecessary cache coherence traffic.
In Java 8, there is the @Contended annotation that you might want to look into. The old way of padding your class with a bunch of long fields doesn't work anymore, since unused fields will be optimized away.
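A hedged sketch of the hand-written variant: it uses an explicit per-worker index instead of Thread.getId(), and the Java 8 sun.misc.Contended annotation, which only takes effect when the JVM is started with -XX:-RestrictContended (on later JDKs the annotation moved to jdk.internal.vm.annotation). All names are illustrative:

```java
import sun.misc.Contended;  // Java 8 location; assumption, see note above

// Hand-rolled sharding sketch: one stats slot per worker, indexed by a number the
// caller assigns. @Contended asks the JVM to pad each slot onto its own cache line,
// which avoids false sharing between workers on different cores.
public class ShardedStats {

    @Contended
    static final class Slot {
        volatile long totalNanos;
        volatile long jobs;
    }

    private final Slot[] slots;

    public ShardedStats(int numWorkers) {
        slots = new Slot[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            slots[i] = new Slot();
        }
    }

    // each worker writes only to its own slot, so no atomics or locks are needed
    public void record(int workerIndex, long elapsedNanos) {
        Slot s = slots[workerIndex];
        s.totalNanos += elapsedNanos;   // single writer per slot, so a plain read-modify-write is safe
        s.jobs++;
    }

    // the reader sums across all slots; the result is an approximate snapshot
    public long totalJobs() {
        long sum = 0;
        for (Slot s : slots) {
            sum += s.jobs;
        }
        return sum;
    }
}
```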
I would suggest you use a different approach: the actor model.
The actor model provides a relatively simple but powerful model for designing and implementing applications that can distribute and share work across all system resources—from threads and cores to clusters of servers and data centers. It provides an effective framework for building applications with high levels of concurrency and for increasing levels of resource efficiency. Importantly, the actor model also has well-defined ways for handling errors and failures gracefully, ensuring a level of resilience that isolates issues and prevents cascading failures and massive downtime.
You can turn to Akka, I think.
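If you want a feel for it, here is a rough sketch of a stats-aggregating actor using Akka's classic Java API; the message types and names are made up for illustration:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// Rough illustration of the actor idea: workers send timing messages to a single
// stats actor, which owns its state exclusively, so no locks or atomics are needed.
public class StatsActor extends AbstractActor {
    private long totalNanos;
    private long samples;

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(Long.class, nanos -> {          // a worker reports one job duration
                    totalNanos += nanos;
                    samples++;
                })
                .matchEquals("report", msg ->          // the reader asks for the aggregate
                        getSender().tell(samples == 0 ? 0L : totalNanos / samples, getSelf()))
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("stats-demo");
        ActorRef stats = system.actorOf(Props.create(StatsActor.class), "stats");
        stats.tell(1_000_000L, ActorRef.noSender());   // fire-and-forget update from a worker
        system.terminate();
    }
}
```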

Is using different thread pools for different types of tasks worth the overhead?

I'm designing a class that provides statistical information about groups of Collatz sequences. One of my goals is to be able to process a large number of sequences containing enormous terms (on the scale of hundreds or even thousands of digits) simultaneously, with maximum efficiency.
To this end, I plan on using the best data collection technique for each individual statistic, which means some tasks may be more efficiently dealt with by a ForkJoinPool, others by the standard cached and fixed thread pools provided in Executors. Would the overhead of creating multiple thread pools, or shutting one down and creating another, if I went that route, cost me more than I would save?
Would the overhead of creating multiple thread pools, or shutting one down and creating another, if I went that route, cost me more than I would save?
How could we possibly tell you that?
There is definitely an overhead in shutting down and restarting a thread pool of any kind. Creating threads is not cheap.
However, we have no way of quantifying how much you save by using different kinds of thread pool. If we can't quantify that, it is impossible to advise you on whether your strategy will work ... or not.
(But I think that repeatedly shutting down and recreating thread pools would be a bad idea. The performance impact of an idle pool is minimal.)
This "smells" of premature optimization. (It is like trying to tune the engine of a racing car before you have manufactured the engine block!)
My advice would be to (largely; see note 1 below) forget about performance to start with. For now, focus on getting something that works. Here's what I would do:
1) Implement the code using the easiest strategy, write test cases, and test / debug until it works.
2) Choose a sample problem, or set of problems, that is typical of the kind you will be trying to solve.
3) Implement a test harness that allows you to measure the code's performance for the sample problems; a sketch of a JMH-based harness follows this list. (Beware of the standard problems with Java benchmarking ...)
4) Benchmark your code.
5) Is it fast enough? Stop NOW. If not, continue.
6) Implement one of the alternative strategies, and test / debug.
7) Benchmark the modified code. Is it fast enough? Stop NOW. Is it clear that it doesn't help? Abandon it, and try another strategy. Can you tweak it? If so, try that.
8) Go to 5.
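For step 3, a JMH-based harness sidesteps most of the classic Java benchmarking pitfalls (dead-code elimination, missing warm-up, and so on). A minimal sketch, with a stand-in workload where your real strategy would go:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Skeleton JMH benchmark; replace the body of collatzStrategy() with a call into
// whatever strategy you want to measure (the method name is just a placeholder).
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
public class StrategyBenchmark {

    @Benchmark
    public long collatzStrategy() {
        // stand-in workload so the skeleton compiles and runs
        long steps = 0;
        for (long n = 27; n != 1; steps++) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        }
        return steps;   // returning the result prevents dead-code elimination
    }
}
```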
Also, it may be worthwhile implementing the different strategies in such a way that you can tune them or switch between them using command line or config file settings.
As a general rule, it is hard to determine a priori how well any complicated algorithm or strategy is going to perform. Generally speaking, there are too many factors to take into account for a theoretical ... or intuitive ... approach to give a reliable prediction. Benchmarking and tuning is the way to go.
Note 1: Obviously, if you know that some technique or algorithm will perform badly, and you have a better alternative that is about the same effort to implement ... do the sensible thing.
Since you are only talking about two different types of pools (fork-join and Executor-based pools), and you claim that at least some of your tasks are more suited to one type of pool or the other, it is entirely likely that the overhead of using two types of pools is worth it.
After all, you can just keep both types of pools alive, so there is only a one-time cost to setting up the pools and creating the threads, while the (apparent) benefit of the two pool types will apply across the entirety of your processing. Since you are doing an "enormous" amount of work, even small benefits will eventually add up and overwhelm the one-time costs (which are probably tiny on a per-thread basis).
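To make the one-time-cost point concrete, here is a sketch of keeping both pool types alive for the life of the application (the sizes chosen here are arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

// Both pools are created once and reused for the whole run; the idle one costs
// almost nothing while the other is busy.
public final class Pools {
    // for recursive, divide-and-conquer style tasks
    public static final ForkJoinPool FORK_JOIN = new ForkJoinPool();

    // for independent, coarse-grained tasks
    public static final ExecutorService FIXED =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    private Pools() {}

    // call once on application shutdown
    public static void shutdown() {
        FORK_JOIN.shutdown();
        FIXED.shutdown();
    }
}
```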
Key to this observation is that there is no real ongoing overhead for existing but inactive threads in the pool you aren't using.
Of course, that said, the short answer is "just try both approaches and measure it!".

Simulation thread and data writer thread parallelism

This is a general programming question. Let's say I have a thread doing a specific simulation, where speed is quite important. At every iteration I want to extract data from it and write it to a file.
Is it better practice to hand the data over to a different thread and let the simulation thread focus on its job, or, since speed is very important, to have the simulation thread do the data recording too, without any copying of data? (In my case it is 3-5 deques of integers with a size of 1000-10000.)
Firstly it surely depends on how much data we are copying, but what else can it depend on? Can the cost of synchronization and copying be worth it? Is it good practice to create small runnables at each iteration to handle the recording task in the case of 50 or more iterations per second?
If you truly want low latency on this stat capturing, and you want it during the simulation itself then two techniques come to mind. They can be used together very effectively. Please note that these two approaches are fairly far from the standard Java trodden path, so measure first and confirm that you need these techniques before abusing them; they can be difficult to implement correctly.
The fastest way to write the data to a file during a simulation, without slowing down the simulation, is to hand the work off to another thread. However, care has to be taken over how the hand-off occurs, as a memory barrier in the simulation thread will slow the simulation. Given that the writer only cares that the values will arrive eventually, I would consider using the memory barrier that sits behind AtomicLong.lazySet: it requests a thread-safe write out to a memory address without blocking until the write actually becomes visible to the other thread. Unfortunately, direct access to this memory barrier is currently only available via lazySet or via the class sun.misc.Unsafe, which obviously is not part of the public Java API. However, that should not be too large a hurdle, as it is present on all current JVM implementations and Doug Lea is talking about moving parts of it into the mainstream.
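A minimal sketch of that lazySet hand-off, assuming the simulation publishes a single running counter (adapt the shape of the data to your actual deques):

```java
import java.util.concurrent.atomic.AtomicLong;

// The simulation thread publishes progress with lazySet(), which issues a cheaper
// ordered store instead of the full fence that a plain volatile write / set() costs.
// The file-writer thread polls get() and will see the value "eventually".
public class ProgressPublisher {
    private final AtomicLong lastCompletedIteration = new AtomicLong();

    // hot path: called by the simulation thread every iteration
    public void publish(long iteration) {
        lastCompletedIteration.lazySet(iteration);
    }

    // cold path: called by the file-writer thread
    public long read() {
        return lastCompletedIteration.get();
    }
}
```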
To avoid the slow, blocking file IO that Java uses, make use of a memory-mapped file. This lets the OS perform async IO on your behalf and is very efficient. It also supports use of the same memory barrier mentioned above.
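And a sketch of the memory-mapped file side using java.nio; the fixed mapping size is an assumption, and real code would need to handle growing or rolling the mapping:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Writes go into a MappedByteBuffer; the OS flushes dirty pages to disk in the
// background, so the writing thread does not block on file I/O in the common case.
public class MappedWriter implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    public MappedWriter(Path file, long sizeBytes) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
    }

    // appends one fixed-size record; a real implementation must check remaining()
    public void writeRecord(long iteration, long value) {
        buffer.putLong(iteration);
        buffer.putLong(value);
    }

    @Override
    public void close() throws IOException {
        buffer.force();   // flush outstanding changes before closing
        channel.close();
    }
}
```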
For examples of both techniques, I strongly recommend reading the source code to HFT Chronicle by Peter Lawrey. In fact, HFT Chronicle may be just the library for you to use here. It offers a highly efficient and simple to use disk backed queue that can sustain a million or so messages per second.
In my work on a stress-testing HTTP client I stored the stats into an array and, when the array was ready to send to the GUI, I would create a new array for the tester client and hand off the full array to the network layer. This means that you don't need to pay for any copying, just for the allocation of a fresh array (an ultra-fast operation on the JVM, involving hand-coded assembler macros to utilize the best SIMD instructions available for the task).
I would also suggest not throwing yourself head-on into the realms of optimal memory-barrier usage; the difference between a plain volatile write and an AtomicReference.lazySet() is only measurable if your thread does almost nothing else but exercise the memory barrier (at least millions of writes per second). Depending on your target I/O throughput, you may not even need NIO to meet the goal. Better to try first with simple, easily maintainable code than to dig elbows-deep into highly specialized APIs without a confirmed need for that.

Downsides of structuring all multi-threading CSP-like

Disclaimer: I don't know much about the theoretical background of CSP.
Since I read about it, I tend to structure most of my multi-threading "CSP-like", meaning I have threads waiting for jobs on a BlockingQueue.
This works very well and simplified my thinking about threading a lot.
What are the downsides of this approach?
Can you think of situations where I'm performance-wise better off with a synchronized block?
...or Atomics?
If I have many threads mostly sleeping/waiting, is there some kind of performance impact, except the memory they use? For example during scheduling?
This is one possible way of designing the architecture of your code to prevent threading issues from ever happening; it is, however, not the only one, and sometimes not the best one.
First of all, you obviously need to have a series of tasks that can be split up and put into such a queue, which is not always the case, for example if you have to calculate the result of a single yet very demanding formula that just cannot be taken apart to utilize multi-threading.
Then there is the issue of the task at hand being so tiny that creating the task and adding it to the queue is already more expensive than the task itself. Example: you need to set a boolean flag on many objects to true. Splittable, but the operation itself is not complex enough to justify a new Runnable for each boolean.
You can of course come up with solutions to work around this sometimes; for example, the second example could be made reasonable for your approach by having each thread set 100 flags per execution (as sketched below), but then this is only a workaround.
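A sketch of that batching workaround; the batch size of 100 is just the figure from the example above, and SomeObject is a placeholder type:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

// Instead of one Runnable per flag, each Runnable flips a whole slice of the list,
// so the per-task overhead is amortised over many trivial operations.
public class FlagBatcher {
    private static final int BATCH_SIZE = 100;

    public static void setAll(List<SomeObject> objects, ExecutorService pool) {
        for (int start = 0; start < objects.size(); start += BATCH_SIZE) {
            final int from = start;
            final int to = Math.min(start + BATCH_SIZE, objects.size());
            pool.execute(() -> {
                for (int i = from; i < to; i++) {
                    objects.get(i).setFlag(true);
                }
            });
        }
    }

    // placeholder type standing in for the objects from the example
    static class SomeObject {
        private volatile boolean flag;
        void setFlag(boolean value) { this.flag = value; }
    }
}
```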
You should see those threading ideas for what they are: tools to help you solve your problem. The concurrency framework and the patterns built on it are altogether nothing but a big toolbox, but each time you have a task at hand, you need to select one tool out of that box, because in the end putting in a screw with a hammer is possible, but probably not the best solution.
My recommendation for getting more familiar with the tools is: each time you have a problem that involves threading, go through the tools, select the one you think fits best, then experiment with it until you are satisfied that this specific tool fits the specific task best. Prototyping is, after all, another tool in the box. ;)
What are the downsides of this approach?
Not many. A queue may require more overhead than an uncontended lock; a lock of some sort is required internally by the queue class to protect it from multiple access. Compared with the advantages of thread-pooling and queued comms in general, some extra overhead does not bother me much.
better off with a synchronized block?
Well, if you absolutely MUST share mutable data between threads :(
is there some kind of performance impact,
Not so much that anyone would notice. A not-ready thread is, effectively, an extra pointer entry in some container in the kernel (e.g. a queue belonging to a semaphore). Not worth bothering about.
You need synchronized blocks, Atomics, and volatiles whenever two or more threads access mutable data. Keep this to a minimum and it needn't affect your design. There are lots of Java API classes that can handle this for you, such as BlockingQueue.
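For reference, the BlockingQueue pattern the question describes is roughly this much code (a minimal sketch; the bounded queue size and single worker are arbitrary choices):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One bounded queue, one worker thread draining it; the queue is the only piece
// of shared mutable state, and it handles its own synchronization.
public class CspLikeWorker {
    public static void main(String[] args) {
        BlockingQueue<Runnable> jobs = new ArrayBlockingQueue<>(1024);

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    jobs.take().run();    // blocks while the queue is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // exit on interruption
            }
        });
        worker.start();

        // producer side: put() blocks if the queue is full, providing back-pressure
        try {
            jobs.put(() -> System.out.println("hello from the worker"));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```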
However, you could get into trouble if the nature of your problem/solution is perverse enough. If your threads try to read/modify the same data at the same time, you'll find that most of your threads are waiting for locks and most of your cores are doing nothing. To improve response time you'll have to let a lot more threads run, perhaps forgetting about the queue and letting them all go.
It becomes a trade-off. More threads chew up a lot of CPU time, which is okay if you've got it, and speed up response time. Fewer threads use less CPU time for a given amount of work (but what will you do with the savings?) and slow your response time.
Key point: In this case you need a lot more running threads than you have cores to keep all your cores busy.
This sort of programming (multithreaded as opposed to parallel) is difficult and prone to irreproducible bugs, so you want to avoid it if you can before you even start to think about performance. Plus, it only helps noticeably if you've got more than 2 free cores. And it's only needed for certain sorts of problems. But you did ask for downsides, and it might pay to know this is out there.

Thread Pool vs Many Individual Threads

I'm in the middle of a problem where I am unable to decide which solution to take.
The problem is a bit unique. Let's put it this way: I am receiving data from the network continuously (2 to 4 times per second). Each piece of data belongs to a different, let's say, group.
Let's call these groups group1, group2, and so on.
Each group has a dedicated job queue where data from the network is filtered and added to its corresponding group for processing.
At first I created a dedicated thread per group which would take data from its job queue, process it, and then block (using a LinkedBlockingQueue).
But my senior suggested that I should use thread pools, because that way threads won't get blocked and will be usable by other groups for processing.
But here is the thing: the data I'm getting arrives fast enough, and the time a thread takes to process it is long enough, that the thread may well never go into a blocked state. This also guarantees that data gets processed sequentially (job 1 gets done before job 2), which, with pooling, there is a small chance might not happen.
My senior is also bent on the fact that pooling will save us lots of memory, because threads are POOLED (I'm thinking he really went for the word ;) ). I don't agree with this because, I personally think, pooled or not, each thread gets its own stack memory. Unless there is something about thread pools that I am not aware of.
One last thing: I always thought that pooling helps where jobs appear in large numbers for a short time. This makes sense because thread spawning would be a performance killer, since the time taken to init a thread is a lot more than the time spent doing the job. So pooling helps a lot there.
But in my case group1, group2, ..., groupN always remain alive. So whether there is data or not, they will still be there. So thread spawning is not the issue here.
My senior is not convinced and wants me to go with the pooling solution because of its supposedly smaller memory footprint.
So, which path to take?
Thank you.
Good question.
Pooling indeed saves you initialization time, as you said. But it has another aspect: resource management. And here I am asking you this: just how many groups (read: dedicated threads) do you have?
Do they grow dynamically during the execution span of the application?
For example, consider a situation where the answer to this question is yes: new group types are added dynamically. In that case you might not want to dedicate a thread to each one, since there are technically no restrictions on the number of groups that will be created; you will create a lot of threads and the system will be context switching instead of doing real work.
Thread pooling to the rescue: a thread pool allows you to put a cap on the maximal number of threads that could possibly be created, regardless of load. So the application may deny service to certain requests, but the ones that get through are handled properly, without critically depleting the system's resources.
Considering the above, it is very possible that in your case it is very much OK to have a dedicated thread for each group!
The same goes for your senior's conviction that it will save memory. Indeed, a thread takes up memory (notably for its stack), but is it really that much if the number of threads is fixed and small, say 5? Even 10 is probably OK. Anyway, you should not use pooling unless you are a priori and absolutely convinced that you actually have a problem!
Pooling is a design decision, not an architectural one. You can go without pooling at the beginning and proceed with optimizations in case you find pooling to be beneficial after you encounter a performance issue.
As for the serialization of requests (in-order execution), it does not matter whether you are using a thread pool or a dedicated thread: sequential execution is a property of the queue coupled with a single handler thread, as the sketch below illustrates.
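A sketch of that point: one single-threaded executor per group gives the same in-order guarantee, whether you think of it as a dedicated thread or a pool of size one (the group keys are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each group gets its own single-threaded executor; tasks submitted for a group
// run one at a time, in submission order, while different groups run in parallel.
public class GroupDispatcher {
    private final Map<String, ExecutorService> executorsByGroup = new ConcurrentHashMap<>();

    public void dispatch(String group, Runnable job) {
        executorsByGroup
                .computeIfAbsent(group, g -> Executors.newSingleThreadExecutor())
                .execute(job);
    }

    public void shutdown() {
        executorsByGroup.values().forEach(ExecutorService::shutdown);
    }
}
```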
Creating a thread will consume resources, including its default stack (IIRC 512 KB, but configurable). So the advantage of pooling is that you incur a limited resource hit. Of course, you need to size your pool according to the work that you have to perform.
For your particular problem, I think the key is to actually measure performance, thread usage, etc. in each scenario. Unless you're running into constraints, I perhaps wouldn't worry either way, other than to make sure that you can swap one implementation for another without a major impact on your application. Remember that premature optimisation is the root of all evil. Note that:
"Premature optimization" is a phrase used to describe a situation
where a programmer lets performance considerations affect the design
of a piece of code. This can result in a design that is not as clean
as it could have been or code that is incorrect, because the code is
complicated by the optimization and the programmer is distracted by
optimizing.
