In one of my use cases I need to fetch the data from multiple nodes. Each node maintains a range (partition) of data. The goal is to read the data as fast as possible. Constraints are, cardinality of a partition is not known before hand. Using work sharing approach, I could split the partitions into sub-partitions and fetch the data in parallel. One drawback with this approach is, it is possible that one thread could fetch lot of data and take more time while the other thread could finish faster. The other approach is to use work stealing where we can break the partitions into much smaller ranges and use ForkJoinPool. The drawback with this approach is, if the partition is sparse, we could make many round trips to the server to realize there is not data for a sub-partition.
The question I've is, if I want to use ForkJoinPool, where the tasks can do some I/O operations, how do I do that? From the documentation of the FJ pool and from the best practices I read so far, it appears like FJ pool is not good for blocking IO operations. If I want to use non-blocking IO, how can I do that?
Related
For monitoring purposes I want to maintain a sliding window of operations and results. I have found a data structure to do this quickly but it is not thread-safe. I also don’t care if I drop some data because overall the picture should still be ok, so what’s the fastest way to do this in Java that can scale to hundreds of threads? Is a singleton semaphore with a time-out on tryAcquire the best way? Or would a single threaded executor service that discarded messages when the queue was full be faster
I have a (limited) thread pool which executes CPU-bound tasks. I'd like to aggregate some numerical statistics from each of these threads in a single place. Basically: each thread will update some shared stats (e.g. how long its job took) at a very high frequency and, at some much slower interval, a 'stat reader' would query those stats.
My first thought was to use some shared atomics and update them from each thread. This works ok, but in my testing the overhead of the atomics can get pretty high with a lot of contention so I was trying to think of some other alternatives.
My second though was a sort of 'sharding' scheme, where each thread had its own stats object that it could update without requiring any synchronization. The 'stat reader' could then aggregate the stats from each thread into an overall stat value.
My first question is: does the thread sharding scheme make sense? Does something like that exist that I'm reinventing?
My second question is: if the sharding scheme does make sense, I'm trying to think of the best way to map threads to their shard:
1) Use the thread's ID mod some shard value to get a shard index, but I don't think that's reliable as I think the thread id value is shared, so I could get a collision.
2) Adding a thread-local index to the thread, but I don't think that will play nicely with the ExecutorService.
3) I could subclass Thread, but then I'd have to cast it when I wanted to access this which I'd rather avoid, if possible.
4) When the thread is created, create a mapping of its name to its shard. This would work, but there would be a race when creating the threads: one could be looking up its shard while we're adding a new shard to the map, causing concurrency issues.
Wondering if I'm way off-base here and overthinking it (seems like it would be a common problem?) or if one of these schemes does make sense for the use case.
One way to solve this is to use the LongAdder class that avoids the contention that plain old atomics suffer from.
A more hand-written approach would be to create some class that holds the statistics you want to gather for each thread, and then have an array of these objects such that each thread's stats object is in array[thread.getId() % NUM_THREADS]. The reader thread can then traverse the array and gather the stats as it pleases.
The trick to getting this to work efficiently is to avoid false sharing. That is, threads on different cores perform updates on their respective objects but those objects happen to reside on the same cacheline, causing massive amounts of unnecessary cache coherence traffic.
In Java 8, there is the #Contended annotation that you might want to look into. The old way of padding your class with a bunch of long fields doesn't work anymore since unused fields will be optimized away.
I would suggest you use different way: Actor.
The actor model provides a relatively simple but powerful model for designing and implementing applications that can distribute and share work across all system resources—from threads and cores to clusters of servers and data centers. It provides an effective framework for building applications with high levels of concurrency and for increasing levels of resource efficiency. Importantly, the actor model also has well-defined ways for handling errors and failures gracefully, ensuring a level of resilience that isolates issues and prevents cascading failures and massive downtime.
You can turn to Akka i think.
In event processing a function puts values into a collection and another removes from the same collection. The items should be placed inside the collection in the order they received from the source (sockets) and read in the same way or else the results will change.
Queue is the collection most people recommend but at the same time, is the queue blocked when an item is being added and hence the other function has to wait until the adding is completed making it inefficient and the operational latency increases over time.
For example, one thread reads from a queue and another writes to the same queue. Either one operation performs at a time on queue until it releases a lock. Is there any data structure that avoids this.
ConcurrentLinkedQueue is one of the examples. Please see other classes from java.util.concurrent.
There are even more performant third party libraries for specific cases, e.g. LMAX Disruptor
In fact, the LinkedBlockingQueue is the easiest to use in many cases because of its blocking put and take methods, which wait until there's an item to take, or space for another item to insert in case an upper size limit named capacity has been activated. Setting a capacity is optional, and without one, the queue can grow indefinitely.
The ArrayBlockingQueue, on the other hand, is the most efficient and beautiful of them, it internally uses a ring buffer and therefore must have an fixed capacity. It is way faster than the LinkedBlockingQueue, yet far from the maximum throughput you can achieve with a disruptor :)
In both cases, blocking is purely optional on both sides. The non-blocking API of all concurrent queues is also supported. The blocking and non-blocking APIs can be mixed.
In many cases, the queue is not the bottleneck, and when it really is, using a disruptor is often the sensible thing to do. It is not a queue but a ring buffer shared between participating threads with different roles, i.e. typically one producer, n workers, and one consumer. A bit more cumbersome to set up but speeds around 100 million transactions per second are possible on modern hardware because it does not require expensive volatile variables but relies on more subtle ways of serialising reads and writes that are machine dependent (you basically need to write parts of such a thing in assembler) :)
I have a requirement where I need to read text file then transform it and write it to some other file. I wish to do this in parallel fashion like one thread for read, one for transform and another for write.
Now to share data between threads I need some channel, I was thinking to use BlockingQueue for this but would like to explore some other (better) alternatives if available.
Guava has a EventBus but not sure whether this is a good fit for the requirement. What other alternatives are available and which one is best from performance point of view.
Unless your transform step is really intensive, this is probably a waste of time.
Think of it this way. What are you asking for?
You're asking for something that
Takes an incoming stream of data
Copies it to another thread
Presents it to that thread as an incoming stream of data
What data structure best represents an incoming stream of data for step 3? (Hint: it's the InputStream you started with!)
What value do the first two steps add? The "transform" thread can read from disk just as fast as it could read from disk through another thread. Adding the thread inbetween does not speed up the disk read.
You would start to consider adding another thread when
Your problem can be usefully divided into independent pieces of work (say, each thread works on a chunk of text
The cost of splitting the problem into those pieces of work is significantly smaller than the overhead of adding an additional thread and coordinating between them (which is small, but not free!)
The problem requires more resources than a single CPU can provide (a thread gives you access to more CPU resources, but doesn't provide much value in terms of I/O throughput)
Our company is running a Java application (on a single CPU Windows server) to read data from a TCP/IP socket and check for specific criteria (using regular expressions) and if a match is found, then store the data in a MySQL database. The data is huge and is read at a rate of 800 records/second and about 70% of the records will be matching records, so there is a lot of database writes involved. The program is using a LinkedBlockingQueue to handle the data. The producer class just reads the record and puts it into the queue, and a consumer class removes from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using single CPU)? I am looking for suggestions on how to speed up (without changing hardware).
Any suggestions would be really appreciated. Thanks
Simple: Try it and see.
This is one of those questions where you argue several points on either side of the argument. But it sounds like you already have most of the infastructure set up. Just create another consumer thread and see if the helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.
Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing, and RAM helps a lot, but that is a different question.
It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.
First of all:
It is wise to create your application using the java 5 concurrent api
If your application is created around the ExecutorService it is fairly easy to change the number of threads used. For example: you could create a threadpool where the number of threads is specified by configuration. So if ever you want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as i know, it is not usefull (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, but make the actions in that thread as few as possible (for example read socket - put data in queue -read socket - etc).
- About the consuming of the queue: It is wise to construct this part as pointed out above, that way it is easy to change number of consuming threads.
- Note: you cannot really predict what is better, there might be another part that is the bottleneck, etcetera. Only monitor / profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different number of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling do define the bottlenecks, and use A-B testing to define the optimal numbers of consuming threads for your system
As an update on my earlier question:
We did run some comparison tests between single consumer thread and multiple threads (adding 5, 10, 15 and so on) and monitoring the que size of yet-to-be processed records. The difference was minimal and what more.. the que size was getting slightly bigger after the number of threads was crossing 25 (as compared to running 5 threads). Leads me to the conclusion that the overhead of maintaining the threads was more than the processing benefits got. Maybe this could be particular to our scenario but just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using the multiple-insert statement in mySQL instead of single inserts. If we did not have that to start with, we could not have handled this load.
End result: I am still not convinced on how multi-threading will give benefit on processing time. Maybe it has other benefits... but I am looking only from a processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.
In your scenario where a) the processing is minimal b) there is only one CPU c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front and the backend threads are I/O bound, with minimal processing int the middle. That's why you don't see much improvement.
What you can do is to try to have three stages: 1st is a single thread pulling data from the socket. 2nd is the thread pool that does processing. 3rd is a single threads that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporarily growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.