I googled and searched here for this question and did not find anything similar to what I'm looking for.
I populated a HashSet with a few Person objects, and I need to start four or five threads to search for these Person objects in a huge text; threads seem to be the best way to make better use of the hardware.
My doubt is: how can I split this HashSet and start 4 threads? I tried creating a list of new HashSets and starting a new thread for each of the 4 pieces.
It seems to be a good solution, but is there a better way to do it? How can I split the HashSet and send the pieces to 4 or 5 new threads?
Access to a HashSet is O(1), so splitting it across multiple threads won't make the lookups any faster. You are better off splitting the text file if the searching is expensive. However, if the search is efficient enough, one thread will be optimal.
It is worth remembering that using all the cores on your machine can make your program slower. If all you want is to use up all the CPU on your machine, you can create a thread pool which does nothing but burn CPU.
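For illustration, here is a minimal sketch of that idea: keep the HashSet shared and read-only, and split the text instead. It assumes the whole text is already in memory as a String and that Person has a getName() accessor (an assumption; the question does not show the class):

int nThreads = 4;
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
List<Future<List<Person>>> futures = new ArrayList<>();

// Split the text, not the set: every thread scans its own slice of the text
// against the same read-only HashSet<Person>. Slices overlap by a few
// characters so a name cannot be cut in half at a boundary.
int sliceLen = text.length() / nThreads + 1;
int overlap = 64; // longer than any name we expect
for (int i = 0; i < nThreads; i++) {
    int from = Math.min(text.length(), i * sliceLen);
    int to = Math.min(text.length(), from + sliceLen + overlap);
    String slice = text.substring(from, to);
    futures.add(pool.submit(() -> {
        List<Person> found = new ArrayList<>();
        for (Person p : people) {              // 'people' is the shared HashSet<Person>
            if (slice.contains(p.getName())) {
                found.add(p);
            }
        }
        return found;
    }));
}

try {
    for (Future<List<Person>> f : futures) {
        System.out.println(f.get());           // the Person objects found in each slice
    }
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}
pool.shutdown();

Sharing the set without locks is safe here only because no thread modifies it while the searches run.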
You can implement a producer-consumer scheme: have a single thread read the values from the hash set one by one and put them in a queue, which is then processed by several worker threads. You can use the ExecutorService class to manage the workers.
Edit: Here's what you can do:
Define your worker class:
public class Worker implements Runnable {
    private Person p;

    public Worker(Person p) {
        this.p = p;
    }

    public void run() {
        // search for p
    }
}
In the main thread:
ExecutorService s = Executors.newCachedThreadPool();
for (Person p : hashSet) {
    s.submit(new Worker(p));
}
A couple of things to consider:
1) You could use the same HashSet, but you would need to synchronize access to it (wrap the calls to it in a synchronized block). But if all you are doing is looking things up in the hash, being multi-threaded will not buy you much.
2) If you want to split the HashSet, you can consider a split on key ranges. For example, if you are searching by name, names that start with A-F go in HashSet1, G-L in HashSet2, etc. This way your searches can be completely parallel; see the sketch after this list.
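A rough sketch of option 2, assuming Person exposes a getName() accessor (not shown in the question):

int nThreads = 4;
// Partition the set by the first letter of the name, so each worker owns a
// disjoint slice and never needs to synchronize with the others.
List<Set<Person>> slices = new ArrayList<>();
for (int i = 0; i < nThreads; i++) {
    slices.add(new HashSet<>());
}
for (Person p : hashSet) {
    int bucket = Math.floorMod(Character.toUpperCase(p.getName().charAt(0)) - 'A', nThreads);
    slices.get(bucket).add(p);
}

ExecutorService pool = Executors.newFixedThreadPool(nThreads);
for (Set<Person> slice : slices) {
    pool.submit(() -> {
        for (Person p : slice) {
            // search the text for p; no locking needed, the slice is private to this task
        }
    });
}
pool.shutdown();

Bucketing by first letter keeps the split simple, but hashing the whole name would balance the slices better if many names share a first letter.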
You can iterate through the hash set using an Iterator, and while iterating fetch each value, create a thread for it and start it.
Alternatively,
you can use the ExecutorService API, where simultaneous tasks can be run in parallel.
I have a map which should associate Strings with an id. There must not be gaps between ids and they must be unique Integers from 0 to N.
Each request comes with two Strings, of which one, both or none may already have been indexed.
The map is built in parallel from the ForkJoin pool, and ideally I would like to avoid explicit synchronized blocks. I am looking for an optimal way to maximize throughput, with or without locking.
I don't see how to use an AtomicInteger without creating gaps in the sequence for keys which were already present in the map.
public class Foo {
    private final Map<String, Integer> idGenerator = new ConcurrentHashMap<>();

    // invoked from multiple threads
    public void update(String key1, String key2) {
        idGenerator.dosomething(key1, ?); // should save key1 and a unique id
        idGenerator.dosomething(key2, ?); // should save key2 and its unique id

        Bar bar = new Bar(idGenerator.get(key1), idGenerator.get(key2));
        // ... do something with bar
    }
}
I think that the size() method combined with merge() might solve the problem, but I cannot quite convince myself of that. Could anyone suggest an approach to this problem?
EDIT
Regarding the duplicate flag: this cannot be solved with AtomicInteger.incrementAndGet() as suggested in the linked answer. If I did that blindly for every String, there would be gaps in the sequence. There is a need for a compound operation which checks whether the key exists and only then generates an id.
I was looking for a way to implement such a compound operation via the Map API.
The second provided answer goes against requirements I have specifically laid out in the question.
There is no way to do it exactly the way you want it; ConcurrentHashMap is not, in and of itself, lock-free. However, you can do it atomically, without any explicit lock management, by using the java.util.Map.computeIfAbsent function.
Here's a code sample in the style of what you provided that should get you going.
ConcurrentHashMap<String, Integer> keyMap = new ConcurrentHashMap<>();
AtomicInteger sequence = new AtomicInteger();

public void update(String key1, String key2) {
    Integer id1 = keyMap.computeIfAbsent(key1, s -> sequence.getAndIncrement());
    Integer id2 = keyMap.computeIfAbsent(key2, s -> sequence.getAndIncrement());

    Bar bar = new Bar(id1, id2);
    // ... do something with bar
}
I'm not sure you can do exactly what you want. You can batch some updates, though, or do the checking separately from the enumerating / adding.
A lot of this answer assumes that order isn't important: you need all the strings given a number, but reordering even within a pair is OK, right? Concurrency could already reorder pairs, or cause the members of a pair not to get contiguous numbers, but it could also lead to the first of a pair getting a higher number than the second.
Latency is not that important. This application should chew through a large amount of data and eventually produce output. Most of the time there should be a search hit in the map.
If most searches hit, then we mostly need read throughput on the map.
A single writer thread might be sufficient.
So instead of adding directly to the main map, concurrent readers can check their inputs and, if they are not present, add them to a queue to be enumerated and added to the main ConcurrentHashMap. The queue could be a simple lockless queue, or it could be another ConcurrentHashMap used to filter duplicates out of the not-yet-added candidates, but a lockless queue is probably good enough.
Then you don't need an atomic counter, and you avoid the problem of two threads incrementing the counter twice when they both see the same string before either of them can add it to the map. (Because otherwise that's a big problem.)
If there's a way for a writer to lock the ConcurrentHashMap to make a batch of updates more efficient, that could be good. But if the hit rate is expected to be quite high, you really want other reader threads to keep filtering duplicates as much as possible while we're growing it instead of pausing that.
To reduce contention between the main front-end threads, you could have multiple queues, like maybe each thread has a single-producer / single-consumer queue, or a group of 4 threads running on a pair of physical cores shares one queue.
The enumerating thread reads from all of them.
In a queue where readers don't contend with writers, the enumerating thread has no contention. But multiple queues reduce contention between writers. (The threads writing these queues are the threads that access the main ConcurrentHashMap read-only, where most CPU time will be spent if hit-rates are high.)
Some kind of read-copy-update (RCU) data structure might be good, if Java has one. It would let readers keep filtering out duplicates at full speed while the enumerating thread constructs a new table with a batch of insertions applied, with zero contention while it's building the new table.
With a 90% hit rate, one writer thread could maybe keep up with 10 or so reader threads that filter new keys against the main table.
You might want to set some queue-size limit to allow for back-pressure from the single writer thread. Or, if you have many more cores / threads than a single writer can keep up with, then maybe some kind of concurrent set that lets the multiple threads eliminate duplicates before numbering would be helpful.
Or really, if you can just wait until the end to number everything, that would be a lot simpler, I think.
I thought about maybe trying to number with room for error on race conditions, and then going back to fix things up, but that probably isn't better.
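To make the single-writer idea concrete, here is a minimal sketch (the class and method names are mine, not from the question) that uses a ConcurrentLinkedQueue as the lockless queue and assumes exactly one dedicated thread calls drain():

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class Enumerator {
    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    private int next = 0; // only ever touched by the single writer thread

    // Called by many reader threads: read-only against the main map, misses go to the queue.
    public void offer(String key) {
        if (!ids.containsKey(key)) {
            pending.add(key); // duplicates are fine, the writer filters them out
        }
    }

    // Called in a loop by the single writer thread.
    public void drain() {
        String key;
        while ((key = pending.poll()) != null) {
            // Only this thread writes, so re-checking here keeps the sequence
            // gap-free even when two readers queued the same key.
            if (!ids.containsKey(key)) {
                ids.put(key, next++);
            }
        }
    }

    public Integer idOf(String key) {
        return ids.get(key);
    }
}

Because all writes funnel through one thread, next is never contended and never skips a value, which is exactly what the gap-free requirement needs.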
So I am currently creating a data analytics and predictive program, and for testing purposes I am simulating large amounts of data (in the range of 10,000 - 1,000,000 "trials"). The data is a simulated Match for a theoretical game. Each Match has rounds. The basic pseudocode for the program is this:
main() {
    data = create(100000);
    saveToFile(data);
}

Data create() {
    Data returnData = new Data(playTestMatch());
    return returnData;
}

Match playTestMatch() {
    List<Round> rounds = new ArrayList<Round>();
    while (!GameFinished) {
        rounds.add(playTestRound());
    }
    Match returnMatch = new Match(rounds);
    return returnMatch;
}

Round playTestRound() {
    // Do round stuff
}
Right now, I am wondering whether I can handle the simulation of these rounds over multiple threads to speed up the process. I am NOT familiar with the theory behind multithreading, so would someone please either help me accomplish this, OR explain to me why this won't work (won't speed up the process). Thanks!
If you are new to Java multi-threading, this explanation might seem a little difficult to understand at first but I'll try and make it seem as simple as possible.
Basically, whenever you have large datasets, running operations concurrently using multiple threads generally does speed up the process significantly compared to a single-threaded approach, but there are exceptions of course.
You need to think about three things:
Creating threads
Managing Threads
Communicating/sharing results computed by each thread with main thread
Creating Threads:
Threads can be created manually by extending the Thread class, or you can use the Executors class.
I would prefer the Executors class to create threads, as it allows you to create a thread pool and does the thread management for you. That is, it will re-use existing threads that are idle in the thread pool, thus reducing the memory footprint of the application.
You also have to look at the ExecutorService interface, as you will be using it to execute your tasks.
Managing threads:
Executors / ExecutorService does a great job of managing threads automatically, so if you use it you don't have to worry much about thread management.
Communication: This is the key part of the entire process. Here you have to think carefully about the thread safety of your app.
I would recommend using two queues to do the job: a read queue that the workers take data from, and a write queue that they push results to (a sketch of this follows after the synchronized example below).
But if you are using a simple ArrayList, make sure you synchronize your code for thread safety by enclosing the ArrayList in a synchronized block:
synchronized (arrayList) {
    // do stuff
}
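A minimal sketch of that two-queue idea, reusing the Match and playTestMatch() names from the question's pseudocode and letting BlockingQueue do the locking:

BlockingQueue<Integer> jobs = new ArrayBlockingQueue<>(10_000);  // trial numbers still to run
BlockingQueue<Match> results = new LinkedBlockingQueue<>();      // finished matches for the main thread

Runnable worker = () -> {
    try {
        while (true) {
            Integer trial = jobs.take();       // blocks until a trial number is available (just a work token here)
            results.put(playTestMatch());      // hand the finished Match back
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();    // exit cleanly when the pool shuts down
    }
};

The main thread fills jobs with trial numbers, starts a few threads running worker, and then drains results; the only shared objects are the two queues, which are already thread-safe.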
If your code is thread-safe and you can split the task into discrete chunks that do not rely on each other then it is relatively easy. Make the class that does the work Callable and add the chunks of work to a List, and then use ExecutorService, like this:
ArrayList<Simulation> SL = new ArrayList<Simulation>();
for (int i = 0; i < chunks; i++)
    SL.add(new Simulation(i));

ExecutorService executor = Executors.newFixedThreadPool(nthreads); // how many threads
try {
    List<Future<Result>> results = executor.invokeAll(SL);
    for (Future<Result> result : results)
        result.get().print();   // get() rethrows anything a simulation threw
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}
executor.shutdown();
So, Simulation is Callable and returns a Result; results is a List which gets filled when executor.invokeAll is called with the ArrayList of simulations. Once you've got your results you can print them or do whatever you need. It's probably best to set nthreads equal to the number of cores you have available.
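For completeness, a hedged sketch of what Simulation might look like; the Result constructor used here is an assumption, not something given in the answer:

import java.util.concurrent.Callable;

public class Simulation implements Callable<Result> {
    private final int chunkId;

    public Simulation(int chunkId) {
        this.chunkId = chunkId;
    }

    @Override
    public Result call() {
        // Play this chunk's share of the matches; each chunk works only on its
        // own data, which is what makes invokeAll(...) safe without extra locking.
        return new Result(chunkId);
    }
}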
I have a data store that is written to by multiple message listeners. Each of these message listeners can also be in the hundreds of individual threads.
The data store is a PriorityBlockingQueue, as it needs to order the inserted objects by timestamp. To make checking for items efficient, rather than looping over the queue, a ConcurrentHashMap is used as a form of index.
private Map<String, SLAData> SLADataIndex = new ConcurrentHashMap<String, SLAData>();
private BlockingQueue<SLAData> SLADataQueue;
Question 1: Is this an acceptable design, or should I just use the single PriorityBlockingQueue?
Each message listener performs an operation; these listeners are scaled up to multiple threads.
Insert method, which inserts into both:
this.SLADataIndex.put(dataToWrite.getMessageId(), dataToWrite);
this.SLADataQueue.add(dataToWrite);
Update Method
this.SLADataIndex.get(messageId).setNodeId(updatedNodeId);
Delete Method
SLAData data = this.SLADataIndex.get(messageId);
//remove is O(log n)
this.SLADataQueue.remove(data);
// remove from index
this.SLADataIndex.remove(messageId);
Question 2: Using these methods, is this the most efficient way? They are wrapped by another object for error handling.
Question 3: Using a ConcurrentHashMap and a BlockingQueue, does this mean these operations are thread-safe? Do I still need a lock object?
Question 4: When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?
Question 1: Is this an acceptable design, or should I just use the single PriorityBlockingQueue?
Certainly you should try to use a single Queue. Keeping the two collections in sync is going to require a lot more synchronization complexity and worry in your code.
Why do you need the Map? If it is just to call setNodeId(...) then I would have the processing thread do that itself when it pulls from the Queue.
// processing thread
while (!Thread.currentThread().isInterrupted()) {
    dataToWrite = queue.take();
    dataToWrite.setNodeId(myNodeId);
    // process data
    ...
}
Question 2: Using these methods, is this the most efficient way? They are wrapped by another object for error handling.
Sure, that seems fine but, again, you will need to do some synchronization locking otherwise you will suffer from race conditions keeping the 2 collections in sync.
Question 3: Using a ConcurrentHashMap and a BlockingQueue, does this mean these operations are thread-safe? Do I still need a lock object?
Both of those classes (ConcurrentHashMap and the BlockingQueue implementations) are thread-safe, yes. BUT since there are two of them, you can have race conditions where one collection has been updated but the other one has not. Most likely, you will have to use a lock object to ensure that both collections are properly kept in sync.
Question 4: When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?
That's a tough question to answer without seeing the code in question. For example, someone might call Insert(...) and have added the item to the Map but not to the queue yet, when another thread calls Delete(...); the item would be found in the Map and removed, but queue.remove() would not find it in the queue since the Insert(...) has not finished in the other thread.
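For example, one way to close that gap is a single lock object guarding every compound operation on the pair of collections; a sketch, reusing the field names from the question:

private final Object lock = new Object();

public void insert(SLAData dataToWrite) {
    synchronized (lock) {
        SLADataIndex.put(dataToWrite.getMessageId(), dataToWrite);
        SLADataQueue.add(dataToWrite);
    }
}

public void delete(String messageId) {
    synchronized (lock) {
        SLAData data = SLADataIndex.remove(messageId);
        if (data != null) {            // only touch the queue if the insert fully completed
            SLADataQueue.remove(data);
        }
    }
}

With the lock, Insert and Delete can no longer interleave halfway, at the cost of serializing access to the pair of collections.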
Can I ask you to explain to me how threads and synchronisation work in Java?
I want to write a high-performance application. In this application, I read data from files into some nested classes, which are basically a thin wrapper around a HashMap.
After the data reading is finished, I start threads which need to go through the data and perform different checks on it. However, threads never change the data!
If I can guarantee (or at least try to guarantee;) that my threads never change the data, can I use them calling non-synchronised methods of objects containing data?
If multiple threads access a non-synchronised method which does not change any class field, but which has some local variables of its own, is it safe?
artificial example:
public class Data {
    // this hash map is filled before I start threads
    protected Map<Integer, Spike> allSpikes = new HashMap<Integer, Spike>();

    public Map<Integer, Spike> returnBigSpikes() {
        Map<Integer, Spike> bigSpikes = new HashMap<Integer, Spike>();
        for (Integer i : allSpikes.keySet()) {
            if (allSpikes.get(i).spikeSize > 100) {
                bigSpikes.put(i, allSpikes.get(i));
            }
        }
        return bigSpikes;
    }
}
Is it safe to call a NON-synchronised method returnBigSpikes() from threads?
I understand now that such use-cases are potentially very dangerous, because it's hard to guarantee that the data (e.g., the returned bigSpikes) will not be modified. But I have already implemented and tested it like this, and I want to know if I can use the results of my application now and change the architecture later...
What happens if I make the methods synchronised? Will the application be slowed down to single-CPU performance? If so, how can I design it correctly and keep the performance?
(I read about 20-40 GB of data (log messages) into main memory and then run threads which need to go through all the data to find correlations in it; each thread gets only a part of the messages to analyse, but for the analysis each thread has to compare every message from its part with many other messages from the data; that's why I first decided to let the threads read the data without synchronisation.)
Thank you very much in advance.
If allSpikes is populated before all the threads start, you could make sure it isn't changed later by saving it as an unmodifiable map.
Assuming Spike is immutable, your method would then be perfectly safe to use concurrently.
In general, if you have a bunch of threads where you can guarantee that only one thread will modify a resource and the rest will only read it, then access to that resource doesn't need to be synchronised. In your example, each time returnBigSpikes() is invoked it creates a new local bigSpikes HashMap, so although you're creating a HashMap, it is unique to each call, so there are no synchronisation problems there.
As long as everything is effectively immutable (e.g. using the final keyword) and you use an unmodifiableMap, everything is fine.
I would suggest the following UnmodifiableData:
public class UnmodifiableData {
    final Map<Integer, Spike> bigSpikes;

    public UnmodifiableData(Map<Integer, Spike> bigSpikes) {
        this.bigSpikes = Collections.unmodifiableMap(new HashMap<>(bigSpikes));
    }

    ....
}
Your plan should work fine. You do not need to synchronize reads, only writes.
If, however, in the future you wish to cache bigSpikes so that all threads get the same map then you need to be more careful about synchronisation.
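If you do go that route later, one safe way to cache it is double-checked locking on a volatile field inside the Data class (a sketch, assuming Spike objects themselves are never modified):

private volatile Map<Integer, Spike> bigSpikesCache;

public Map<Integer, Spike> getBigSpikes() {
    Map<Integer, Spike> cached = bigSpikesCache;
    if (cached == null) {
        synchronized (this) {
            if (bigSpikesCache == null) {
                bigSpikesCache = Collections.unmodifiableMap(returnBigSpikes());
            }
            cached = bigSpikesCache;
        }
    }
    return cached;
}

The volatile read makes the published map visible to every thread, and the unmodifiable wrapper stops callers from mutating the shared cache.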
If you use a ConcurrentHashMap, it will do all the synchronization work for you. That is better than putting synchronization around an ordinary HashMap.
Since allSpikes is initialized before you start threads it's safe. Concurrency problems appear only when a thread writes to a resource and others read from it.
I would like to simulate a distributed system in which I need to search for information (supplies) in a distributed (parallel, if I can!) way. For example, I have the following class:
public class Group {
    public int identifier;
    public int[] members;
    public String name;
    public String[] supplies;
    public int[] neighbors;
}
There are many groups; each one has a name and consists of a list of members, neighbors and supplies. Each member has some information and a list of other groups that may contain pertinent information and supplies, and so on.
1- I want to search for supplies, first inside one group. If I do not find the required supply there, I should search inside all the groups which are neighbors of this group. I am thinking of doing this with multi-threading: if the first search fails, I search inside all the neighbors at the same time using multiple threads, each one taking care of one neighbor. If I have 10 neighbors, then 10 threads should be created...
2- Now, I also want to be able to start the search with several groups at the same time, say 3 or 4 groups or more, each one looking for a different supply, or the same one...
+ a group which starts a search could itself be a neighbor of another group...
So, my question is: how can I achieve this scenario using threads?
PS: I have a machine with a single processor with one core, and I do not care about execution time (the overhead); all I want is to simulate this problem...
Thanks for every response, and best regards.
Since you have a CPU-bound problem, the optimal number of threads to use is likely to be the number of cores you have. I would ensure each thread has at least about 100 microseconds of work, or you could find you have more overhead than useful work; e.g. you might find that searching 10K nodes is about 100 us of work. If you are not careful, a multi-threaded application can be many times slower than a single-threaded one.
So I would find a way to divide up the work so that each thread gets about 1K to 100K nodes, and limit your concurrency to the number of cores you have.
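As a sketch of that division of work, assuming the things to search are collected in a List<Node> and searchChunk(...) is a hypothetical method that does the 1K-100K nodes of work:

int cores = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(cores);
int chunkSize = Math.max(1_000, nodes.size() / cores);   // keep each task worth the hand-off cost

List<Future<?>> futures = new ArrayList<>();
for (int from = 0; from < nodes.size(); from += chunkSize) {
    List<Node> chunk = nodes.subList(from, Math.min(nodes.size(), from + chunkSize));
    futures.add(pool.submit(() -> searchChunk(chunk)));
}
try {
    for (Future<?> f : futures) {
        f.get();   // wait for every chunk to finish
    }
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}
pool.shutdown();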
I could not understand the second requirement, but for the first, here is a possible approach. Before that though, technically your process is not completely CPU-bound; it is also I/O-bound (network). So please don't assume that making it multi-threaded will provide the speedup you are looking for. I am assuming that your development environment is uni-processor and single-core, but your deployment environment may not be.
Back to the suggestion. I would create a GroupPool class that has a pool of threads that can go scout for information. The number of threads will be configurable via a runtime config parameter. You can create a factory class which reads this parameter from a config file and creates a pool of runnable objects.
Each of these objects represents one connection to a neighboring node. You did not mention whether you'd like to recurse on the supplier nodes, i.e. if you don't find the information in a supplier node, do you want to search that supplier's suppliers, and so on? If so, you will have the problem of cycle detection. Once these thread objects scout for information and find it, they update a semaphore on the factory object (you might want to move this to a separate object, as that would be a better design) and also send the supplier id (see, a separate object does make sense).
You can have a listener on this semaphore, and as soon as the value changes you know you have found your information and can get the supplier id from that object. Once you have your information, you can tell the thread pool to shut down the runnable objects, since you have already found what you were looking for.
Based on whether you are looking for a binary answer (find data and any supplier is ok) and if you want to recurse, the complexity of the above will increase.
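A hedged sketch of that scout-pool idea; getNeighborGroups() and hasSupply() are hypothetical accessors (the question's Group only stores ids), and invokeAny(...) plays the role of the semaphore by returning the first successful result and cancelling the remaining scouts:

public Group searchNeighbors(Group start, String supply, int poolSize) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
        List<Callable<Group>> scouts = new ArrayList<>();
        for (Group neighbor : start.getNeighborGroups()) {      // hypothetical: resolves neighbor ids to Groups
            scouts.add(() -> {
                if (neighbor.hasSupply(supply)) {               // hypothetical check against supplies[]
                    return neighbor;
                }
                throw new NoSuchElementException("not in " + neighbor.name);
            });
        }
        return pool.invokeAny(scouts);   // the first scout that finds the supply wins
    } finally {
        pool.shutdown();
    }
}

If every scout fails, invokeAny throws an ExecutionException, which corresponds to "supply not found among the neighbors".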
Hope this helps you in designing a structure for your problem.
I don't see any performance advantage to multi-threading this on a single-CPU machine, because only one thread can run at a time and there is context-switching overhead between threads, so it will probably take more time to find a group with the desired resource.
Personally, I'd just iterate through the first group's neighbors and check them for resources. Then, if the resources were not found, I'd call the search on each of the sub-groups, passing in the list of groups that were already checked, so it can skip groups that have already been checked. Something like:
public Group searchForGroupWithResource(Resource resource) {
    List<Group> groupsToCheck = new ArrayList<Group>();
    groupsToCheck.add(this);
    int currentIndex = 0;
    while (currentIndex < groupsToCheck.size()) {
        Group currentGroup = groupsToCheck.get(currentIndex);
        if (currentGroup.hasResource(resource)) {
            return currentGroup;
        }
        groupsToCheck.addAll(currentGroup.getNeighbors(groupsToCheck));
        currentIndex++;
    }
    return null;
}

public List<Group> getNeighbors(List<Group> excludeGroups) {
    // Get non-excluded neighbors
    List<Group> returnNeighbors = new ArrayList<Group>();
    for (Group neighbor : neighbors) {
        boolean includeGroup = true;
        for (Group excludeGroup : excludeGroups) {
            if (excludeGroup.equals(neighbor)) {
                includeGroup = false;
                break;
            }
        }
        if (includeGroup) {
            returnNeighbors.add(neighbor);
        }
    }
    return returnNeighbors;
}
Note: If you still decide to go with multi-threading, I would suggest a common object that stores information about the search and is accessible to all threads. It would record which Groups have already been checked (so you don't check the same group twice) and whether the required supplies have been found (so you can stop checking). A sketch of such an object follows below.
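A sketch of such a shared object (the class and method names are mine); search threads call markChecked(...) before visiting a group and poll searchIsOver() so they can stop early:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class SearchState {
    // one thread-safe "already checked" set shared by every search thread
    private final Set<Group> checked = ConcurrentHashMap.newKeySet();
    // the group found to hold the required supplies, if any
    private final AtomicReference<Group> found = new AtomicReference<>();

    // returns true only for the first thread to claim this group
    public boolean markChecked(Group g) {
        return checked.add(g);
    }

    public boolean searchIsOver() {
        return found.get() != null;
    }

    public void reportFound(Group g) {
        found.compareAndSet(null, g);   // the first reporter wins, later ones are ignored
    }

    public Group result() {
        return found.get();
    }
}

checked.add(...) relies on Group having sensible equals()/hashCode(), just like the excludeGroup.equals(neighbor) check in the code above.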