I have a number of threads that will be consuming messages from a broker and processing them. Each message is XML containing, amongst other elements, an alpha-numeric <itemId>WI354DE48</itemId> element that serves as a unique ID for the item to "process". Due to criteria I can't control or change, it is possible for items/messages to be duplicated on the broker queue that these threads are consuming from. So the same item (with an ID of WI354DE48) might be sent to the queue only once, or it might be sent 100 times. Regardless, I can only allow the item to be processed once, so I need a way to prevent Thread A from processing a duplicated item that Thread B has already processed.
I'm looking to use a simple thread-safe list that can be shared by all threads (workers) to act as a cache mechanism. Each thread will be given the same instance of a List<String>. When each worker thread consumes a message, it checks to see if the itemId (a String) exists on the list. If it doesn't, then no other worker has processed the item. In this case, the itemId is added to the list (locking/caching it), and then the item is processed. If the itemId does already exist on the list, then another worker has already processed the item, so we can ignore it. Simple, yet effective.
It's obviously then paramount to have a thread-safe list implementation. Note that the only two methods we will ever be calling on this list will be:
List#contains(String) - traversing/searching the list
List#add(String) - mutating the list
...and it's important to note that we will be calling both methods with about the same frequency. Only rarely will contains() return true and prevent us from needing to add the ID.
I first thought that CopyOnWriteArrayList was my best bet, but after reading the Javadocs, it seems like each worker would just wind up with its own thread-local copy of the list, which isn't what I want. I then looked into Collections.synchronizedList(new ArrayList<String>()), and that seems to be a decent bet:
List<String> processingCache = Collections.synchronizedList(new ArrayList<String>());
List<Worker> workers = getWorkers(processingCache); // Inject the same list into all workers.
for (Worker worker : workers) {
    executor.submit(worker);
}
// Inside each Worker's run method:
@Override
public void run() {
    String itemXML = consumeItemFromBroker();
    Item item = toItem(itemXML);
    if (processingCache.contains(item.getId())) {
        return;
    }
    processingCache.add(item.getId());
    // ... continue processing.
}
Am I on track with Collections.synchronizedList(new ArrayList<String>()), or am I way off base? Is there a more efficient thread-safe List impl given my use case, and if so, why?
Collections.synchronizedList is very basic: it simply wraps every method of the underlying list in a synchronized block.
This will work but only under some specific assumptions, namely that you never carry out multiple accesses to the List, i.e.
if(!list.contains(x))
list.add(x);
is not thread-safe, as the monitor is released between the two calls.
It can also be somewhat slow if you have many reads and few writes as all threads acquire an exclusive lock.
You can look at the implementations in the java.util.concurrent package, there are several options.
I would recommend using a ConcurrentHashMap with dummy values.
The reason for the recommendation is that the ConcurrentHashMap has synchronized key groups so if you have a good hashing algorithm (and String does) you can actually get a massive amount of concurrent throughput.
I would prefer this over a ConcurrentSkipListSet because a hash map doesn't guarantee ordering, so you avoid the overhead of maintaining it.
Of course with threading it's never entirely obvious where the bottlenecks are so I would suggest trying both and seeing which gives you better performance.
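To make the check-and-claim atomic (avoiding the contains/add race described above), one option is a Set view backed by ConcurrentHashMap via ConcurrentHashMap.newKeySet() (available since Java 8). Set.add returns false when the element is already present, so a single call both checks and claims the ID. A minimal sketch (the DedupCache class and claim method names are mine):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupCache {
    // ConcurrentHashMap-backed set; add() is atomic, so the
    // "check and claim" happens in a single call with no race window.
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

    /** Returns true if this caller claimed the id (i.e. was first to see it). */
    public boolean claim(String itemId) {
        return seenIds.add(itemId); // false if another worker already added it
    }

    public static void main(String[] args) {
        DedupCache cache = new DedupCache();
        System.out.println(cache.claim("WI354DE48")); // true: first claim wins
        System.out.println(cache.claim("WI354DE48")); // false: duplicate ignored
    }
}
```

With this, a worker that gets false from claim() simply skips the message; no separate contains() call is needed.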
Related
I have a map which should associate Strings with an id. There must not be gaps between ids and they must be unique Integers from 0 to N.
Request always comes with two Strings of which one, both or none may have been already indexed.
The map is built in parallel from the ForkJoin pool, and ideally I would like to avoid explicit synchronized blocks. I am looking for an optimal way to maximize throughput, with or without locking.
I don't see how to use AtomicInteger without creating gaps in sequence for the keys which were already present in the map.
public class Foo {
private final Map<String, Integer> idGenerator = new ConcurrentHashMap<>();
// invoked from multiple threads
public void update(String key1, String key2) {
idGenerator.dosomething(key1, ?) // should save key1 and a unique id
idGenerator.dosomething(key2, ?) // should save key2 and its unique id
Bar bar = new Bar(idGenerator.get(key1), idGenerator.get(key2));
// ... do something with bar
}
}
I think that the size() method combined with merge() might solve the problem, but I cannot quite convince myself of that. Could anyone suggest an approach for this problem?
EDIT
Regarding the duplicate flag, this cannot be solved with AtomicInteger.incrementAndGet() as suggested in the linked answer. If I did this blindly for every String, there would be gaps in the sequence. There is a need for a compound operation which checks if the key exists and only then generates an id.
I was looking for a way to implement such compound operation via Map API.
The second provided answer goes against requirements i have specifically laid out in the question.
There is not a way to do it exactly the way you want it -- ConcurrentHashMap is not, in and of itself, lock-free. However, you can do it atomically without having to do any explicit lock management by using the java.util.Map.computeIfAbsent function.
Here's a code sample in the style of what you provided that should get you going.
private final ConcurrentHashMap<String, Integer> keyMap = new ConcurrentHashMap<>();
private final AtomicInteger sequence = new AtomicInteger();

public void update(String key1, String key2) {
    Integer id1 = keyMap.computeIfAbsent(key1, s -> sequence.getAndIncrement());
    Integer id2 = keyMap.computeIfAbsent(key2, s -> sequence.getAndIncrement());
    Bar bar = new Bar(id1, id2);
    // ... do something with bar
}
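As a self-contained illustration (the IdAssigner class and idFor method names are mine), the mapping function passed to computeIfAbsent runs at most once per absent key, so the counter only advances for genuinely new keys and the assigned ids stay gapless:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdAssigner {
    private final Map<String, Integer> keyMap = new ConcurrentHashMap<>();
    private final AtomicInteger sequence = new AtomicInteger();

    // computeIfAbsent invokes the mapping function only when the key is
    // absent, and atomically installs the result: no gaps, no duplicates.
    public int idFor(String key) {
        return keyMap.computeIfAbsent(key, s -> sequence.getAndIncrement());
    }

    public static void main(String[] args) {
        IdAssigner a = new IdAssigner();
        System.out.println(a.idFor("foo")); // 0
        System.out.println(a.idFor("bar")); // 1
        System.out.println(a.idFor("foo")); // 0 again: existing key, counter untouched
    }
}
```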
I'm not sure you can do exactly what you want. You can batch some updates, though, or do the checking separately from the enumerating / adding.
A lot of this answer is assuming that order isn't important: you need all the strings given a number, but reordering even within a pair is ok, right? Concurrency could already cause reordering of pairs, or for members of a pair not to get contiguous numbers, but reordering could lead to the first of a pair getting a higher number.
latency is not that important. This application should chew through a large amount of data and eventually produce output. Most of the time there should be a search hit in the map
If most searches hit, then we mostly need read throughput on the map.
A single writer thread might be sufficient.
So instead of adding directly to the main map, concurrent readers can check their inputs, and if not present, add them to a queue to be enumerated and added to the main ConcurrentHashMap. The queue could be a simple lockless queue, or could be another ConcurrentHashMap to also filter duplicates out of not-yet-added candidates. But probably a lockless queue is good.
Then you don't need an atomic counter, and you avoid the problem of two threads incrementing the counter twice when they both see the same string before either of them can add it to the map. (Because otherwise that's a big problem.)
If there's a way for a writer to lock the ConcurrentHashMap to make a batch of updates more efficient, that could be good. But if the hit rate is expected to be quite high, you really want other reader threads to keep filtering duplicates as much as possible while we're growing it instead of pausing that.
To reduce contention between the main front-end threads, you could have multiple queues, like maybe each thread has a single-producer / single-consumer queue, or a group of 4 threads running on a pair of physical cores shares one queue.
The enumerating thread reads from all of them.
In a queue where readers don't contend with writers, the enumerating thread has no contention. But multiple queues reduce contention between writers. (The threads writing these queues are the threads that access the main ConcurrentHashMap read-only, where most CPU time will be spent if hit-rates are high.)
Some kind of read-copy-update (RCU) data structure might be good, if Java has that. It would let readers keep filtering out duplicates at full speed, while the enumerating thread constructs a new table with a batch of insertions done, with zero contention while it's building the new table.
With a 90% hit rate, one writer thread could maybe keep up with 10 or so reader threads that filter new keys against the main table.
You might want to set some queue-size limit to allow for back-pressure from the single writer thread. Or if you have many more cores / threads than a single writer can keep up with, then maybe some kind of concurrent set to let the multiple threads eliminate duplicates before numbering would help.
Or really, if you can just wait until the end to number everything, that would be a lot simpler, I think.
I thought about maybe trying to number with room for error on race conditions, and then going back to fix things up, but that probably isn't better.
I need a concurrent list that is thread safe and at the same time is best for iteration and should return exact size.
I want to store auction bids for an item, so I need to be able to:
retrieve the exact number of bids for an item
add a bid to an item
retrieve all the bids for a given item
remove a bid for an item
I am planning to have it in a
ConcurrentHashMap<Item, LinkedList<ItemBid>> -- LinkedList is not thread safe but returns exact size
ConcurrentHashMap<Item, ConcurrentLinkedQueue<ItemBid>> - ConcurrentLinkedQueue is thread safe but does not guarantee to return the exact size
Is there any other better collection that will address the above 4 points and is thread safe.
Well, arguably, in a thread-safe collection or map you cannot guarantee the "consistency" of the size: the happens-before relationship between read and write operations will not give you what your use case wants, namely a read of the size that reflects the exact state after the last write operation (N.B.: improved based on comments - see below).
What you can do if performance is not an issue is to use the following idiom - either:
Collections.synchronizedMap(new HashMap<YourKeyType, YourValueType>());
Collections.synchronizedList(new ArrayList<YourType>());
You'll then also need to explicitly synchronize over those objects.
This will ensure the order of operations is consistent at the cost of blocking, and you should get the last "right" size at all times.
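A minimal sketch of that idiom (the BidStore class and its method names are mine, with bids simplified to Strings). Note that single calls like add() and size() are synchronized by the wrapper, but any compound traversal, such as copying the list, still needs an explicit synchronized block on the list itself, per the Collections.synchronizedList Javadoc:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BidStore {
    // Each individual method call on this wrapper is synchronized.
    private final List<String> bids = Collections.synchronizedList(new ArrayList<>());

    public void addBid(String bid) {
        bids.add(bid); // synchronized by the wrapper
    }

    public int bidCount() {
        return bids.size(); // exact: reflects the last completed write
    }

    public List<String> snapshot() {
        synchronized (bids) { // required for compound traversal/iteration
            return new ArrayList<>(bids);
        }
    }
}
```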
You can use LinkedBlockingQueue. It is blocking (as opposed to the CLQ), but its size is maintained as a counter rather than computed by scanning the queue, as in the CLQ.
I have a data store that is written to by multiple message listeners. Each of these message listeners can also be running as hundreds of individual threads.
The data store is a PriorityBlockingQueue, as it needs to order the inserted objects by a timestamp. To make checking for queued items efficient, rather than looping over the queue, a ConcurrentHashMap is used as a form of index.
private Map<String, SLAData> SLADataIndex = new ConcurrentHashMap<String, SLAData>();
private BlockingQueue<SLAData> SLADataQueue;
Question 1: is this an acceptable design, or should I just use the single PriorityBlockingQueue?
Each message listener performs an operation; these listeners are scaled up to multiple threads.
Insert method (it inserts into both):
this.SLADataIndex.put(dataToWrite.getMessageId(), dataToWrite);
this.SLADataQueue.add(dataToWrite);
Update Method
this.SLADataIndex.get(messageId).setNodeId(updatedNodeId);
Delete Method
SLATupleData data = this.SLADataIndex.get(messageId);
// remove(Object) is O(n) in PriorityBlockingQueue (linear scan)
this.SLADataQueue.remove(data);
// remove from index
this.SLADataIndex.remove(messageId);
Question Two Using these methods is this the most efficient way? They have wrappers around them via another object for error handling.
Question Three: Using a ConcurrentHashMap and a BlockingQueue, does this mean these operations are thread safe? I don't need to use a lock object?
Question Four When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?
Question 1: is this an acceptable design, or should I just use the single PriorityBlockingQueue?
Certainly you should try to use a single Queue. Keeping the two collections in sync is going to require a lot more synchronization complexity and worry in your code.
Why do you need the Map? If it is just to call setNodeId(...) then I would have the processing thread do that itself when it pulls from the Queue.
// processing thread
while (!Thread.currentThread().isInterrupted()) {
dataToWrite = queue.take();
dataToWrite.setNodeId(myNodeId);
// process data
...
}
Question Two Using these methods is this the most efficient way? They have wrappers around them via another object for error handling.
Sure, that seems fine but, again, you will need to do some synchronization locking otherwise you will suffer from race conditions keeping the 2 collections in sync.
Question Three: Using a ConcurrentHashMap and a BlockingQueue, does this mean these operations are thread safe? I don't need to use a lock object?
Both of those classes (ConcurrentHashMap and the BlockingQueue implementations) are thread-safe, yes. BUT since there are two of them, you can have race conditions where one collection has been updated but the other one has not. Most likely, you will have to use a lock object to ensure that both collections are properly kept in sync.
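A hedged sketch of that lock-object approach (the SlaStore class, its method names, and the size() helper are mine; types are simplified to String for brevity). The point is that a single shared monitor makes the update to *both* collections appear atomic to the other methods:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

public class SlaStore {
    private final Object lock = new Object(); // guards the *pair* of collections
    private final Map<String, String> index = new ConcurrentHashMap<>();
    private final PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<>();

    public void insert(String messageId, String data) {
        synchronized (lock) { // both updates happen, or neither is visible
            index.put(messageId, data);
            queue.add(data);
        }
    }

    public void delete(String messageId) {
        synchronized (lock) {
            String data = index.remove(messageId);
            if (data != null) {
                queue.remove(data); // O(n) linear scan in PriorityBlockingQueue
            }
        }
    }

    public int size() { // illustration only: consistent count of queued items
        synchronized (lock) {
            return queue.size();
        }
    }
}
```

This trades some concurrency for consistency; each collection is already thread-safe on its own, but the lock is what keeps the two in sync.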
Question Four When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?
That's a tough question to answer without seeing the code in question. For example, someone might be calling Insert(...) and have added the item to the Map but not to the queue yet, when another thread calls Delete(...); the item would get found in the Map and removed, but queue.remove() would not find it in the queue since the Insert(...) has not finished in the other thread.
I have a list of personId. There are two API calls to update it (add and remove):
public void add(String newPersonName) {
    if (personNameIdMap.get(newPersonName) != null) {
        myPersonId.add(personNameIdMap.get(newPersonName));
    } else {
        // get the id from Twitter and add to the list
    }
    // make an API call to Twitter
}
public void delete(String personName) {
    if (personNameIdMap.get(personName) != null) {
        myPersonId.remove(personNameIdMap.get(personName));
    } else {
        // wrong person name
    }
    // make an API call to Twitter
}
I know there can be concurrency problem. I read about 3 solutions:
synchronized the method
use Collections.synchronizedList()
CopyOnWriteArrayList
I am not sure which one to prefer to prevent the inconsistency.
1) synchronized the method
2) use Collections.synchronizedlist
3) CopyOnWriteArrayList ..
All will work, it's a matter of what kind of performance / features you need.
Method #1 and #2 are blocking methods. If you synchronize the methods, you handle concurrency yourself. If you wrap a list in Collections.synchronizedList, it handles it for you. (IMHO #2 is safer -- just be sure to use it as the docs say, and don't let anything access the raw list that is wrapped inside the synchronizedList.)
CopyOnWriteArrayList is one of those weird things that has use in certain applications. It's a non-blocking quasi-immutable list, namely, if Thread A iterates through the list while Thread B is changing it, Thread A will iterate through a snapshot of the old list. If you need non-blocking performance, and you are rarely writing to the list, but frequently reading from it, then perhaps this is the best one to use.
edit: There are at least two other options:
4) use Vector instead of ArrayList; Vector implements List and is already synchronized. However, it's generally frowned upon, as it's considered an old-school class (it has been there since Java 1.0!), and should be equivalent to #2.
5) access the List serially from only one thread. If you do this, you're guaranteed not to have any concurrency problems with the List itself. One way to do this is to use Executors.newSingleThreadExecutor and queue up tasks one-by-one to access the list. This moves the resource contention from your list to the ExecutorService; if the tasks are short, it may be fine, but if some are lengthy they may cause others to block longer than desired.
In the end you need to think about concurrency at the application level: thread-safety should be a requirement, and find out how to get the performance you need with the simplest design possible.
On a side note, you're calling personNameIdMap.get(newPersonName) twice in add() and delete(). This suffers from concurrency problems if another thread modifies personNameIdMap between the two calls in each method. You're better off doing
PersonId id = personNameIdMap.get(newPersonName);
if (id != null) {
    myPersonId.add(id);
} else {
    // something else
}
Collections.synchronizedList is the easiest to use and probably the best option. It simply wraps the underlying list with synchronized. Note that multi-step operations (e.g. a for loop) still need to be synchronized by you.
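To illustrate the multi-step caveat with a minimal sketch: a check-then-act pair on a synchronizedList must hold the list's own monitor, otherwise another thread can slip in between the two calls (the value 42 here is arbitrary):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SyncListDemo {
    public static void main(String[] args) {
        List<Integer> ids = Collections.synchronizedList(new ArrayList<>());
        ids.add(42);

        // Each single call is safe on its own, but the contains+add pair
        // is only atomic while we hold the list's monitor:
        synchronized (ids) {
            if (!ids.contains(42)) {
                ids.add(42);
            }
        }
        System.out.println(ids.size()); // 1: the duplicate was not added
    }
}
```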
Some quick things
Don't synchronize the method unless you really need to - It just locks the entire object until the method completes; hardly a desirable effect
CopyOnWriteArrayList is a very specialized list that most likely you wouldn't want, since you have an add method. It's essentially a normal ArrayList, but each time something is added the whole backing array is copied, a very expensive task. It's thread safe, but hardly the desired result here.
Synchronized is the old way of working with threads. Avoid it in favor of the newer idioms, mostly found in the java.util.concurrent package.
See 1.
A CopyOnWriteArrayList has fast read and slow writes. If you're making a lot of changes to it, it might start to drag on your performance.
Concurrency isn't about an isolated choice of what mechanism or type to use in a single method. You'll need to think about it from a higher level to understand all of its impacts.
Are you making changes to personNameIdMap within those methods, or any other data structures access to which should also be synchronized? If so, it may be easiest to mark the methods as synchronized; otherwise, you might consider using Collections.synchronizedList to get a synchronized view of myPersonId and then doing all list operations through that synchronized view. Note that you should not manipulate myPersonId directly in this case, but do all accesses solely through the list returned from the Collections.synchronizedList call.
Either way, you have to make sure that there can never be a situation where a read and a write or two writes could occur simultaneously to the same unsynchronized data structure. Data structures documented as thread-safe or returned from Collections.synchronizedList, Collections.synchronizedMap, etc. are exceptions to this rule, so calls to those can be put anywhere. Non-synchronized data structures can still be used safely inside methods declared to be synchronized, however, because such methods are guaranteed by the JVM to never run at the same time, and therefore there could be no concurrent reading / writing.
In your case from the code that you posted, all 3 ways are acceptable. However, there are some specific characteristics:
#3: This should have the same effect as #2 but may run faster or slower depending on the system and workload.
#1: This way is the most flexible. Only with #1 can you make the add() and delete() methods more complex. For example, if you need to read or write multiple items in the list, then you cannot use #2 or #3, because some other thread can still see the list being half updated.
Java concurrency (multi-threading):
Concurrency is the ability to run several programs, or several parts of a program, in parallel. If a time-consuming task can be performed asynchronously or in parallel, this improves the throughput and the interactivity of the program.
Java supports concurrent programming out of the box: threads, immutability, the executor framework (thread pools), futures, callables, and the fork/join framework.
My program has 100 threads.
Every single thread does this:
1) if arrayList is empty, add element with certain properties to it
2) if arrayList is not empty, iterate through its elements; if a suitable element is found (matching certain properties), take it and remove it from the arrayList
The problem here is that while one thread is iterating through the arrayList, the other 99 threads are waiting for the lock on the arrayList.
What would you suggest to me if I want all 100 threads to work in lock-less condition? So they all have work to do?
Thanks
Have you looked at shared vs exclusive locking? You could use a shared lock on the list, and then have a 'deleted' property on the list elements. The predicate you use to check the list elements would need to make sure the element is not marked 'deleted' in addition to whatever other queries you have - also due to potential read-write conflicts, you would need to lock on each element as you traverse. Then periodically get an exclusive lock on the list to perform the deletes for real.
The read lock allows for a lot of concurrency on the list. The exclusive locks on each element of the list are not as nice, but you need to force the memory model to update your 'deleted' flag to each thread, so there's no way around that.
First, if you're not running on a machine that has 64 cores or more, your 100 threads are probably a performance hog in themselves.
Then an ArrayList for what you're describing is certainly not a good choice because removing an element does not run in amortized constant time but in linear time O(n). So that's a second performance hog. You probably want to use a LinkedList instead of your ArrayList (if you insist on using a List).
Now of course I doubt very much that you need to iterate over your complete list each time you need to find one element: wouldn't another data structure be more appropriate? Maybe that the elements that you put in your list have such a concept as "equality" and hence a Map with an O(1) lookup time could be used instead?
That's just for a start: as I showed, there are at least two serious performance issues in what you described. Maybe you should clarify your question if you want more help.
If your notion of "suitable element (matching certain properties)" can be encoded using a Comparator then a PriorityBlockingQueue would allow each thread to poll the queue, taking the next element without having to search the list or enqueuing a new element if the queue is empty.
Addendum: Thilo raises an essential point: as your approach evolves, you may want to determine empirically how many threads are optimal.
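A minimal sketch of the PriorityBlockingQueue suggestion (the Task record and its priority field are illustrative assumptions standing in for "certain properties"). Each worker just takes the head of the queue; there is no list traversal and no global lock held during the search:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class WorkQueueDemo {
    // Hypothetical work item: "most suitable" = lowest priority value.
    record Task(String name, int priority) {}

    public static void main(String[] args) throws InterruptedException {
        PriorityBlockingQueue<Task> queue =
                new PriorityBlockingQueue<>(16, Comparator.comparingInt(Task::priority));
        queue.add(new Task("low", 5));
        queue.add(new Task("urgent", 1));

        // A worker thread simply takes the best-matching element,
        // blocking if the queue happens to be empty.
        Task next = queue.take();
        System.out.println(next.name()); // urgent
    }
}
```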
The key is to take the lock on the ArrayList only when you actually need to.
A good idea would be to subclass ArrayList and synchronize the individual read, write, and delete operations.
This gives you fine-grained locking, allowing threads to move through the list while still protecting the ArrayList's semantics.
Have a single thread own the array and be responsible for adding to it and iterating over it to find work to do. Once a unit of work is found, put the work on a BlockingQueue. Have all your worker threads use take() to remove work from the queue.
This allows multiple units of work to be discovered per pass through the array and they can be handed off to waiting worker threads fairly efficiently.
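One possible shape for this single-owner design (the Dispatcher class, the isSuitable predicate, and the String work items are all illustrative). Only the owner thread touches the plain ArrayList, so it needs no synchronization; the hand-off to workers goes through a BlockingQueue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Dispatcher {
    private final List<String> work = new ArrayList<>(); // owner thread only
    private final BlockingQueue<String> handoff = new LinkedBlockingQueue<>();

    // Owner thread: scans the (unsynchronized) list and hands out matches.
    public void dispatchSuitable() throws InterruptedException {
        for (var it = work.iterator(); it.hasNext(); ) {
            String item = it.next();
            if (isSuitable(item)) {
                it.remove();       // safe: only this thread mutates the list
                handoff.put(item); // workers block in take() until work arrives
            }
        }
    }

    public void submit(String item) { // owner thread only
        work.add(item);
    }

    public String takeWork() throws InterruptedException { // called by workers
        return handoff.take();
    }

    private boolean isSuitable(String item) {
        return item.startsWith("ready"); // placeholder predicate
    }
}
```

Workers calling takeWork() never contend for the list itself, only for the queue, which LinkedBlockingQueue handles efficiently.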