Iterating over single List in parallel without duplicates in Java

I have an ArrayList filled with 'someObject'. I need to iterate over this list, with 4 different threads (using Futures & Callables). The threads will keep the top 5 valued objects it comes across. I first tried creating a parallel stream, but that didn't work out so well. Is there some obvious thing I'm not thinking of, so each thread can iterate over the objects, without possibly grabbing the same object twice?

You can use an AtomicInteger to iterate over the list:
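import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
class MyRunnable implements Runnable {
    final List<SomeObject> list;
    final AtomicInteger counter; // shared by all workers, starts at 0
    MyRunnable(List<SomeObject> list, AtomicInteger counter) {
        this.list = list;
        this.counter = counter;
    }
    @Override
    public void run() {
        while (true) {
            int index = counter.getAndIncrement(); // claims the next unprocessed index
            if (index >= list.size()) {
                return; // every element has already been claimed
            }
            SomeObject item = list.get(index);
            // do something with item
        }
    }
}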
As long as every MyRunnable shares the same AtomicInteger instance, no index is handed out twice.
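As a rough illustration of the shared-counter idea above, here is a minimal sketch that uses Callables and Futures, as the question asks. The class name SharedIndexDemo and the method parallelSum are made up for this example, and summing integers stands in for whatever per-element work is really needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the original answer.
class SharedIndexDemo {
    /** Sums the list with `threads` workers that claim indices from one shared counter. */
    static long parallelSum(List<Integer> list, int threads) {
        AtomicInteger counter = new AtomicInteger(0); // shared by all workers
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Long> worker = () -> {
                    long local = 0;
                    int index;
                    // getAndIncrement hands each index to exactly one worker
                    while ((index = counter.getAndIncrement()) < list.size()) {
                        local += list.get(index);
                    }
                    return local;
                };
                futures.add(pool.submit(worker));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // wait for each worker and combine the partial sums
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every element is claimed exactly once, the combined result matches a sequential sum, provided the list is not modified while the workers run.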

You don't need AtomicInteger or any other synchronization for that matter.
You should simply logically partition your list (whose size is known upfront) based on the number of processing threads (whose number is also known upfront) and let each of them operate on its own section of [from, to) of the list.
This avoids the need for any synchronization at all (even a cheap one such as AtomicInteger), which is what you should always strive for (as long as it's safe).
Pseudo code
import java.util.List;
import com.google.common.base.Preconditions; // assuming Guava's Preconditions
abstract class Worker<T> implements Runnable {
    final List<T> toProcess;
    protected Worker(List<T> list, int fromInc, int toExcl) {
        // note: this does not allow passing an empty list or specifying an empty work section, but you can relax that if you wish
        // this also implicitly checks the list for null
        Preconditions.checkArgument(fromInc >= 0 && fromInc < list.size());
        Preconditions.checkArgument(toExcl > fromInc && toExcl <= list.size());
        // note: this does not create a copy, only a view, so it is very cheap
        toProcess = list.subList(fromInc, toExcl);
    }
    @Override
    public final void run() {
        for (final T t : toProcess) {
            process(t);
        }
    }
    protected abstract void process(T t);
}
As with the AtomicInteger solution (really any solution which does not involve copying the list), this solution also assumes that you will not be modifying the list once you have handed it off to each thread and processing has commenced. Modifying the list while processing is in progress will result in undefined behavior.
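To tie the partitioning idea back to the original question (several threads, each keeping its top 5 values), here is a minimal sketch. It assumes the elements are plain Integers and uses a min-heap per section; the names PartitionedTopK, topK and parallelTopK are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of the original answer.
class PartitionedTopK {
    /** Keeps the k largest values of `values` in a min-heap and returns them (unordered). */
    static List<Integer> topK(List<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest element on top
        for (Integer v : values) {
            heap.add(v);
            if (heap.size() > k) {
                heap.poll(); // drop the current smallest
            }
        }
        return new ArrayList<>(heap);
    }

    /** Splits the list into disjoint sections, finds each section's top k, then merges. */
    static List<Integer> parallelTopK(List<Integer> list, int threads, int k) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (list.size() + threads - 1) / threads; // ceiling division
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int from = 0; from < list.size(); from += chunk) {
                // subList is a view: no copying, and the [from, to) ranges never overlap
                List<Integer> section = list.subList(from, Math.min(from + chunk, list.size()));
                futures.add(pool.submit(() -> topK(section, k)));
            }
            List<Integer> candidates = new ArrayList<>();
            for (Future<List<Integer>> f : futures) {
                candidates.addAll(f.get());
            }
            List<Integer> result = topK(candidates, k); // merge the partial results
            result.sort(Collections.reverseOrder());    // largest first
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each worker only reads its own section, so no locking is needed; the only coordination is waiting on the Futures before merging.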

Related

Java ArrayList thread unsafe example explanation

class ThreadUnsafe {
static final int THREAD_NUMBER = 2;
static final int LOOP_NUMBER = 200;
public static void main(String[] args) {
ThreadUnsafe test = new ThreadUnsafe();
for (int i = 0; i < THREAD_NUMBER; i++) {
new Thread(() -> {
test.method1(LOOP_NUMBER);
}, "Thread" + i).start();
}
}
ArrayList<String> list = new ArrayList<>();
public void method1(int loopNumber) {
for (int i = 0; i < loopNumber; i++) {
method2();
method3();
}
}
private void method2() {
list.add("1");
}
private void method3() {
list.remove(0);
}
}
The code above throws
java.lang.IndexOutOfBoundsException: Index: 0, Size: 1
I know ArrayList is not thread-safe, but in this example I think every remove() call is guaranteed to be preceded by at least one add() call, so the code should be OK even if the order is mixed up like the following:
thread0: method2()
thread1: method2()
thread1: method3()
thread0: method3()
Some explanations needed here, please.

If one add() or remove() call always finished completely before another one started, your reasoning would be correct. But ArrayList doesn't guarantee that, as its methods aren't synchronized. So it can happen that two threads are in the middle of modifying calls at the same time.
Let's look at the internals of e.g. the add() method to understand one possible failure mode.
When adding an element, ArrayList increases the size using size++. And this is not atomic.
Now imagine the list being empty, and two threads A and B adding an element at exactly the same moment, doing the size++ in parallel (maybe in different CPU cores). Let's imagine things happen in the following order:
A reads size as 0.
B reads size as 0.
A adds one to its value, giving 1.
B adds one to its value, giving 1.
A writes its new value back into the size field, resulting in size=1.
B writes its new value back into the size field, resulting in size=1.
Although we had 2 add() calls, the size is only 1. If now you try to remove 2 elements (and this time it happens sequentially), the second remove() will fail.
To achieve thread safety, no other thread should be able to mess around with the internals like size (or the elements array) while one access is currently in progress.
Multi-threading is inherently complex: calls from multiple threads can not only happen in any (expected or unexpected) order, they can also overlap, unless protected by some mechanism like synchronized. On the other hand, excessive use of synchronization can easily lead to poor multi-thread performance, and also to deadlocks.
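The lost update described in the steps above can be replayed deterministically on a single thread. This is only a sketch: the static field size stands in for ArrayList's internal size field, and the class name LostUpdateReplay is invented for the illustration.

```java
// Hypothetical demo class: replays the interleaving from the answer on one thread.
class LostUpdateReplay {
    static int size = 0; // stands in for ArrayList's internal size field

    /** Performs two "size++" operations in the interleaved order and returns the result. */
    static int replay() {
        size = 0;
        int readByA = size;      // A reads size as 0
        int readByB = size;      // B reads size as 0
        int newA = readByA + 1;  // A adds one to its value, giving 1
        int newB = readByB + 1;  // B adds one to its value, giving 1
        size = newA;             // A writes 1
        size = newB;             // B writes 1, overwriting A's update
        return size;
    }

    public static void main(String[] args) {
        System.out.println("size after two increments: " + replay()); // prints 1, not 2
    }
}
```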

As a supplement to @RalfKleberhoff's answer,
I think every remove() call is guaranteed to be preceded by at least one add() call,
Yes.
so the code should be OK even the order is messed up
No, that is not a valid inference with respect to a multithreaded program.
Your program contains data races as a result of two threads both accessing the same shared, non-atomic object, with some of those accesses being writes, without appropriate synchronization. The Java Memory Model gives only very weak guarantees about a program that contains data races, so you cannot draw reliable conclusions about its behavior by reasoning about the order of calls.
Do not try to cheat or scrimp on synchronization. Do minimize the amount of it that you need by limiting your use of shared objects, but where you need it, you need it, and the rules for determining when and where you need it are not that hard to learn.

ArrayList in java docs says,
Note that this implementation is not synchronized. If multiple threads
access an ArrayList instance concurrently, and at least one of the
threads modifies the list structurally, it must be synchronized
externally.
Why is this code not thread-safe?
Multiple threads running on a machine run independently of each other.
public void method1(int loopNumber) {
for (int i = 0; i < loopNumber; i++) {
method2();
method3();
}
}
Here method2() and method3() are processed sequentially within
one thread, but not across threads. The ArrayList list is shared by both threads, so on a multi-core system it can be in an inconsistent state between them.
An interesting test would be to add an empty check in method3() and set LOOP_NUMBER = 10000;
private void method3()
{
if (!list.isEmpty())
list.remove(0);
}
As a result you should still get the same runtime exception, something like java.lang.IndexOutOfBoundsException: Index: 0, Size: 1 or java.lang.IndexOutOfBoundsException: Index: 0, Size: 0, for the same reason: the list's internal state (i.e. size) is inconsistent.
To fix this issue you could add synchronized as shown below, or use a synchronized list.
public void method1(int loopNumber)
{
for (int i = 0; i < loopNumber; i++)
{
synchronized (list)
{
method2();
method3();
}
}
}
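The other option mentioned above, a synchronized list, could look roughly like the following sketch. It assumes Collections.synchronizedList is an acceptable replacement for the bare ArrayList; the class name SynchronizedListDemo and the method run are invented here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class: the add/remove loop from the question on a synchronized list.
class SynchronizedListDemo {
    /** Runs the loop with `threadCount` threads and returns the final list size. */
    static int run(int threadCount, int loops) {
        List<String> list = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < loops; i++) {
                    list.add("1");  // each call is atomic with respect to other calls on the wrapper
                    list.remove(0); // this thread's own add guarantees size >= 1 at this point
                }
            });
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return list.size();
    }
}
```

Here every individual add() and remove() completes before the next one starts, so the reasoning from the question ("every remove is preceded by an add") becomes valid and the list ends up empty.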

Java Concurrency In Practice. Listing 5.6

In Java Concurrency in Practice, the author gives the following example of a class that is not thread-safe: behind the scenes it invokes an iterator on a set, and if multiple threads are involved this may cause a ConcurrentModificationException. This is understood: one thread is modifying the collection while another is iterating over it, and boom!
What I do not understand is the author's claim that this code can be fixed by wrapping the HashSet with Collections.synchronizedSet(). How would this fix the problem? Even though access to all the methods will be synchronized and guarded by the same intrinsic lock, once the iterator object is obtained, there is no guarantee that another thread won't modify the collection while the iteration is in progress.
Quote from the book:
If HiddenIterator wrapped the HashSet with a synchronizedSet, encapsulating the synchronization, this sort of error would not occur.
public class HiddenIterator {
//Solution :
//If HiddenIterator wrapped the HashSet with a synchronizedSet, encapsulating the synchronization,
//this sort of error would not occur.
//@GuardedBy("this")
private final Set<Integer> set = new HashSet<Integer>();
public synchronized void add(Integer i) {
set.add(i);
}
public synchronized void remove(Integer i) {
set.remove(i);
}
public void addTenThings() {
Random r = new Random();
for (int i = 0; i < 10; i++)
add(r.nextInt());
/*The string concatenation gets turned by the compiler into a call to StringBuilder.append(Object),
* which in turn invokes the collection's toString method - and the implementation of toString in
* the standard collections iterates the collection and calls toString on each element to
* produce a nicely formatted representation of the collection's contents. */
System.out.println("DEBUG: added ten elements to " + set);
}
}
If someone could help me understand that, I'd be grateful.
Here is how I think it could've been fixed:
public class HiddenIterator {
private final Set<Integer> set = Collections.synchronizedSet(new HashSet<Integer>());
public void add(Integer i) {
set.add(i);
}
public void remove(Integer i) {
set.remove(i);
}
public void addTenThings() {
Random r = new Random();
for (int i = 0; i < 10; i++)
add(r.nextInt());
// synchronizing in set's intrinsic lock
synchronized(set) {
System.out.println("DEBUG: added ten elements to " + set);
}
}
}
Or, as an alternative, one could keep synchronized keyword for add() and remove() methods. We'd be synchronizing on this in this case. Also, we'd have to add a synchronized block (again sync'ed on this) into addTenThings(), which would contain a single operation - logging with implicit iteration:
public class HiddenIterator {
private final Set<Integer> set = new HashSet<Integer>();
public synchronized void add(Integer i) {
set.add(i);
}
public synchronized void remove(Integer i) {
set.remove(i);
}
public void addTenThings() {
Random r = new Random();
for (int i = 0; i < 10; i++)
add(r.nextInt());
synchronized(this) {
System.out.println("DEBUG: added ten elements to " + set);
}
}
}

Collections.synchronizedSet() wraps the collection in an instance of an internal class called SynchronizedSet, which extends SynchronizedCollection. Now let's look at how SynchronizedCollection.toString() is implemented:
public String toString() {
synchronized (mutex) {return c.toString();}
}
Basically the iteration is still there, hidden in the c.toString() call, but it's already synchronized with all other methods of this wrapper collection. So you don't need to repeat the synchronization in your code.
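A small experiment can make this visible. The sketch below (class and method names are invented) builds debug strings from a synchronizedSet while another thread keeps modifying it; since toString() holds the wrapper's lock for the whole hidden iteration, no ConcurrentModificationException should occur.

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class: hidden iteration via toString() on a synchronized set.
class SynchronizedToStringDemo {
    /** Returns true if no ConcurrentModificationException was seen while printing during writes. */
    static boolean run(int writes) {
        Set<Integer> set = Collections.synchronizedSet(new HashSet<>());
        Thread writer = new Thread(() -> {
            for (int i = 0; i < writes; i++) {
                set.add(i);
                set.remove(i - 1); // keep the set small so toString() stays cheap
            }
        });
        writer.start();
        boolean ok = true;
        try {
            while (writer.isAlive()) {
                // string concatenation calls set.toString(), which iterates under the wrapper's mutex
                String ignored = "DEBUG: " + set;
            }
            writer.join();
        } catch (ConcurrentModificationException e) {
            ok = false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ok = false;
        }
        return ok;
    }
}
```

Swapping the synchronizedSet for a plain HashSet in this sketch would make the loop unsafe, which is the situation the book's listing describes.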

Edited
synchronizedSet()::toString()
As Sergei Petunin rightly pointed out, the toString() method of the set returned by Collections.synchronizedSet() takes care of synchronization internally, so no manual synchronization is necessary in this case.
external iteration on synchronizedSet()
once the iterator object is obtained, there is no guarantee that the other thread won't modify the collection once an iteration is being made.
In cases of external iteration, such as a for-each loop or an explicit Iterator, wrapping that iteration in a synchronized (set) block is both required and sufficient.
That's why the JavaDoc of Collections.synchronizedSortedSet() (the same requirement applies to synchronizedSet()) states that
It is imperative that the user manually synchronize on the returned
sorted set when iterating over it or any of its subSet, headSet, or
tailSet views.
SortedSet s = Collections.synchronizedSortedSet(new TreeSet());
...
synchronized (s) {
Iterator i = s.iterator(); // Must be in the synchronized block
while (i.hasNext())
foo(i.next());
}
manual synchronization
Your second version, with the synchronized add/remove methods of the class HiddenIterator and synchronized (this), would work too, but combined with a synchronized wrapper it introduces unnecessary overhead, as adding/removing would be synchronized twice (by HiddenIterator and by Collections.synchronizedSet(..)).
However, in this case you can omit Collections.synchronizedSet(..), as HiddenIterator takes care of all the synchronization required when accessing the private Set field.

Multiple threads checking map size and concurrency

I have a method that's supposed to feed a map from a queue, and it only does that if the map size does not exceed a certain number. This caused a concurrency problem, as the size each thread sees is not globally consistent. I reproduced the problem with this code:
import java.sql.Timestamp;
import java.util.Date;
import java.util.concurrent.ConcurrentHashMap;
public class ConcurrenthashMapTest {
private ConcurrentHashMap<Integer, Integer> map = new ConcurrentHashMap<Integer, Integer>();
private ThreadUx[] tArray = new ThreadUx[999];
public void parallelMapFilling() {
for ( int i = 0; i < 999; i++ ) {
tArray[i] = new ThreadUx( i );
}
for ( int i = 0; i < 999; i++ ) {
tArray[i].start();
}
}
public class ThreadUx extends Thread {
private int seq = 0;
public ThreadUx( int i ) {
seq = i;
}
@Override
public void run() {
while ( map.size() < 2 ) {
map.put( seq, seq );
System.out.println( Thread.currentThread().getName() + " || The size is: " + map.size() + " || " + new Timestamp( new Date().getTime() ) );
}
}
}
public static void main( String[] args ) {
new ConcurrenthashMapTest().parallelMapFilling();
}
}
Normally I should have only one line of output and the size not exceeding 1, but I do have some stuff like this
Thread-1 || The size is: 2 || 2016-06-07 18:32:55.157
Thread-0 || The size is: 2 || 2016-06-07 18:32:55.157
I tried marking the whole run method as synchronized but that didn't work, only when I did this
@Override
public void run() {
synchronized ( map ) {
if ( map.size() < 1 ) {
map.put( seq, seq );
System.out.println( Thread.currentThread().getName() + " || The size is: " + map.size() + " || " + new Timestamp( new Date().getTime() ) );
}
}
}
It worked. Why does only the synchronized block work and not the synchronized method? Also, I don't want to use something as old as a synchronized block, as I am working on a Java EE app. Is there a Spring or Java EE task executor or annotation that can help?

From Java Concurrency in Practice:
The semantics of methods of ConcurrentHashMap that operate on the entire Map, such as size and isEmpty, have been slightly weakened to reflect the concurrent nature of the collection. Since the result of size could be out of date by the time it is computed, it is really only an estimate, so size is allowed to return an approximation instead of an exact count. While at first this may seem disturbing, in reality methods like size and isEmpty are far less useful in concurrent environments because these quantities are moving targets. So the requirements for these operations were weakened to enable performance optimizations for the most important operations, primarily get, put, containsKey, and remove.
The one feature offered by the synchronized Map implementations but not by ConcurrentHashMap is the ability to lock the map for exclusive access. With Hashtable and synchronizedMap, acquiring the Map lock prevents any other thread from accessing it. This might be necessary in unusual cases such as adding several mappings atomically, or iterating the Map several times and needing to see the same elements in the same order. On the whole, though, this is a reasonable tradeoff: concurrent collections should be expected to change their contents continuously.
Solutions:
First. Refactor the design and do not use the size method under concurrent access.
Second. To use methods such as size and isEmpty, you can use a synchronized collection from Collections.synchronizedMap. Synchronized collections achieve their thread safety by serializing all access to the collection's state. The cost of this approach is poor concurrency; when multiple threads contend for the collection-wide lock, throughput suffers. You will also need to synchronize the block that checks and puts on the map instance, because it is a compound action.
Third. Use a third-party implementation or write your own.
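import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
public class BoundConcurrentHashMap<K, V> {
    private final Map<K, V> m;
    private final Semaphore semaphore; // one permit per free slot
    public BoundConcurrentHashMap(int size) {
        m = new ConcurrentHashMap<K, V>();
        semaphore = new Semaphore(size);
    }
    public V get(Object key) {
        return m.get(key);
    }
    public boolean put(K key, V value) {
        boolean hasSpace = semaphore.tryAcquire();
        if (hasSpace) {
            // replacing an existing mapping does not use a new slot, so give the permit back
            if (m.put(key, value) != null) {
                semaphore.release();
            }
        }
        return hasSpace;
    }
    public void remove(Object key) {
        // release a permit only if a mapping was actually removed
        if (m.remove(key) != null) {
            semaphore.release();
        }
    }
    // approximation, do not trust this method
    public int size() {
        return m.size();
    }
}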
The class BoundConcurrentHashMap is about as effective as ConcurrentHashMap and almost thread-safe: removing an element and releasing the semaphore in the remove method are two separate steps rather than one atomic action, as they ideally should be, but in this case that is tolerable. The size method still returns an approximate value, but the put method will not allow the map to exceed its bound.
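For comparison, the second option (a synchronized map with the check-and-put guarded by the map's own lock) might be sketched as follows. The class name BoundedSynchronizedMapDemo and the method fill are made up for the example; it relies on the documented fact that the map returned by Collections.synchronizedMap synchronizes on itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: an atomic "check size, then put" on a synchronized map.
class BoundedSynchronizedMapDemo {
    /** Starts `threadCount` threads that each try one insert; returns the final map size. */
    static int fill(int threadCount, int limit) {
        Map<Integer, Integer> map = Collections.synchronizedMap(new HashMap<>());
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            final int seq = t;
            Thread thread = new Thread(() -> {
                // size() and put() must run under one lock to form an atomic check-then-act
                synchronized (map) {
                    if (map.size() < limit) {
                        map.put(seq, seq);
                    }
                }
            });
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return map.size();
    }
}
```

All threads lock the same object (the map), so the size never exceeds the limit, at the cost of serializing every access.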

You are using ConcurrentHashMap, and according to the API doc:
Bear in mind that the results of aggregate status methods including
size, isEmpty, and containsValue are typically useful only when a map
is not undergoing concurrent updates in other threads. Otherwise the
results of these methods reflect transient states that may be adequate
for monitoring or estimation purposes, but not for program control.
This means you cannot get an accurate result unless you explicitly synchronize access to size().
Adding synchronized to the run method does not work because the threads are not synchronizing on the same lock object: each ThreadUx instance locks on itself.
Synchronizing on the map itself definitely works, but IMHO it's not a good choice because you then lose the performance advantage ConcurrentHashMap can provide.
In conclusion you need to reconsider the design.

Accessing list using multiple threads

Is the compute() function thread safe? Will multiple threads loop correctly over the list?
class Foo {
private List<Integer> list;
public Foo(List<Integer> list) {
this.list = list;
}
public void compute() {
for (Integer i: list) {
// do some thing with it
// NO LIST modifications
}
}
}

Considering that data does not mutate (as you mentioned in the comment) there will not be any dirty / phantom reads.

If the list is created specifically for the purposes of that method, then you're good to go. That is, if the list isn't modified in any other method or class, then that code is thread safe, since you're only reading.
A general recommendation is to make a read-only copy of the collection, if you're not sure the argument comes from a trustworthy origin (and even if you are sure).
this.list = Collections.unmodifiableList(new ArrayList<Integer>(list));
Note, however, that the elements of the list must also be thread-safe. If, in your real scenario, the list contains some mutable structure, instead of Integer (which are immutable), you should make sure that any modifications to the elements are also thread-safe.
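As a minimal sketch of that recommendation: the constructor below takes a private, read-only copy, so later changes to the caller's list cannot affect iteration. The class name ReadOnlyFoo is invented, and summing the elements is only a placeholder for the real work in compute().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical variant of Foo that owns a read-only copy of its input.
class ReadOnlyFoo {
    private final List<Integer> list;

    ReadOnlyFoo(List<Integer> source) {
        // private copy wrapped read-only: later changes to `source` are not visible here
        this.list = Collections.unmodifiableList(new ArrayList<>(source));
    }

    /** Read-only traversal; safe to call from many threads at once. */
    int compute() {
        int sum = 0;
        for (Integer i : list) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> source = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
        ReadOnlyFoo foo = new ReadOnlyFoo(source);
        source.add(100); // does not affect foo's copy
        Runnable task = () -> System.out.println(foo.compute());
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join(); // both threads print 10
    }
}
```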

If you can guarantee that the list is not modified elsewhere while you're iterating over it that code is thread safe.
I would create a read-only copy of the list though to be absolutely sure that it won't be modified elsewhere:
class Foo {
private List<Integer> list;
public Foo(List<Integer> list) {
this.list = Collections.unmodifiableList(new ArrayList<>(list));
}
public void compute() {
for (Integer i: list) {
// do some thing with it
// NO LIST modifications
}
}
}
If you don't mind adding a dependency to your project I suggest using Guava's ImmutableList:
this.list = ImmutableList.copyOf(list);
It is also a good idea to use Guava's immutable collections wherever you're using collections that aren't changing, since they are inherently thread safe due to being immutable.

You can easily inspect the behavior when having for example 2 threads:
import java.util.Arrays;
public class Test {
    public static void main(String[] args) {
        // both tasks share one Foo, so both threads iterate over the same list
        Foo foo = new Foo(Arrays.asList(1, 2, 3));
        Runnable task1 = () -> foo.compute();
        Runnable task2 = () -> foo.compute();
        new Thread(task1).start();
        new Thread(task2).start();
    }
}
If the list is guaranteed not to be changed anywhere else, iterating over it is thread safe. If you implement compute to simply print the list content, debugging your code should help you see that it is thread safe.

There is a thread-safe list in the java.util.concurrent library: CopyOnWriteArrayList. If you need a list that stays safe even while it is being modified, consider using it.
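To show what CopyOnWriteArrayList actually guarantees, here is a small single-threaded sketch (the class and method names are invented). Its iterators work on a snapshot of the backing array, so modifying the list during iteration neither throws ConcurrentModificationException nor changes what the loop sees. Since every write copies the whole array, it is usually a fit only for lists that are read far more often than they are written.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class: snapshot iteration with CopyOnWriteArrayList.
class CopyOnWriteDemo {
    /** Iterates while adding to the same list; returns how many elements the loop visited. */
    static int iterateWhileAdding() {
        List<Integer> list = new CopyOnWriteArrayList<>(Arrays.asList(1, 2, 3));
        int visited = 0;
        for (Integer value : list) {
            list.add(value * 10); // each write copies the backing array; the iterator keeps its snapshot
            visited++;
        }
        return visited; // 3: the loop never sees the elements added while it was running
    }
}
```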

This version
class Foo {
private final List<Integer> list;
public Foo(List<Integer> list) {
this.list = new ArrayList<>(list);
}
public void compute() {
for(Integer i: list) {
// ...
}
}
}
is thread-safe, if following holds:
list arg to ctor can't be modified during ctor run time (e.g., it is local variable in caller) or thread-safe itself (e.g., CopyOnWriteArrayList);
compute won't modify list contents (just as OP stated). I guess compute should be not void but return some numeric value, to be of any utility...

Java Synchronization - Mutex.wait vs List.wait

While using Java threading primitives to construct a thread-safe bounded queue, what's the difference between these 2 constructs?
Creating an explicit lock object.
Using the list as the lock and waiting on it.
Example of 1
private final Object lock = new Object();
private ArrayList<String> list = new ArrayList<String>();
public String dequeue() throws InterruptedException {
synchronized (lock) {
while (list.size() == 0) {
lock.wait();
}
String value = list.remove(0);
lock.notifyAll();
return value;
}
}
public void enqueue(String value) throws InterruptedException {
synchronized (lock) {
while (list.size() == maxSize) {
lock.wait();
}
list.add(value);
lock.notifyAll();
}
}
Example of 2
private ArrayList<String> list = new ArrayList<String>();
public String dequeue() throws InterruptedException {
synchronized (list) { // lock on list
while (list.size() == 0) {
list.wait(); // wait on list
}
String value = list.remove(0);
list.notifyAll();
return value;
}
}
public void enqueue(String value) throws InterruptedException {
synchronized (list) { // lock on list
while (list.size() == maxSize) {
list.wait(); // wait on list
}
list.add(value);
list.notifyAll();
}
}
Note
This is a bounded list
No other operation is being performed apart from enqueue and dequeue.
I could use a blocking queue, but this question is more for improving my limited knowledge of threading.
If this question is repeated please let me know.

The short answer is, no, there is no functional difference, other than the extra memory overhead of maintaining that extra lock object. However, there are a couple of semantics-related items I would consider before making a final decision.
Will I ever need to perform synchronized operations on more than just my internal list?
Let's say you wanted to maintain a parallel data structure to your ArrayList, such that all operations on the list and that parallel data structure needed to be synchronized. In this case, it might be best to use the external lock, as locking on either the list or the structure might be confusing to future development efforts on this class.
Will I ever give access to my list outside of my queue class?
Let's say you wanted to provide an accessor method for your list, or make it visible to extensions of your Queue class. If you were using an external lock object, classes that retrieved references to the list would never be able to perform thread-safe operations on that list. In that case, it'd be better to synchronize on the list and make it clear in the API that external accesses/modifications to the list must also synchronize on that list.
I'm sure there are more reasons why you might choose one over the other, but these are the two big ones I can think of.
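For reference, a self-contained version of the first variant (private lock object) might look like the sketch below. The class name BoundedQueue, the constructor parameter maxSize and the helper transfer are assumptions added for the example; the question only shows the two methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical complete class built around the question's first example.
class BoundedQueue {
    private final Object lock = new Object(); // private: outside code cannot lock or notify on it
    private final List<String> list = new ArrayList<>();
    private final int maxSize;

    BoundedQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    String dequeue() throws InterruptedException {
        synchronized (lock) {
            while (list.isEmpty()) {
                lock.wait(); // releases the lock while waiting for an element
            }
            String value = list.remove(0);
            lock.notifyAll(); // wake producers waiting for free space
            return value;
        }
    }

    void enqueue(String value) throws InterruptedException {
        synchronized (lock) {
            while (list.size() == maxSize) {
                lock.wait(); // releases the lock while waiting for free space
            }
            list.add(value);
            lock.notifyAll(); // wake consumers waiting for an element
        }
    }

    /** Pushes `count` items through a queue of capacity 2 and returns how many were received. */
    static int transfer(int count) {
        BoundedQueue queue = new BoundedQueue(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.enqueue("item-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int received = 0;
        try {
            for (int i = 0; i < count; i++) {
                queue.dequeue();
                received++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

Replacing lock with list throughout gives the second variant; behavior is the same as long as the list is never exposed outside the class.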
