CopyOnWriteArrayList or Vector - java

All,
The edge the Vector class has over ArrayList is that it is synchronized and hence ensures thread safety. However, between CopyOnWriteArrayList and Vector, which should be preferred when taking both thread safety and performance into consideration?

It depends on the usage pattern: if you have many more reads than writes, use CopyOnWriteArrayList; otherwise use Vector.
Vector introduces a small synchronization delay for each operation, whereas CopyOnWriteArrayList has a longer delay for writes (due to copying) but no delay for reads.
Another consideration is the behaviour of iterators: Vector requires explicit synchronization while you are iterating over it (so that write operations can't run at the same time), whereas CopyOnWriteArrayList doesn't.
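To make the iterator point concrete, here is a minimal sketch (class name and values are my own, not from the question):

import java.util.Arrays;
import java.util.List;
import java.util.Vector;
import java.util.concurrent.CopyOnWriteArrayList;

public class IterationExample {
    public static void main(String[] args) {
        List<String> vector = new Vector<>(Arrays.asList("a", "b", "c"));
        // Vector: individual calls are synchronized, but the traversal as a
        // whole is not, so hold the list's own lock for the entire iteration.
        synchronized (vector) {
            for (String s : vector) {
                System.out.println(s);
            }
        }

        List<String> cowList = new CopyOnWriteArrayList<>(Arrays.asList("a", "b", "c"));
        // CopyOnWriteArrayList: the iterator works on a snapshot, so no
        // external locking is needed and concurrent writes can never cause
        // a ConcurrentModificationException.
        for (String s : cowList) {
            System.out.println(s);
        }
    }
}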

Overall, it depends on the frequency and nature of read and write operations, and the size of the array.
You'll need to benchmark in your context to be sure, but here are some general principles:
If you are only going to read the array, then even ArrayList is thread safe (since the only non-thread-safe operations are those that modify the list). Hence you would want to use a non-synchronised data structure; either ArrayList or CopyOnWriteArrayList would probably work equally well.
If reads are much more common than writes, then you would tend to prefer CopyOnWriteArrayList, since the array copying overhead is only incurred on writes.
If the size of the array is small, then the cost of making array copies will also be small, which favours CopyOnWriteArrayList over Vector.
You may also want to consider two other options:
Use an ArrayList but put the synchronisation elsewhere to ensure thread safety. This is actually the method I personally use most often - basically the idea is to use a separate, higher-level lock to protect all the relevant data structures at the same time. This is much more efficient than having synchronisation on every single operation as Vector does (a sketch of this approach follows the next option).
Consider an immutable persistent data structure - these are guaranteed to be thread safe due to immutability, do not require synchronisation and also benefit from low overhead (i.e. they share most data between different instances rather than making complete new copies). Languages like Clojure use these to get ArrayList-like performance while also guaranteeing full thread safety.
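A rough sketch of the first option above, using one higher-level lock around plain ArrayLists (class and field names are purely illustrative):

import java.util.ArrayList;
import java.util.List;

// One higher-level lock guards several related structures instead of
// paying for synchronization on every single list operation.
public class ViewRegistry {
    private final Object lock = new Object();
    private final List<String> views = new ArrayList<>();          // guarded by lock
    private final List<String> pendingRemovals = new ArrayList<>(); // guarded by lock

    public void addView(String view) {
        synchronized (lock) {
            views.add(view);
            pendingRemovals.remove(view);
        }
    }

    public List<String> snapshotViews() {
        synchronized (lock) {
            // Hand out a copy so callers never touch the unsynchronised list.
            return new ArrayList<>(views);
        }
    }
}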

Related

Non-locking interaction with an array from several threads (in Java)

I need an array of Strings to be accessed from two threads. It has to be very fast and thread-safe. I prefer not to use locks; what approach can I take to make a lock-free, thread-safe array of Strings? I need a recipe in Java.
By definition, the only thread-safe writes available to memory shared by contended threads are actions that are provided by atomic instructions in the CPU. This isn't really relevant for Java (at least, almost all of the time), but it is worth noting that writes without locks in a concurrent environment are possible.
So, this is to say that if you want to write to the array, you are likely going to need locks. Locks are the solution to the general problem.
You can, however, happily share an array between many threads without issue as long as they are only reading from the array. So, if your array is immutable (or any other object for that matter), it will be thread-safe by virtue of there never being an opportunity for contention.
So, let's suppose you want to write to the array from two different threads, but you are worried about contention. Maybe each thread wants to be recording a lot of data. There are several different solutions to this problem; I'll try to explain a few. This isn't exhaustive, because concurrency is a hard problem to solve, and although there are some common approaches, often enough the answer really depends on the specific situation.
The simplest approach
Just use a lock on the array when you write to it and see how it performs. Maybe you don't actually need to worry about performance problems right now.
Use a producer/consumer approach
Rather than having two threads write to the same array, have each of them "produce" values (maybe put them on different thread-safe queues) and have another thread responsible for "consuming" those values (remove them from the queue and put them in the array).
If order matters, this approach can be tricky to implement. But since you are using concurrency, ordering will be fairly non-deterministic anyway.
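A minimal sketch of the producer/consumer idea using a BlockingQueue (the names and the drain-after-join structure are my own simplification):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Producers never touch the array; they only put values on a thread-safe
// queue. A single consumer drains the queue into its own list afterwards.
public class ProducerConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> results = new ArrayList<>();   // only touched by the consumer

        Runnable producer = () -> {
            for (int i = 0; i < 100; i++) {
                queue.offer(Thread.currentThread().getName() + "-" + i);
            }
        };

        Thread p1 = new Thread(producer, "producer-1");
        Thread p2 = new Thread(producer, "producer-2");
        p1.start();
        p2.start();
        p1.join();
        p2.join();

        // Consumer step: drain whatever the producers put on the queue.
        queue.drainTo(results);
        System.out.println("collected " + results.size() + " values");
    }
}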
Do writes in batches
The idea here is that each thread stores the values it wants to put in the array in its own temporary batch. When the batch reaches a large enough size, the thread locks the array and writes the entire batch, as in the sketch below.
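A rough sketch of such batching, assuming each writer thread holds its own instance of the hypothetical BatchingWriter below:

import java.util.ArrayList;
import java.util.List;

// Each thread fills a private buffer and only takes the shared lock when
// the buffer is full, so contention happens once per batch, not per value.
public class BatchingWriter {
    private static final int BATCH_SIZE = 1024;
    private final List<String> shared;                               // guarded by 'shared'
    private final List<String> batch = new ArrayList<>(BATCH_SIZE);  // thread-private buffer

    public BatchingWriter(List<String> shared) {
        this.shared = shared;
    }

    public void write(String value) {
        batch.add(value);
        if (batch.size() >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        synchronized (shared) {
            shared.addAll(batch);
        }
        batch.clear();
    }
}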
Write to separate parts of the array
If you know the size of your data, you can avoid contention by simply not allowing threads to write to the same index ranges. You'd divide the array up by the number of threads. Each thread, when created, would be given a start index into the array.
This option might fit what you are looking for (lock-free, thread-safe).
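A minimal sketch of range partitioning (class name and sizes are illustrative):

// Each thread owns a disjoint slice of the array, so no two threads ever
// write the same index and no lock is needed during the writes.
public class PartitionedWriter implements Runnable {
    private final String[] shared;
    private final int start;
    private final int length;

    public PartitionedWriter(String[] shared, int start, int length) {
        this.shared = shared;
        this.start = start;
        this.length = length;
    }

    @Override
    public void run() {
        for (int i = 0; i < length; i++) {
            shared[start + i] = "value-" + (start + i);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String[] data = new String[1000];
        Thread t1 = new Thread(new PartitionedWriter(data, 0, 500));
        Thread t2 = new Thread(new PartitionedWriter(data, 500, 500));
        t1.start();
        t2.start();
        t1.join();   // join() gives the happens-before edge needed to
        t2.join();   // safely read the array afterwards
    }
}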
How about using the built-in Collections.synchronizedList?

How does ConcurrentHashMap work internally?

I was reading the official Oracle documentation about Concurrency in Java and I was wondering what could be the difference between a Collection returned by
public static <T> Collection<T> synchronizedCollection(Collection<T> c);
and using for example a
ConcurrentHashMap. I'm assuming that I use synchronizedCollection(Collection<T> c) on a HashMap. I know that in general a synchronized collection is essentially just a decorator for my HashMap so it is obvious that a ConcurrentHashMap has something different in its internals. Do you have some information about those implementation details?
Edit: I realized that the source code is publicly available:
ConcurrentHashMap.java
I would read the source of ConcurrentHashMap, as it is rather complicated in the details. In short it:
Has multiple partitions which can be locked independently (16 by default).
Uses concurrent lock operations for thread safety instead of synchronized blocks.
Has thread-safe iterators; synchronizedCollection's iterators are not thread safe.
Does not expose its internal locks, whereas synchronizedCollection does.
The ConcurrentHashMap is very similar to the java.util.Hashtable class, except that ConcurrentHashMap offers better concurrency than Hashtable or synchronizedMap does. ConcurrentHashMap does not lock the Map while you are reading from it. Additionally, ConcurrentHashMap does not lock the entire Map when writing to it; it only locks the part of the Map that is being written to, internally.
Another difference is that ConcurrentHashMap does not throw ConcurrentModificationException if it is changed while being iterated (the Iterator is still not designed to be used by more than one thread, though), whereas a synchronizedMap may throw ConcurrentModificationException in that situation.
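A small sketch of that difference in practice (my own example, not from the linked article):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// ConcurrentHashMap iterators are weakly consistent: modifying the map
// during iteration is allowed and never throws ConcurrentModificationException.
public class WeaklyConsistentIteration {
    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        for (String key : map.keySet()) {
            // Safe here; the same modification would typically throw
            // ConcurrentModificationException on a HashMap or synchronizedMap iterator.
            map.put("d", 4);
            System.out.println(key);
        }
    }
}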
This is the article that helped me understand it: Why ConcurrentHashMap is better than Hashtable and just as good as a HashMap.
Hashtables offer concurrent access to their entries, with a small caveat: the entire map is locked to perform any sort of operation. While this overhead is ignorable in a web application under normal load, under heavy load it can lead to delayed response times and overtaxing of your server for no good reason.
This is where ConcurrentHashMaps step in. They offer all the features of Hashtable with performance almost as good as a HashMap. ConcurrentHashMaps accomplish this by a very simple mechanism. Instead of a map-wide lock, the collection maintains a list of 16 locks by default, each of which is used to guard (or lock on) a single bucket of the map. This effectively means that 16 threads can modify the collection at a single time (as long as they're all working on different buckets). In fact, there is no operation performed by this collection that locks the entire map. The concurrency level of the collection, the number of threads that can modify it at the same time without blocking, can be increased. However, a higher number means more overhead of maintaining this list of locks.
The "scalability issues" for Hashtable are present in exactly the same way in Collections.synchronizedMap(Map) - they use very simple synchronization, which means that only one thread can access the map at the same time.
This is not much of an issue when you have simple inserts and lookups (unless you do it extremely intensively), but becomes a big problem when you need to iterate over the entire Map, which can take a long time for a large Map - while one thread does that, all others have to wait if they want to insert or lookup anything.
The ConcurrentHashMap uses very sophisticated techniques to reduce the need for synchronization, allows parallel read access by multiple threads without synchronization and, more importantly, provides an Iterator that requires no synchronization and even allows the Map to be modified during iteration (though it makes no guarantees about whether elements inserted during iteration will be returned).
The object returned by synchronizedCollection() has all of its methods synchronized on this, so all concurrent operations on such a wrapper are serialized. ConcurrentHashMap is a truly concurrent container with fine-grained locking optimized to keep contention as low as possible. Have a look at the source code and you will see what is inside.
ConcurrentHashMap implements ConcurrentMap, which provides the concurrency guarantees.
Internally, its iterators are designed to be used by only one thread at a time, which maintains the synchronization.
This map is widely used in concurrent code.

Is it a good practice to wrap ConcurrentHashMap read and write operations with ReentrantLock?

I think in the implementation of ConcurrentHashMap, ReentrantLock has already been used. So there is no need to use ReentrantLock for the access of a ConcurrentHashMap object. And that will only add more synchronization overhead. Any comments?
What would you (or anyone) like to achieve with that? ConcurrentHashMap is already thread-safe as it is. Wrapping it with extra locking code would just slow it down significantly, since
it does not lock on reads per se,
even for writes, you can hardly mimic its internal lock partitioning behaviour externally.
In other words, adding extra locking would increase the chance of thread contention significantly (as well as making the thread safety guarantees of read operations stricter, for the record).
ConcurrentHashMap provides an implementation of ConcurrentMap and offers a highly effective solution to the problem of reconciling throughput with thread safety. It is optimized for reading, so retrievals do not block even while the table is being updated (to allow for this, the contract states that the results of retrievals will reflect the latest update operations completed before the start of the retrieval). Updates also can often proceed without blocking, because a ConcurrentHashMap consists of not one but a set of tables, called segments, each of which can be independently locked. If the number of segments is large enough relative to the number of threads accessing the table, there will often be no more than one update in progress per segment at any time.
From Java Generics and Collections, chapter 16.4.
The whole point of ConcurrentHashMap, is not to lock around access/modifications to it. Extra locking just adds overhead.
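As an illustration of the point above, compound check-then-act logic is better expressed with ConcurrentHashMap's own atomic methods than with an external lock (the counter example below is mine):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Instead of wrapping a ConcurrentHashMap in an external ReentrantLock,
// use its atomic compound operations for check-then-act logic.
public class AtomicCompoundOps {
    public static void main(String[] args) {
        ConcurrentMap<String, Integer> counters = new ConcurrentHashMap<>();

        // Atomic "insert if missing" - no external lock required.
        counters.putIfAbsent("requests", 0);

        // Atomic read-modify-write of a single entry (Java 8+).
        counters.compute("requests", (key, value) -> value == null ? 1 : value + 1);

        System.out.println(counters.get("requests")); // prints 1
    }
}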

Performance ConcurrentHashmap vs HashMap

How does the performance of ConcurrentHashMap compare to HashMap, especially for the .get() operation (I'm especially interested in the case of only a few items, in the range of maybe 0-5000)?
Is there any reason not to use ConcurrentHashMap instead of HashMap?
(I know that null values aren't allowed)
Update
just to clarify: obviously the performance in the case of actual concurrent access will suffer, but how does the performance compare when there is no concurrent access?
I was really surprised to find this topic so old and yet no one has provided any tests for this case. Using ScalaMeter I created tests of add, get and remove for both HashMap and ConcurrentHashMap in two scenarios:
using a single thread
using as many threads as I have cores available. Note that because HashMap is not thread-safe, I simply created a separate HashMap for each thread, but used one shared ConcurrentHashMap.
Code is available on my repo.
The results are shown in charts (not reproduced here): the X axis (size) is the number of elements written to the map(s), and the Y axis (value) is the time in milliseconds.
The summary
If you want to operate on your data as fast as possible, use all the threads available. That seems obvious: each thread has 1/nth of the full work to do.
If you choose single-threaded access, use HashMap; it is simply faster. For the add method it is even as much as 3x more efficient. Only get is faster on ConcurrentHashMap, but not by much.
When operating on a ConcurrentHashMap with many threads, it is similarly effective to operating on separate HashMaps for each thread. So there is no need to partition your data into different structures.
To sum up, the performance of ConcurrentHashMap is worse when you use it with a single thread, but adding more threads to do the work will definitely speed up the process.
Testing platform
AMD FX6100, 16GB Ram
Xubuntu 16.04, Oracle JDK 8 update 91, Scala 2.11.8
Thread safety is a complex question. If you want to make an object thread safe, do it consciously, and document that choice. People who use your class will thank you if it is thread safe when it simplifies their usage, but they will curse you if an object that once was thread safe becomes not so in a future version. Thread safety, while really nice, is not just for Christmas!
So now to your question:
ConcurrentHashMap (at least in Sun's current implementation) works by dividing the underlying map into a number of separate buckets. Getting an element does not require any locking per se, but it does use atomic/volatile operations, which implies a memory barrier (potentially very costly, and interfering with other possible optimisations).
Even if all the overhead of atomic operations can be eliminated by the JIT compiler in a single-threaded case, there is still the overhead of deciding which of the buckets to look in - admittedly this is a relatively quick calculation, but nevertheless, it is impossible to eliminate.
As for deciding which implementation to use, the choice is probably simple.
If this is a static field, you almost certainly want to use ConcurrentHashMap, unless testing shows this is a real performance killer. Your class has different thread safety expectations from the instances of that class.
If this is a local variable, then chances are a HashMap is sufficient - unless you know that references to the object can leak out to another thread. By coding to the Map interface, you allow yourself to change it easily later if you discover a problem.
If this is an instance field, and the class hasn't been designed to be thread safe, then document it as not thread safe, and use a HashMap.
If you know that this instance field is the only reason the class isn't thread safe, and are willing to live with the restrictions that promising thread safety implies, then use ConcurrentHashMap, unless testing shows significant performance implications. In that case, you might consider allowing a user of the class to choose a thread safe version of the object somehow, perhaps by using a different factory method.
In either case, document the class as being thread safe (or conditionally thread safe) so people who use your class know they can use objects across multiple threads, and people who edit your class know that they must maintain thread safety in future.
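A hypothetical sketch of the factory idea mentioned above, coding only against the Map interface (the Cache class is purely illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Callers pick the thread-safety level they need; the class itself only
// ever depends on the Map interface, so the choice is easy to change later.
public class Cache<K, V> {
    private final Map<K, V> entries;

    private Cache(Map<K, V> entries) {
        this.entries = entries;
    }

    public static <K, V> Cache<K, V> singleThreaded() {
        return new Cache<>(new HashMap<>());
    }

    public static <K, V> Cache<K, V> threadSafe() {
        return new Cache<>(new ConcurrentHashMap<>());
    }

    public void put(K key, V value) { entries.put(key, value); }

    public V get(K key) { return entries.get(key); }
}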
I would recommend you measure it, since (for one reason) there may be some dependence on the hashing distribution of the particular objects you're storing.
The standard HashMap provides no concurrency protection, whereas the ConcurrentHashMap does. Before it was available, you could wrap a HashMap to get thread-safe access, but this was coarse-grained locking and meant all concurrent access got serialised, which could really impact performance.
The ConcurrentHashMap uses lock striping and only locks the portion of the map affected by a particular lock. If you're running on a modern VM such as HotSpot, the VM will try to use lock biasing, coarsening and elision where possible, so you'll only pay the penalty for the locks when you actually need them.
In summary, if your map is going to be accessed by concurrent threads and you need to guarantee a consistent view of its state, use the ConcurrentHashMap.
In the case of a 1000-element hash table, using 10 locks for the whole table saves close to half the time when 10000 threads are inserting and 10000 threads are deleting from it.
The interesting run time difference is here
Always use a concurrent data structure, except when the downside of striping (mentioned below) becomes a frequent operation. In that case you will have to acquire all the locks; I read that the best way to do this is by recursion.
Lock striping is useful when there is a way of breaking a high-contention lock into multiple locks without compromising data integrity. Whether this is possible should take some thought and is not always the case; the data structure is also a contributing factor to the decision. So if we use a large array for implementing a hash table, using a single lock for the entire hash table to synchronize it will lead to threads accessing the data structure sequentially. If they are accessing the same location on the hash table then this is necessary, but what if they are accessing the two extremes of the table?
The downside of lock striping is that it is difficult to get the state of the data structure as a whole. For example, getting the size of the table, or trying to list/enumerate the whole table, may be cumbersome, since we need to acquire all of the striped locks.
What answer are you expecting here?
It is obviously going to depend on the number of reads happening at the same time as writes and how long a normal map must be "locked" on a write operation in your app (and whether you would make use of the putIfAbsent method on ConcurrentMap). Any benchmark is going to be largely meaningless.
It's not clear what you mean. If you need thread safety, you have almost no choice - only ConcurrentHashMap. And it definitely has performance/memory penalties on a get() call - access to volatile variables, and a lock if you're unlucky.
Of course a Map without any locking wins against one with thread-safe behaviour, which needs more work.
The point of the concurrent one is to be thread safe without using synchronized, so as to be faster than Hashtable.
The same graphs would be very interesting for ConcurrentHashMap vs Hashtable (which is synchronized).

Best way to control concurrent access to Java collections

Should I use old synchronized Vector collection, ArrayList with synchronized access or Collections.synchronizedList or some other solution for concurrent access?
I don't see my question in Related Questions nor in my search (Make your collections thread-safe? isn't the same).
Recently, I had to make kind of unit tests on GUI parts of our application (basically using API to create frames, add objects, etc.).
Because these operations are called much faster than by a user, this showed a number of issues with methods trying to access resources not yet created or already deleted.
A particular issue, happening in the EDT, came from walking a linked list of views while altering it in another thread (getting a ConcurrentModificationException among other problems).
Don't ask me why it was a linked list instead of a simple array list (even less as we have in general 0 or 1 view inside...), so I took the more common ArrayList in my question (as it has an older cousin).
Anyway, not super familiar with concurrency issues, I looked up a bit of info, and wondered what to choose between the old (and probably obsolete) Vector (which has synchronized operations by design), ArrayList with a synchronized (myList) { } around critical sections (add/remove/walk operations) or using a list returned by Collections.synchronizedList (not even sure how to use the latter).
I finally chose the second option, because another design mistake was to expose the object (getViewList() method...) instead of providing mechanisms to use it.
But what are the pros and cons of the other approaches?
[EDIT] Lot of good advices here, hard to select one. I will choose the more detailed and providing links/food for thoughts... :-) I like Darron's one too.
To summarize:
As I suspected, Vector (and its evil twin, Hashtable as well, probably) is largely obsolete. I have seen people say its old design isn't as good as the newer collections', beyond the slowness of synchronization being forced even in a single-threaded environment. If we keep it around, it is mostly because older libraries (and parts of the Java API) still use it.
Unlike what I thought, Collections.synchronizedXxxx aren't more modern than Vector (they appear to be contemporary with the Collections framework, i.e. Java 1.2!) and are not better, actually. Good to know. In short, I should avoid them as well.
Manual synchronization seems to be a good solution after all. There might be performance issues, but in my case it isn't critical: operations done on user actions, small collection, no frequent use.
The java.util.concurrent package is worth keeping in mind, particularly the CopyOnWrite collections.
I hope I got it right... :-)
Vector and the List returned by Collections.synchronizedList() are morally the same thing. I would consider Vector to be effectively (but not actually) deprecated and always prefer a synchronized List instead. The one exception would be old APIs (particularly ones in the JDK) that require a Vector.
Using a naked ArrayList and synchronizing independently gives you the opportunity to more precisely tune your synchronization (either by including additional actions in the mutually exclusive block or by putting together multiple calls to the List in one atomic action). The down side is that it is possible to write code that accesses the naked ArrayList outside synchronization, which is broken.
Another option you might want to consider is a CopyOnWriteArrayList, which will give you thread safety as in Vector and synchronized ArrayList but also iterators that will not throw ConcurrentModificationException as they are working off of a non-live snapshot of the data.
You might find some of these recent blogs on these topics interesting:
Java Concurrency Bugs #3 - atomic + atomic != atomic
Java Concurrency Bugs #4: ConcurrentModificationException
CopyOnWriteArrayList concurrency fun
I strongly recommend the book "Java Concurrency in Practice".
Each of the choices has advantages/disadvantages:
Vector - considered "obsolete". It may get less attention and bug fixes than more mainstream collections.
Your own synchronization blocks - Very easy to get incorrect. Often gives poorer performance than the choices below.
Collections.synchronizedList() - Choice 2 done by experts. This is still not complete, because of multi-step operations that need to be atomic (get/modify/set or iteration).
New classes from java.util.concurrent - Often have more efficient algorithms than choice 3. Similar caveats about multi-step operations apply but tools to help you are often provided.
I can't think of a good reason to ever prefer Vector over ArrayList. List operations on a Vector are synchronized, meaning multiple threads can alter it safely. And like you say, ArrayList's operations can be synchronized using Collections.synchronizedList.
Remember that even when using synchronized lists, you will still encounter ConcurrentModificationExceptions if iterating over the collection while it's being modified by another thread. So it's important to coordinate that access externally (your second option).
One common technique used to avoid the iteration problem is to iterate over a copy of the collection (note that Collections.unmodifiableList only provides a read-only view of the original list, so make an actual copy first, e.g. new ArrayList<>(list)).
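For reference, a minimal sketch of both techniques with Collections.synchronizedList (the example values are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Iterating a Collections.synchronizedList still requires holding the list's
// own lock for the whole traversal, as its Javadoc spells out.
public class SynchronizedListIteration {
    public static void main(String[] args) {
        List<String> list = Collections.synchronizedList(new ArrayList<>());
        list.add("a");
        list.add("b");

        synchronized (list) {               // must lock on the wrapper itself
            for (String s : list) {
                System.out.println(s);
            }
        }

        // Alternative: iterate a defensive copy and release the lock quickly.
        List<String> copy;
        synchronized (list) {
            copy = new ArrayList<>(list);
        }
        for (String s : copy) {
            System.out.println(s);
        }
    }
}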
I would always go to the java.util.concurrent package (http://java.sun.com/javase/6/docs/api/java/util/concurrent/package-summary.html) first and see if there is any suitable collection. The java.util.concurrent.CopyOnWriteArrayList class is good if you are making few changes to the list but doing many iterations.
Also, I don't believe Vector and Collections.synchronizedList will prevent ConcurrentModificationException.
If you don't find a suitable collection then you'd have to do your own synchronization, and if you don't want to hold a lock while iterating you may want to consider making a copy and iterating the copy.
I don't believe that the Iterator returned by a Vector is in any way synchronized - meaning that Vector can't guarantee (on its own) that one thread isn't modifying the underlying collection while another thread is iterating through it.
I believe that to ensure that iterating is thread-safe, you will have to handle the synchronization on your own. Assuming that the same Vector/List/object is being shared by multiple threads (and thus leading to your issues), can't you just synchronize on that object itself?
The safest solution is to avoid concurrent access to shared data altogether. Instead of having non-EDT threads operate on the same data, have them call SwingUtilities.invokeLater() with a Runnable that performs the modifications.
Seriously, shared-data concurrency is a viper's nest where you'll never know if there isn't another race condition or deadlock hiding somewhere, biding its time to bite you in the ass at the worst possible occasion.
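A minimal sketch of that EDT-confinement approach (class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;
import javax.swing.SwingUtilities;

// Background threads never touch the shared list directly; they post a
// task to the EDT, which is the only thread that ever modifies the list.
public class EdtConfinement {
    private final List<String> views = new ArrayList<>(); // only ever touched on the EDT

    public void addViewFromAnyThread(String view) {
        SwingUtilities.invokeLater(() -> views.add(view));
    }
}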
CopyOnWriteArrayList is worth looking at. It is designed for a list that is mostly read from. Every write causes it to create a new array behind the covers, so threads iterating across the list will not get a ConcurrentModificationException.
