Iterator invalidation rules in Java

I'm researching iterator invalidation rules in Java, but I couldn't find documentation comparable to this one for C++. Everything I found for Java is more generic, like this one. Is there documentation that I could follow?

The Java "Collections Framework Overview" documentation says:
The general-purpose implementations support all of the optional operations in the collection interfaces and have no restrictions on the elements they may contain. They are unsynchronized, but the Collections class contains static factories called synchronization wrappers that can be used to add synchronization to many unsynchronized collections. All of the new implementations have fail-fast iterators, which detect invalid concurrent modification, and fail quickly and cleanly (rather than behaving erratically).
Java also has thread-safe concurrent collection implementations. They are part of the java.util.concurrent package, whose documentation says:
Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators and Spliterators provide weakly consistent rather than fast-fail traversal:
they may proceed concurrently with other operations
they will never throw ConcurrentModificationException
they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.
For example, the ConcurrentHashMap documentation says:
Similarly, Iterators, Spliterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration. They do not throw ConcurrentModificationException. However, iterators are designed to be used by only one thread at a time. Bear in mind that the results of aggregate status methods including size, isEmpty, and containsValue are typically useful only when a map is not undergoing concurrent updates in other threads. Otherwise the results of these methods reflect transient states that may be adequate for monitoring or estimation purposes, but not for program control.
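So the short answer is: if you want to iterate over a collection while another thread may change it, use one of the concurrent collection implementations. Their iterators are never invalidated in the C++ sense.
Alternatively, use the non-thread-safe collections and treat a ConcurrentModificationException as a sign that the collection was modified during iteration. Here, too, the iterator is never invalidated in the C++ sense: a detected modification makes it throw rather than behave in an undefined way. Note that the HashMap and ArrayList documentation describes fail-fast detection as best-effort, so the exception should be used to find bugs, not relied on for program correctness.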
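To make the fail-fast behaviour concrete, here is a minimal single-threaded sketch. The class and method names (FailFastDemo, directRemoveThrows, removeEvensViaIterator) are made up for illustration; only ArrayList, Iterator and ConcurrentModificationException come from the JDK. It shows that no second thread is needed to trigger the exception, and that removal through the iterator itself is allowed.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

class FailFastDemo {

    // Removes an element through the list itself while a for-each loop
    // (which uses an Iterator internally) is still running.
    static boolean directRemoveThrows() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3, 4));
        try {
            for (Integer n : list) {
                if (n == 1) {
                    list.remove(n); // remove(Object): bypasses the iterator
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // detected by the iterator on its next step
        }
    }

    // Removes elements through the iterator, which keeps its own
    // bookkeeping in step with the list.
    static List<Integer> removeEvensViaIterator() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3, 4));
        for (Iterator<Integer> it = list.iterator(); it.hasNext();) {
            if (it.next() % 2 == 0) {
                it.remove();
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println("direct remove throws: " + directRemoveThrows());
        System.out.println("after iterator removal: " + removeEvensViaIterator());
    }
}
```

In this particular run the first method reports the exception and the second leaves the odd elements in the list. In general the detection is best-effort, so code should not depend on the exception being thrown.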

Related

Weakly consistent iterator by ConcurrentHashMap

Java Concurrency in Practice mentions that:
The iterators returned by ConcurrentHashMap are weakly consistent rather than fail-fast. A weakly consistent iterator can tolerate concurrent modification, traverses elements as they existed when the iterator was constructed, and may (but is not guaranteed to) reflect modifications to the collection after the construction of the iterator.
How does making the iterator weakly consistent (or fail-safe) help in a concurrent environment, given that the state of the ConcurrentHashMap will still be modified? The only difference seems to be that it will not throw ConcurrentModificationException.
And why do the ordinary collections return fail-fast iterators if fail-safe iterators are better for concurrency?

Correctness in your particular case
Please keep in mind that a fail-fast iterator iterates over the original collection and throws as soon as it notices a structural modification.
In contrast, a "fail-safe" iterator never throws ConcurrentModificationException. There are two kinds. A snapshot iterator (for example, CopyOnWriteArrayList's) iterates over a copy of the backing array taken when the iterator was created, so later changes to the collection go unnoticed. A weakly consistent iterator (for example, ConcurrentHashMap's) walks the live structure without copying it and may or may not reflect changes made after it was created.
To answer your questions:
Using a fail-safe iterator helps concurrency because readers do not have to lock the whole collection. The collection can be modified underneath while the reading happens. The drawback is that the reading thread may see stale state: exactly the snapshot taken at iterator creation for a snapshot iterator, or a mix of old and new entries for a weakly consistent one.
If that limitation is not acceptable for your use case (your readers must always see one consistent, current state of the collection), you have to use a fail-fast iterator and control concurrent access to the collection more tightly, for example with a lock.
As you can see, it's a trade-off between correctness for your use case and speed.
ConcurrentHashMap
ConcurrentHashMap (CHM) uses several tricks to increase concurrency of access. The description below applies to the implementation in Java 7 and earlier; since Java 8 the segments are gone, updates lock individual bins, and concurrencyLevel is only a sizing hint.
Firstly, CHM is actually a grouping of multiple maps: each entry is stored in one of a number of segments, each of which is itself a hash table that can be read concurrently (read methods do not block).
The number of segments is set by the last argument of the three-argument constructor, called concurrencyLevel (default 16). The number of segments determines the number of concurrent writers across the whole of the data. An additional internal hashing step spreads the entries evenly between the segments.
Each HashEntry's value is volatile, which ensures fine-grained consistency for contended modifications and subsequent reads; each read reflects the most recently completed update.
Iterators and Enumerations are fail-safe, reflecting the state at some point at or since the creation of the iterator/enumeration; this allows simultaneous reads and modifications at the cost of reduced consistency.
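As a rough illustration of what "reflecting the state at some point since the creation of the iterator" means in practice, the sketch below inserts into a ConcurrentHashMap while iterating over it from the same thread. WeaklyConsistentDemo and iterateWhileInserting are hypothetical names chosen for this example. The only things the documentation lets us rely on are that no ConcurrentModificationException is thrown and that the entries present when the iterator was created are each visited once; whether the newly inserted keys show up is deliberately left unspecified.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

class WeaklyConsistentDemo {

    // Iterates over the key set while inserting new keys from the same
    // thread. Returns the keys the iterator actually visited.
    static Set<Integer> iterateWhileInserting(Map<Integer, String> map) {
        Set<Integer> visited = new TreeSet<>();
        for (Integer key : map.keySet()) {
            visited.add(key);
            if (key < 100) {
                // Only original keys trigger an insert, so the loop ends.
                map.put(key + 100, "added during iteration");
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new ConcurrentHashMap<>();
        map.put(1, "a");
        map.put(2, "b");
        map.put(3, "c");

        Set<Integer> visited = iterateWhileInserting(map);

        // The original keys 1, 2, 3 are always visited; 101, 102, 103
        // may or may not be, depending on where they land in the table.
        System.out.println("visited: " + visited);
        System.out.println("final size: " + map.size());
    }
}
```

Running the same loop against a plain HashMap would normally end in a ConcurrentModificationException on the second step.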

TL;DR: Because locking.
If you want a consistent iterator, then you have to lock all modifications to the Map - this is a massive penalty in a concurrent environment.
You can of course do this manually if that is what you want, but iterating over a Map is not its purpose so the default behaviour allows for concurrent writes while iterating.
The same argument does not apply for normal collections, which are only (allowed to be) accessed by a single thread. Iteration over an ArrayList is expected to be consistent, so the fail fast iterators enforce consistency.

First of all, the iterators of concurrent collections are not fail-safe because they do not have failure modes which they could somehow handle with some kind of emergency procedure. They simply do not fail.
The iterators of the non-concurrent collections are fail-fast because, for performance reasons, they are designed in a way that does not allow the internal structure of the collection they iterate over to be modified. For example, a HashMap's iterator would not know how to continue iterating after the rehashing that happens when the HashMap gets resized.
That means they would not just fail because other threads access them, they would also fail if the current thread performs a modification that invalidates the assumptions of the iterator.
Instead of ignoring those troublesome modifications and returning unpredictable and corrupted results those collections instead try to track modifications and throw an exception during iteration to inform the programmer that something is wrong. This is called fail-fast.
Those fail-fast mechanisms are not thread-safe either. If the illegal modification happens in a different thread rather than the current one, it is no longer guaranteed to be detected. In that case fail-fast can only be thought of as a best-effort failure detection mechanism.
On the other hand concurrent collections must be designed in a manner that can deal with multiple writes and reads at the same time and the underlying structure changing constantly.
So iterators can't always assume that the underlying structure is never modified during iteration.
Instead they're designed to provide weaker guarantees, such as either iterating over outdated data or maybe also showing some but not all updates that happened after the creation of the iterator. Which also means that they might return outdated data when they are modified during iteration within a single thread, which might be somewhat counter-intuitive for a programmer as one would usually expect immediate visibility of modifications within a single thread.
Examples:
HashMap: best-effort fail-fast iterator
  - iterator supports removal
  - structural modification from the same thread, such as clear()ing the Map during iteration: guaranteed to throw a ConcurrentModificationException on the next iterator step
  - structural modification from a different thread during iteration: iterator usually throws an exception, but might also cause inconsistent, unpredictable behavior
CopyOnWriteArrayList: snapshot iterator
  - iterator does not support removal
  - iterator shows a view on the items frozen at the time it was created
  - collection can be modified by any thread, including the current one, during iteration without causing an exception, but this has no effect on the items visited by the iterator
  - clear()ing the list will not stop iteration
  - iterator never throws CME
ConcurrentSkipListMap: weakly consistent iterator
  - iterator supports removal, but may cause surprising behavior since it's solely based on Map keys, not the current value
  - iterator may see updates that happened since its creation but is not guaranteed to. That means, for example, that clear()ing the Map may or may not stop iteration, and removing entries may or may not stop them from showing up during the remaining iteration
  - iterator never throws CME
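The CopyOnWriteArrayList entries above are easy to check with a few lines. This is only a sketch; SnapshotIteratorDemo and its two methods are names invented for the example. It clears the list in the middle of a for-each loop and then tries to remove through the iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class SnapshotIteratorDemo {

    // Clears the list from inside the loop. The iterator keeps walking
    // the array it captured when it was created.
    static List<String> visitWhileClearing(CopyOnWriteArrayList<String> list) {
        List<String> visited = new ArrayList<>();
        for (String s : list) {
            visited.add(s);
            list.clear();
        }
        return visited;
    }

    // The snapshot iterator is read-only: remove() is not supported.
    // Expects a non-empty list.
    static boolean iteratorRemoveUnsupported(CopyOnWriteArrayList<String> list) {
        Iterator<String> it = list.iterator();
        it.next();
        try {
            it.remove();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        CopyOnWriteArrayList<String> list =
                new CopyOnWriteArrayList<>(List.of("a", "b", "c"));
        System.out.println("remove unsupported: " + iteratorRemoveUnsupported(list));
        System.out.println("visited: " + visitWhileClearing(list));
        System.out.println("list empty afterwards: " + list.isEmpty());
    }
}
```

All three elements are visited even though the list is already empty after the first step, which is the "frozen view" behaviour described in the list.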

Ways to avoid Iterator ConcurrentModificationException

As far as I know, there are two ways to avoid ConcurrentModificationException while one thread iterates over a collection and another thread modifies it:
Client-side locking: lock the collection during the iteration. Other threads that need to access the collection block until the iteration is complete.
Thread confinement: clone the collection and iterate over the copy.
Are there any other alternatives?
The first way is obviously undesirable and performs poorly: if the collection is large, other threads could wait for a long time. I am not sure about the second way: since we clone the collection and iterate over the copy, the copy becomes stale if other threads modify the original, right? Does that mean we need to start over by cloning and iterating again once the original is modified?
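For reference, the two approaches from the question could look roughly like this for a list wrapped with Collections.synchronizedList. IterationStrategies, sumUnderLock and snapshot are hypothetical names; the one JDK-specific fact used is that the synchronized wrappers document that callers must synchronize on the wrapper object while iterating.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class IterationStrategies {

    // Option 1: client-side locking. Writers block for the whole loop.
    static int sumUnderLock(List<Integer> syncList) {
        int sum = 0;
        synchronized (syncList) { // the wrapper itself is the lock
            for (int n : syncList) {
                sum += n;
            }
        }
        return sum;
    }

    // Option 2: copy while holding the lock, then iterate the private
    // copy without blocking anyone. The copy can become stale.
    static List<Integer> snapshot(List<Integer> syncList) {
        synchronized (syncList) {
            return new ArrayList<>(syncList);
        }
    }

    public static void main(String[] args) {
        List<Integer> shared =
                Collections.synchronizedList(new ArrayList<>(List.of(1, 2, 3)));

        System.out.println("sum under lock: " + sumUnderLock(shared));

        List<Integer> copy = snapshot(shared);
        shared.add(4); // does not affect the copy
        System.out.println("copy: " + copy + ", shared: " + shared);
    }
}
```

Whether a stale copy is acceptable depends on the use case. If it is not, the copy approach does not help and the loop has to run under the lock.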

I am wondering are there any other alternatives ?
Use one of the concurrent collections, which don't throw this exception. Instead they provide weak consistency, i.e. an added or deleted element may or may not appear while iterating.
http://docs.oracle.com/javase/tutorial/essential/concurrency/collections.html
The java.util.concurrent package includes a number of additions to the Java Collections Framework. These are most easily categorized by the collection interfaces provided:
BlockingQueue defines a first-in-first-out data structure that blocks or times out when you attempt to add to a full queue, or retrieve from an empty queue.
ConcurrentMap is a subinterface of java.util.Map that defines useful atomic operations. These operations remove or replace a key-value pair only if the key is present, or add a key-value pair only if the key is absent. Making these operations atomic helps avoid synchronization. The standard general-purpose implementation of ConcurrentMap is ConcurrentHashMap, which is a concurrent analog of HashMap.
ConcurrentNavigableMap is a subinterface of ConcurrentMap that supports approximate matches. The standard general-purpose implementation of ConcurrentNavigableMap is ConcurrentSkipListMap, which is a concurrent analog of TreeMap.
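The atomic operations that the tutorial text attributes to ConcurrentMap can be sketched as follows. AtomicMapOps and the "stock" map are made up for the example; putIfAbsent, replace(key, oldValue, newValue) and remove(key, value) are the actual ConcurrentMap methods.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class AtomicMapOps {

    static ConcurrentMap<String, Integer> demo() {
        ConcurrentMap<String, Integer> stock = new ConcurrentHashMap<>();

        stock.putIfAbsent("apple", 1);  // key absent: inserted
        stock.putIfAbsent("apple", 99); // key present: ignored

        stock.replace("apple", 1, 2);   // current value is 1: replaced
        stock.replace("apple", 1, 3);   // current value is 2: no change

        stock.put("pear", 5);
        stock.remove("pear", 4);        // value differs: not removed
        stock.remove("pear", 5);        // value matches: removed

        return stock;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each call is a single atomic check-and-act step, so no external lock is needed around it, which is the point the quoted text makes about avoiding synchronization.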

You could use classes from java.util.concurrent, such as CopyOnWriteArrayList.

How does ConcurrentHashMap work internally?

I was reading the official Oracle documentation about concurrency in Java and I was wondering what the difference is between a Collection returned by
public static <T> Collection<T> synchronizedCollection(Collection<T> c);
and, for example, a ConcurrentHashMap. I'm assuming that I use synchronizedCollection(Collection<T> c) on a HashMap. I know that, in general, a synchronized collection is essentially just a decorator for my HashMap, so it is obvious that a ConcurrentHashMap has something different in its internals. Do you have some information about those implementation details?
Edit: I realized that the source code is publicly available:
ConcurrentHashMap.java

I would read the source of ConcurrentHashMap, as it is rather complicated in the details. In short, it:
has multiple partitions that can be locked independently (16 by default; this describes the segment-based implementation used before Java 8)
uses java.util.concurrent locks for thread safety instead of synchronized
has thread-safe iterators; synchronizedCollection's iterators are not thread-safe
does not expose its internal locks; synchronizedCollection does

ConcurrentHashMap is very similar to the java.util.Hashtable class, except that it offers better concurrency than Hashtable or synchronizedMap does. ConcurrentHashMap does not lock the Map while you are reading from it. Additionally, ConcurrentHashMap does not lock the entire Map when writing to it; internally, it locks only the part of the Map that is being written to.
Another difference is that ConcurrentHashMap does not throw ConcurrentModificationException if it is changed while being iterated, whereas a synchronizedMap may. The Iterator is still not designed to be used by more than one thread, though.

This is the article that helped me understand it: Why ConcurrentHashMap is better than Hashtable and just as good as a HashMap
Hashtables offer concurrent access to their entries, with a small caveat: the entire map is locked to perform any sort of operation. While this overhead is negligible in a web application under normal load, under heavy load it can lead to delayed response times and overtaxing of your server for no good reason.
This is where ConcurrentHashMaps step in. They offer all the features of Hashtable with a performance almost as good as a HashMap. ConcurrentHashMaps accomplish this by a very simple mechanism. Instead of a map-wide lock, the collection maintains a list of 16 locks by default, each of which is used to guard (or lock on) a single bucket of the map. This effectively means that 16 threads can modify the collection at a single time (as long as they're all working on different buckets). In fact, there is no operation performed by this collection that locks the entire map. The concurrency level of the collection, the number of threads that can modify it at the same time without blocking, can be increased. However, a higher number means more overhead of maintaining this list of locks.

The "scalability issues" for Hashtable are present in exactly the same way in Collections.synchronizedMap(Map) - they use very simple synchronization, which means that only one thread can access the map at the same time.
This is not much of an issue when you have simple inserts and lookups (unless you do it extremely intensively), but becomes a big problem when you need to iterate over the entire Map, which can take a long time for a large Map - while one thread does that, all others have to wait if they want to insert or lookup anything.
ConcurrentHashMap uses very sophisticated techniques to reduce the need for synchronization and to allow parallel read access by multiple threads without synchronization. More importantly, it provides an Iterator that requires no synchronization and even allows the Map to be modified during iteration (though it makes no guarantees whether or not elements that were inserted during iteration will be returned).

The object returned by synchronizedCollection() is a wrapper whose methods all synchronize on a single lock (by default the wrapper itself), so all concurrent operations on it are serialized. ConcurrentHashMap is a truly concurrent container with fine-grained locking optimized to keep contention as low as possible. Have a look at the source code to see how it works inside.

ConcurrentHashMap implements ConcurrentMap, the interface that defines its atomic operations such as putIfAbsent.
Its iterators are designed to be used by only one thread at a time, although other threads may modify the map while that thread iterates.
This map is widely used in concurrent code.

How to get a fixed state iterator for a set/map without cloning overheads

I'm looking to avoid a ConcurrentModificationException when iterating over an expanding set (there are no removes) while add operations are performed by different threads.
I considered cloning the collection before iterating, but this solution doesn't scale very well as the set becomes large. Synchronizing doesn't work because the collection is used in tons of places and the code is pretty old. Short of a massive refactoring, the only option is to change the set implementation.
I'm wondering if there's a Java implementation whose iterator returns a snapshot of the collection's state (which is okay for my functionality) but avoids most of the cost of cloning. I checked out CopyOnWriteArrayList, but it doesn't fit the bill, mainly because it is a list.

The java.util.concurrent package has everything you need.
The classes there are like the java.util collections, but are highly optimized to cater for concurrent access, interestingly addressing specifically your comment:
the iterator returns a snapshot state of the collection
Don't reinvent the wheel :)

Wondering if there's a Java implementation where the iterator returns a snapshot state of the collection
Yes, there is. Unlike the synchronized collections made available via the Collections.synchronizedXxx() methods, the ConcurrentXxx classes in the java.util.concurrent package allow for this scenario. The concurrent collection classes allow multiple threads to access the collection at the same time, without the need to synchronize on a lock.
Depending on the exact nature of your problem, a ConcurrentHashMap can be used. The relevant section of the class documentation is:
Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration. They do not throw ConcurrentModificationException. However, iterators are designed to be used by only one thread at a time.
Note the last sentence carefully.
Also, remember that the iterators do not return a consistent snapshot of the collection. The iterators of most of the collection views have the following property:
The view's iterator is a "weakly consistent" iterator that will never throw ConcurrentModificationException, and guarantees to traverse elements as they existed upon construction of the iterator, and may (but is not guaranteed to) reflect any modifications subsequent to construction.
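Since the question asks for a Set rather than a List or Map, here is a small sketch of two JDK options that fit: CopyOnWriteArraySet (true snapshot iterators, but every add copies the backing array) and ConcurrentHashMap.newKeySet() (Java 8+, weakly consistent iterators, no copying). ConcurrentSetChoices and iterateWhileGrowing are illustrative names only, and which of the two is appropriate for a large, frequently growing set is an assumption the reader has to check against their own workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

class ConcurrentSetChoices {

    // Iterates the set and adds one new element per original element.
    // Returns what the iterator actually saw.
    static List<Integer> iterateWhileGrowing(Set<Integer> set) {
        List<Integer> seen = new ArrayList<>();
        for (Integer n : set) {
            seen.add(n);
            if (n < 1000) {
                set.add(n + 1000);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Snapshot iterator: sees exactly the elements present at creation.
        Set<Integer> cow = new CopyOnWriteArraySet<>(List.of(1, 2, 3));
        System.out.println("copy-on-write saw: " + iterateWhileGrowing(cow));

        // Weakly consistent iterator: sees the original elements and
        // possibly some of the ones added during iteration.
        Set<Integer> keySet = ConcurrentHashMap.newKeySet();
        keySet.addAll(List.of(1, 2, 3));
        System.out.println("key set saw: " + iterateWhileGrowing(keySet));
    }
}
```

Neither variant throws ConcurrentModificationException. The copy-on-write set pays for its snapshot on every write, so it suits sets that are read far more often than they grow.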
Related questions
Is iterating ConcurrentHashMap values thread safe?
Java ConcurrentHashMap not thread safe.. wth?

Best way to control concurrent access to Java collections

Should I use old synchronized Vector collection, ArrayList with synchronized access or Collections.synchronizedList or some other solution for concurrent access?
I don't see my question in Related Questions nor in my search (Make your collections thread-safe? isn't the same).
Recently, I had to write a kind of unit test for the GUI parts of our application (basically using the API to create frames, add objects, etc.).
Because these operations are called much faster than a user could trigger them, this exposed a number of issues with methods trying to access resources not yet created or already deleted.
A particular issue, happening in the EDT, came from walking a linked list of views while altering it in another thread (getting a ConcurrentModificationException among other problems).
Don't ask me why it was a linked list instead of a simple array list (even less as we have in general 0 or 1 view inside...), so I took the more common ArrayList in my question (as it has an older cousin).
Anyway, not super familiar with concurrency issues, I looked up a bit of info, and wondered what to choose between the old (and probably obsolete) Vector (which has synchronized operations by design), ArrayList with a synchronized (myList) { } around critical sections (add/remove/walk operations) or using a list returned by Collections.synchronizedList (not even sure how to use the latter).
I finally chose the second option, because another design mistake was to expose the object (getViewList() method...) instead of providing mechanisms to use it.
But what are the pros and cons of the other approaches?
[EDIT] Lots of good advice here, hard to select one answer. I will choose the most detailed one, which provides links and food for thought... :-) I like Darron's too.
To summarize:
As I suspected, Vector (and probably its evil twin Hashtable as well) is largely obsolete. I have seen people say its old design isn't as good as that of the newer collections, beyond the slowness of synchronization forced even in a single-threaded environment. If it is kept around, it is mostly because older libraries (and parts of the Java API) still use it.
Contrary to what I thought, the Collections.synchronizedXxx wrappers aren't more modern than Vector (they appear to be contemporary with the Collections Framework, i.e. Java 1.2!) and are not actually better. Good to know. In short, I should avoid them as well.
Manual synchronization seems to be a good solution after all. There might be performance issues, but in my case that isn't critical: operations triggered by user actions, a small collection, infrequent use.
The java.util.concurrent package is worth keeping in mind, particularly the CopyOnWrite classes.
I hope I got it right... :-)

Vector and the List returned by Collections.synchronizedList() are morally the same thing. I would consider Vector to be effectively (but not actually) deprecated and always prefer a synchronized List instead. The one exception would be old APIs (particularly ones in the JDK) that require a Vector.
Using a naked ArrayList and synchronizing independently gives you the opportunity to more precisely tune your synchronization (either by including additional actions in the mutually exclusive block or by putting together multiple calls to the List in one atomic action). The down side is that it is possible to write code that accesses the naked ArrayList outside synchronization, which is broken.
Another option you might want to consider is a CopyOnWriteArrayList, which will give you thread safety as in Vector and synchronized ArrayList but also iterators that will not throw ConcurrentModificationException as they are working off of a non-live snapshot of the data.
You might find some of these recent blogs on these topics interesting:
Java Concurrency Bugs #3 - atomic + atomic != atomic
Java Concurrency Bugs #4: ConcurrentModificationException
CopyOnWriteArrayList concurrency fun

I strongly recommend the book "Java Concurrency in Practice".
Each of the choices has advantages/disadvantages:
Vector - considered "obsolete". It may get less attention and bug fixes than more mainstream collections.
Your own synchronization blocks - Very easy to get incorrect. Often gives poorer performance than the choices below.
Collections.synchronizedList() - Choice 2 done by experts. This is still not complete, because of multi-step operations that need to be atomic (get/modify/set or iteration).
New classes from java.util.concurrent - Often have more efficient algorithms than choice 3. Similar caveats about multi-step operations apply but tools to help you are often provided.
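The caveat about multi-step operations is worth spelling out. In the sketch below (CompoundActions, addIfAbsentRacy, addIfAbsent and runThreads are names invented for illustration), every individual call on the synchronized list is thread-safe, yet the check-then-add pair is only atomic when the caller holds the list's lock across both calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CompoundActions {

    // contains() and add() are each synchronized, but the pair is not:
    // another thread can add the same value between the two calls.
    static <T> boolean addIfAbsentRacy(List<T> syncList, T value) {
        if (!syncList.contains(value)) {
            return syncList.add(value);
        }
        return false;
    }

    // Holding the wrapper's lock across both calls makes the pair atomic.
    static <T> boolean addIfAbsent(List<T> syncList, T value) {
        synchronized (syncList) {
            if (!syncList.contains(value)) {
                return syncList.add(value);
            }
            return false;
        }
    }

    // Several threads try to add the same values; with the locked
    // version each value ends up in the list exactly once.
    static int runThreads(int threads, int values) {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int v = 0; v < values; v++) {
                    addIfAbsent(list, v);
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct values stored: " + runThreads(4, 200));
    }
}
```

Swapping addIfAbsentRacy into runThreads may occasionally store duplicates, although a given run is not guaranteed to show it. As an example of the "tools to help you" mentioned for java.util.concurrent, CopyOnWriteArrayList offers an atomic addIfAbsent method of its own.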

I can't think of a good reason to ever prefer Vector over ArrayList. List operations on a Vector are synchronized, meaning multiple threads can alter it safely. And like you say, ArrayList's operations can be synchronized using Collections.synchronizedList.
Remember that even when using synchronized lists, you will still encounter ConcurrentModificationExceptions if iterating over the collection while it's being modified by another thread. So it's important to coordinate that access externally (your second option).
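One common technique for avoiding the iteration problem is to iterate over a copy of the collection, for example a new ArrayList created from the list while holding its lock. Note that Collections.unmodifiableList returns a read-only view of the underlying list, not a copy, so it does not help on its own.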

I would always go to the java.util.concurrent package (http://java.sun.com/javase/6/docs/api/java/util/concurrent/package-summary.html) first and see if there is a suitable collection. The java.util.concurrent.CopyOnWriteArrayList class is a good fit if the list is changed rarely but iterated often.
Also, I don't believe Vector and Collections.synchronizedList will prevent ConcurrentModificationException.
If you don't find a suitable collection, you'd have to do your own synchronization, and if you don't want to hold a lock while iterating, you may want to consider making a copy and iterating over the copy.

I don't believe that the Iterator returned by a Vector is in any way synchronized, meaning that Vector can't guarantee (on its own) that one thread isn't modifying the underlying collection while another thread is iterating through it.
I believe that to ensure that iterating is thread-safe, you will have to handle the synchronization on your own. Assuming that the same Vector/List/object is being shared by multiple threads (and thus leading to your issues), can't you just synchronize on that object itself?

The safest solution is to avoid concurrent access to shared data altogether. Instead of having non-EDT threads operate on the same data, have them call SwingUtilities.invokeLater() with a Runnable that performs the modifications.
Seriously, shared-data concurrency is a viper's nest where you'll never know if there isn't another race condition or deadlock hiding somewhere, biding its time to bite you in the ass at the worst possible occasion.

CopyOnWriteArrayList is worth a look. It is designed for a list that is mostly read from. Every write causes it to create a new array behind the scenes, so threads iterating over the old array do not get a ConcurrentModificationException.
