In Java, when several threads access the same variable, each thread may sometimes work with its own copy of it. So if I set the variable to 10 in one thread and then read it from another thread, I may not get 10 (because the second thread is reading from another copy of the variable!).
To fix this problem in Java, all I had to do was use the keyword volatile, for example:
volatile int i = 123;
Does this problem also exist in C++? If so, how can I fix it?
Note: I am using Visual C++ 2010.
Yes, the same problem exists in C++. But since C had already introduced the keyword volatile with a different meaning (not related to threads), and C++ uses the keyword in the same way, you can't use volatile in C++ like you can in Java.
Instead, you're probably better off using std::atomic<T> (or boost::). It's not always the most efficient alternative, but it's simple. If this turns out to be a bottleneck, you can relax the std::memory_order used by std::atomic.
Having said that about standard C++, MSVC++ as an extension does guarantee that multiple threads can access a shared volatile variable. IIRC, all threads will eventually see the same value, and no threads will travel back in time. (That is to say, if 0 and 1 are written to a variable sequentially, no thread will ever see the sequence 1,0)
I am writing a multithreaded webcrawler, where there is one WebCrawler object which uses an ExecutorService to process WebPages and extract anchors from each page. I have a method defined in the WebCrawler class which can be called by WebPages to add extracted sublinks to the WebCrawler's Set of nextPagestoVisit, and the method currently looks like this:
public synchronized void addSublinks(Set<WebPage> sublinks) {
this.nextPagestoVisit.addAll(sublinks);
}
Currently I am using a synchronized method. However, I am considering other possible options.
Making the Set a synchronizedSet:
public Set<WebPage> nextPagestoVisit = Collections.synchronizedSet(new HashSet<WebPage>());
Making the Set volatile:
public volatile Set<WebPage> nextPagestoVisit = new HashSet<WebPage>();
Are both of these two alternatives sufficient on their own? (I am assuming that the synchronized method approach is sufficient). Or would I have to combine them with other safety measures? If they all work, which one would be the best approach? If one or both do not work, please provide a short explanation of why (ie. what kind of scenario would cause problems). Thanks
Edit: To be clear, my goal is to ensure that if two WebPages both try to add their sublinks at the same time, one write will not be overwritten by the other (ie. all sublinks will successfully be added to the Set).
Making the variable that holds the set volatile will do nothing for you. For a start this only affects the "pointer" to the set, not the set itself. Then it means the atomic updates to the pointer will be seen by all threads. It does nothing for the Set.
Making the Set a synchronizedSet does what you want. As would either synchronized blocks or Semaphores. However both would add more boilerplate than just using synchronizedSet and are an additional vector for bugs.
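To illustrate the synchronizedSet option, here is a minimal sketch. It assumes String URLs as a stand-in for the WebPage class, and the class name SublinkCollector and the runDemo helper are made up for the example. Several threads call addSublinks at the same time, and no additions should be lost because addAll on a synchronizedSet holds the set's lock for the whole call.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the crawler: String URLs instead of WebPage objects.
class SublinkCollector {
    private final Set<String> nextPagesToVisit =
            Collections.synchronizedSet(new HashSet<String>());

    // No synchronized keyword needed here: the wrapper locks internally during addAll.
    void addSublinks(Set<String> sublinks) {
        nextPagesToVisit.addAll(sublinks);
    }

    int size() {
        return nextPagesToVisit.size();
    }

    // Starts `threads` writers that each add `perThread` distinct links; returns the final size.
    static int runDemo(int threads, int perThread) {
        SublinkCollector collector = new SublinkCollector();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                Set<String> links = new HashSet<String>();
                for (int i = 0; i < perThread; i++) {
                    links.add("page-" + id + "-" + i);
                }
                collector.addSublinks(links);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return collector.size();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```

Note that iterating over a synchronizedSet still requires a manual synchronized block on the set, as its Javadoc points out; only the individual method calls are locked for you.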
I am not sure that you know what the volatile keyword actually does. It does not ensure mutual exclusion. Quoting from here :
"Using volatile, on the other hand, forces all accesses (read or write) to the volatile variable to occur to main memory, effectively keeping the volatile variable out of CPU caches. This can be useful for some actions where it is simply required that visibility of the variable be correct and order of accesses is not important."
You do have however several alternatives:
Using a synchronized block
synchronized (lock) { // lock: any one object shared by all threads that touch the set
//synchronized code
}
Using alternatives like semaphores
Semaphore semaphore = new Semaphore(1); // a single permit makes it behave like a mutex
semaphore.acquire();
...
semaphore.release();
Again, note that you say you are trying to achieve synchronized access. If all you need is to ensure that reads always see the freshest value of the variable, then volatile is a fairly simple solution.
From this post: http://www.javamex.com/tutorials/synchronization_volatile_typical_use.shtml
public class StoppableTask extends Thread {
private volatile boolean pleaseStop;
public void run() {
while (!pleaseStop) {
// do some stuff...
}
}
public void tellMeToStop() {
pleaseStop = true;
}
}
If the variable were not declared volatile (and without other
synchronization), then it would be legal for the thread running the
loop to cache the value of the variable at the start of the loop and
never read it again.
In Java 5 or later:
Is the last paragraph correct?
So, at exactly what moment can a thread cache the value of the pleaseStop variable (and for how long)? Just before calling one of StoppableTask's methods (run, tellMeToStop) on the object? (And must the thread update the variable when exiting the method at the latest?)
Can you point me to a documentation/tutorial reference about this (Java 5 or later)?
Update: here is my compilation of the answers posted on this question:
Without using volatile or synchronized, there are actually two problems with the above program:
1- A thread can cache the variable pleaseStop from the very moment the thread starts and never update it again, so the loop would keep going forever. This can be solved by using either volatile or synchronized. This thread cache mechanism does not exist in C.
2- The Java compiler can optimise the code and replace while(!pleaseStop) {...} with if (!pleaseStop) { while (true) {...}}, so the loop would keep going forever. Again, this can be solved by using either volatile or synchronized. This compiler optimisation also exists in C.
Some more info:
https://www.ibm.com/developerworks/library/j-5things15/
When can it cache?
As for your question about "when can it cache" the value, the answer to that is "always". To understand what that means, read on. Processors have storage called caches, which make it possible for the running thread to access values in memory by reading from the cache rather than from memory. The running thread can also write to this cache as if it were writing the value to memory. Thus, so long as the thread is running, it could be using the cache to store the data it's using. Something has to explicitly happen to flush the value from the cache to memory. For a single-threaded process, this is all well and dandy, but if you have another thread, it might be trying to read the data from memory while the other thread is plugging away reading and writing it to the processor cache without flushing to memory.
How long can it cache?
As for the "for how long" part- the answer is unfortunately forever unless you do something about it. Synchronizing on the data in question is one way to force a flush from the cache so that all threads see the updates to the value. For more detail about ways to cause a flush, see the next section.
Where's some Documentation?
As for the "where's the documentation" question, a good place to start is here. For specifically how you can force a flush, java refers to this by discussing whether one action (such as a data write) appears to "happen before" another (like a data read). For more about this, see here.
What about volatile?
volatile in essence prevents the type of processor caching described above. This ensures that all writes to a variable are visible from other threads. To learn more, the tutorial you linked to in your post seems like a good start.
The relevant documentation is on the volatile keyword (Java Language Specification, Chapter 8.3.1.4) here and the Java memory model (Java Language Specification, Chapter 17.4) here
Declaring the field volatile ensures that there is some synchronization of actions by other threads that might change its value. Without declaring it volatile, Java can reorder operations performed on the field by different threads.
As the Spec says (see 8.3.1.4), for fields declared volatile, "accesses ... occur exactly as many times, and in exactly the same order, as they appear to occur during execution of the program text by each thread..."
So the caching you speak of can happen anytime if the field is not volatile. If the field is declared volatile, the Java memory model enforces consistent access to it. No such enforcement takes place otherwise (unless the threads are synchronized).
The official documentation is in section 17 of the Java Language Specification, especially 17.4 Memory Model.
The correct viewpoint is to start by assuming multi-threaded code won't work, and try to force it to work whether it likes it or not. Without the volatile declaration, or similar, there would be nothing forcing the read of pleaseStop to ever see the write if it happens in another thread.
I agree with the Java Concurrency in Practice recommendation. It is a good distillation of the implications of the JLS material for practical Java programming.
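As a runnable illustration of the point about pleaseStop, here is a small sketch. The class StopFlagDemo and the method workerStopsWithin are names invented for this example, not part of the original post. With the volatile modifier in place, the worker is guaranteed to see the write and leave the loop; without it, nothing in the memory model forces that to happen.

```java
class StopFlagDemo {
    // volatile: a write by one thread is guaranteed to be visible to later reads by other threads.
    private static volatile boolean pleaseStop = false;

    // Starts a worker that spins until the flag is set.
    // Returns true if the worker terminated within the given timeout.
    static boolean workerStopsWithin(long timeoutMillis) {
        pleaseStop = false;
        Thread worker = new Thread(() -> {
            while (!pleaseStop) {
                // do some stuff...
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if something goes wrong
        worker.start();

        pleaseStop = true; // request the stop from the calling thread

        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(workerStopsWithin(5000));
    }
}
```

Removing volatile does not make the loop hang on every JVM or every run, which is exactly why this kind of bug is hard to catch by testing alone.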
If I have an unsynchronized java collection in a multithreaded environment, and I don't want to force readers of the collection to synchronize[1], is a solution where I synchronize the writers and use the atomicity of reference assignment feasible? Something like:
private Collection global = new HashSet(); // start threading after this
void allUpdatesGoThroughHere(Object exampleOperand) {
// My hypothesis is that this prevents operations in the block being re-ordered
synchronized(global) {
Collection copy = new HashSet(global);
copy.remove(exampleOperand);
// Given my hypothesis, we should have a fully constructed object here. So a
// reader will either get the old or the new Collection, but never an
// inconsistent one.
global = copy;
}
}
// Do multithreaded reads here. All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
Rolling your own solution seems to often fail in these type of situations, so I'd be interested in knowing other patterns, collections or libraries I could use to prevent object creation and blocking for my data consumers.
[1] The reasons being a large proportion of time spent in reads compared to writes, combined with the risk of introducing deadlocks.
Edit: A lot of good information in several of the answers and comments, some important points:
A bug was present in the code I posted. Synchronizing on global (a badly named variable) can fail to protect the synchronized block after a swap.
You could fix this by synchronizing on the class (moving the synchronized keyword to the method), but there may be other bugs. A safer and more maintainable solution is to use something from java.util.concurrent.
There is no "eventual consistency guarantee" in the code I posted, one way to make sure that readers do get to see the updates by writers is to use the volatile keyword.
On reflection the general problem that motivated this question was trying to implement lock free reads with locked writes in java, however my (solved) problem was with a collection, which may be unnecessarily confusing for future readers. So in case it is not obvious the code I posted works by allowing one writer at a time to perform edits to "some object" that is being read unprotected by multiple reader threads. Commits of the edit are done through an atomic operation so readers can only get the pre-edit or post-edit "object". When/if the reader thread gets the update, it cannot occur in the middle of a read as the read is occurring on the old copy of the "object". A simple solution that had probably been discovered and proved to be broken in some way prior to the availability of better concurrency support in java.
Rather than trying to roll out your own solution, why not use a ConcurrentHashMap as your set and just set all the values to some standard value? (A constant like Boolean.TRUE would work well.)
I think this implementation works well with the many-readers-few-writers scenario. There's even a constructor that lets you set the expected "concurrency level".
Update: Veer has suggested using the Collections.newSetFromMap utility method to turn the ConcurrentHashMap into a Set. Since the method takes a Map<E,Boolean> my guess is that it does the same thing with setting all the values to Boolean.TRUE behind-the-scenes.
Update: Addressing the poster's example
That is probably what I will end up going with, but I am still curious about how my minimalist solution could fail. – MilesHampson
Your minimalist solution would work just fine with a bit of tweaking. My worry is that, although it's minimal now, it might get more complicated in the future. It's hard to remember all of the conditions you assume when making something thread-safe—especially if you're coming back to the code weeks/months/years later to make a seemingly insignificant tweak. If the ConcurrentHashMap does everything you need with sufficient performance then why not use that instead? All the nasty concurrency details are encapsulated away and even 6-months-from-now you will have a hard time messing it up!
You do need at least one tweak before your current solution will work. As has already been pointed out, you should probably add the volatile modifier to global's declaration. I don't know if you have a C/C++ background, but I was very surprised when I learned that the semantics of volatile in Java are actually much more complicated than in C. If you're planning on doing a lot of concurrent programming in Java then it'd be a good idea to familiarize yourself with the basics of the Java memory model. If you don't make the reference to global a volatile reference then it's possible that no thread will ever see any changes to the value of global until they try to update it, at which point entering the synchronized block will flush the local cache and get the updated reference value.
However, even with the addition of volatile there's still a huge problem. Here's a problem scenario with two threads:
We begin with the empty set, or global={}. Threads A and B both have this value in their thread-local cached memory.
Thread A obtains the synchronized lock on global and starts the update by making a copy of global and adding the new key to the set.
While Thread A is still inside the synchronized block, Thread B reads its local value of global onto the stack and tries to enter the synchronized block. Since Thread A is currently inside the monitor Thread B blocks.
Thread A completes the update by setting the reference and exiting the monitor, resulting in global={1}.
Thread B is now able to enter the monitor and makes a copy of the global={1} set.
Thread A decides to make another update, reads in its local global reference and tries to enter the synchronized block. Since Thread B currently holds the lock on {} there is no lock on {1} and Thread A successfully enters the monitor!
Thread A also makes a copy of {1} for purposes of updating.
Now Threads A and B are both inside the synchronized block and they have identical copies of the global={1} set. This means that one of their updates will be lost! This situation is caused by the fact that you're synchronizing on an object stored in a reference that you're updating inside your synchronized block. You should always be very careful which objects you use to synchronize. You can fix this problem by adding a new variable to act as the lock:
private volatile Collection global = new HashSet(); // start threading after this
private final Object globalLock = new Object(); // final reference used for synchronization
void allUpdatesGoThroughHere(Object exampleOperand) {
// My hypothesis is that this prevents operations in the block being re-ordered
synchronized(globalLock) {
Collection copy = new HashSet(global);
copy.remove(exampleOperand);
// Given my hypothesis, we should have a fully constructed object here. So a
// reader will either get the old or the new Collection, but never an
// inconsistent one.
global = copy;
}
}
This bug was insidious enough that none of the other answers have addressed it yet. It's these kinds of crazy concurrency details that cause me to recommend using something from the already-debugged java.util.concurrent library rather than trying to put something together yourself. I think the above solution would work—but how easy would it be to screw it up again? This would be so much easier:
private final Set<Object> global = Collections.newSetFromMap(new ConcurrentHashMap<Object,Boolean>());
Since the reference is final you don't need to worry about threads using stale references, and since the ConcurrentHashMap handles all the nasty memory model issues internally you don't have to worry about all the nasty details of monitors and memory barriers!
According to the relevant Java Tutorial,
We have already seen that an increment expression, such as c++, does not describe an atomic action. Even very simple expressions can define complex actions that can decompose into other actions. However, there are actions you can specify that are atomic:
Reads and writes are atomic for reference variables and for most primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long and double variables).
This is reaffirmed by Section §17.7 of the Java Language Specification
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
It appears that you can indeed rely on reference access being atomic; however, recognize that this does not ensure that all readers will read an updated value for global after this write -- i.e. there is no memory ordering guarantee here.
If you use an implicit lock via synchronized on all access to global, then you can forge some memory consistency here... but it might be better to use an alternative approach.
You also appear to want the collection in global to remain immutable... luckily, there is Collections.unmodifiableSet which you can use to enforce this. As an example, you should likely do something like the following...
private volatile Collection global = Collections.unmodifiableSet(new HashSet());
... that, or using AtomicReference,
private AtomicReference<Collection> global = new AtomicReference<>(Collections.unmodifiableSet(new HashSet()));
You would then use Collections.unmodifiableSet for your modified copies as well.
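A rough sketch of how the AtomicReference variant could look with updates included. The class name SnapshotSet, the String element type and the runDemo helper are assumptions made for the example. Instead of a lock, each writer retries with compareAndSet until its modified copy replaces the snapshot it started from, so concurrent updates are not lost.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

class SnapshotSet {
    // Readers always get a complete, unmodifiable snapshot.
    private final AtomicReference<Set<String>> global =
            new AtomicReference<Set<String>>(Collections.unmodifiableSet(new HashSet<String>()));

    // Copy the current snapshot, modify the copy, then try to publish it.
    // If another writer published in the meantime, start over from the newer snapshot.
    void add(String element) {
        while (true) {
            Set<String> current = global.get();
            Set<String> copy = new HashSet<String>(current);
            copy.add(element);
            if (global.compareAndSet(current, Collections.unmodifiableSet(copy))) {
                return;
            }
        }
    }

    // Lock-free read: whatever snapshot is returned never changes afterwards.
    Set<String> snapshot() {
        return global.get();
    }

    // Starts `threads` writers that each add `perThread` distinct elements; returns the final size.
    static int runDemo(int threads, int perThread) {
        SnapshotSet set = new SnapshotSet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    set.add(id + ":" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return set.snapshot().size();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 500)); // prints 2000
    }
}
```

Every write copies the whole set, so this only pays off when reads vastly outnumber writes.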
// ... All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
You should know that making a copy here is redundant, as internally for (Object elm : global) creates an Iterator as follows...
final Iterator it = global.iterator();
while (it.hasNext()) {
Object elm = it.next();
}
There is therefore no chance of switching to an entirely different value for global in the midst of reading.
All that aside, I agree with the sentiment expressed by DaoWen... is there any reason you're rolling your own data structure here when there may be an alternative available in java.util.concurrent? I figured maybe you're dealing with an older Java, since you use raw types, but it won't hurt to ask.
You can find copy-on-write collection semantics provided by CopyOnWriteArrayList, or its cousin CopyOnWriteArraySet (which implements a Set using the former).
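A quick sketch of what those copy-on-write semantics mean in practice (the class and method names are made up for the demonstration). The iterator works on the array as it was when the iterator was created, so the set can be modified during the loop without a ConcurrentModificationException and without affecting the elements the loop sees.

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

class CopyOnWriteDemo {
    // Returns {elements seen by the loop, elements left in the set afterwards}.
    static int[] iterateWhileModifying() {
        Set<Integer> global = new CopyOnWriteArraySet<Integer>();
        global.add(1);
        global.add(2);
        global.add(3);

        int seen = 0;
        for (Integer elem : global) {
            // The loop iterates over the snapshot taken when it started,
            // so removing from the live set here neither throws nor shortens the loop.
            global.remove(elem);
            seen++;
        }
        return new int[] { seen, global.size() };
    }

    public static void main(String[] args) {
        int[] result = iterateWhileModifying();
        System.out.println(result[0] + " seen, " + result[1] + " left"); // prints "3 seen, 0 left"
    }
}
```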
Also suggested by DaoWen, have you considered using a ConcurrentHashMap? They guarantee that using a for loop as you've done in your example will be consistent.
Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.
Internally, an Iterator is used for enhanced for over an Iterable.
You can craft a Set from this by utilizing Collections.newSetFromMap as follows:
final Set<E> safeSet = Collections.newSetFromMap(new ConcurrentHashMap<E, Boolean>());
...
/* guaranteed to reflect the state of the set at read-time */
for (final E elem : safeSet) {
...
}
I think your original idea was sound, and DaoWen did a good job getting the bugs out. Unless you can find something that does everything for you, it's better to understand these things than hope some magical class will do it for you. Magical classes can make your life easier and reduce the number of mistakes, but you do want to understand what they are doing.
ConcurrentSkipListSet might do a better job for you here. It could get rid of all your multithreading problems.
However, it is usually slower than a HashSet (although HashSets and SkipLists/Trees are hard to compare). If you are doing a lot of reads for every write, what you've got will be faster. More importantly, if you update more than one entry at a time, your reads could see inconsistent results. If you expect that whenever there is an entry A there is an entry B, and vice versa, the skip list could give you one without the other.
With your current solution, to the readers, the contents of the map are always internally consistent. A read can be sure there's an A for every B. It can be sure that the size() method gives the precise number of elements that will be returned by the iterator. Two iterations will return the same elements in the same order.
In other words, allUpdatesGoThroughHere and ConcurrentSkipListSet are two good solutions to two different problems.
Can you use the Collections.synchronizedSet method? From HashSet Javadoc http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html
Set s = Collections.synchronizedSet(new HashSet(...));
Replace the synchronized by making global volatile and you'll be alright as far as the copy-on-write goes.
Although the assignment is atomic, in other threads it is not ordered with the writes to the object referenced. There needs to be a happens-before relationship which you get with a volatile or synchronising both reads and writes.
The problem of multiple updates happening at once is separate - use a single thread or whatever you want to do there.
If you used a synchronized for both reads and writes then it'd be correct but the performance may not be great with reads needing to hand-off. A ReadWriteLock may be appropriate, but you'd still have writes blocking reads.
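For reference, a minimal sketch of the ReadWriteLock variant (ReadWriteLockedSet is a made-up name, and String elements are assumed). Any number of readers can hold the read lock together, while a writer gets exclusive access, so writes still block reads as noted above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadWriteLockedSet {
    private final Set<String> items = new HashSet<String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Readers share the read lock, so they do not block each other.
    boolean contains(String item) {
        lock.readLock().lock();
        try {
            return items.contains(item);
        } finally {
            lock.readLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return items.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers take the exclusive write lock; readers and other writers wait meanwhile.
    boolean add(String item) {
        lock.writeLock().lock();
        try {
            return items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean remove(String item) {
        lock.writeLock().lock();
        try {
            return items.remove(item);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Iteration would need the read lock held for the whole loop, which is one reason the copy-on-write approach is attractive for read-heavy use.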
Another approach to the publication issue is to use final field semantics to create an object that is (in theory) safe to be published unsafely.
Of course, there are also concurrent collections available.
I have the following protobuf msg defined:
message Counts {
repeated int32 counts = 1;
}
which is shared between threads R and W as a builder:
private final Counts.Builder countsBuilder;
Thread R will only read from countsBuilder and W will only write to countsBuilder.
The shared builder will be read, written-to and (at some point) built & sent over the network.
AFAIK, concurrent reads to messages are fine, but anything else must be synchronized at a higher level by the developer? So, I can't actually write and read to the shared builder at the same time?
If this is not inherently thread-safe, I'm thinking of using some kind of thread-safe Collection<Integer> which I'll use for reading/writing and will (at some point) create a brand new message right before sending it over the network. Or am I missing something?
Thanks!
EDIT 1: I'm using protobuf 2.4.1 and java 6
EDIT 2: Some terminology and spelling fixes.
You should be fine if you synchronize both your read and writes:
synchronized (countsBuilder) {
// modify countsBuilder
}
But remember that you also need to make sure that there aren't any race conditions when building the message; the writer thread is not allowed to make any writes after the message has been built.
According to https://developers.google.com/protocol-buffers/docs/reference/cpp, it's not thread-safe in C++.
And also not in Java: https://developers.google.com/protocol-buffers/docs/reference/java-generated
I encountered this problem recently and found this. Here is what it says about C++:
A note on thread-safety:
Thread-safety in the Protocol Buffer library follows a simple rule: unless explicitly noted otherwise, it is always safe to use an object from multiple threads simultaneously as long as the object is declared const in all threads (or, it is only used in ways that would be allowed if it were declared const). However, if an object is accessed in one thread in a way that would not be allowed if it were const, then it is not safe to access that object in any other thread simultaneously.
Put simply, read-only access to an object can happen in multiple threads simultaneously, but write access can only happen in a single thread at a time.
And about Java:
Note that builders are not thread-safe, so Java synchronization should be used whenever it is necessary for multiple different threads to be modifying the contents of a single builder.
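The alternative floated in the question (collect the values in a thread-safe structure and build a fresh message just before sending) could be sketched roughly as below. CountsBuffer is a hypothetical helper, and the protobuf call is left as a comment so that the example compiles without the library; it assumes the standard generated addAllCounts method for the repeated counts field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer shared between the writer thread W and the reader thread R.
class CountsBuffer {
    private final List<Integer> counts = new ArrayList<Integer>();

    // Called by W.
    synchronized void add(int count) {
        counts.add(count);
    }

    // Called by R (or the sender). Returns an independent copy, so the caller can
    // build the message afterwards without holding the lock.
    synchronized List<Integer> snapshot() {
        return new ArrayList<Integer>(counts);
    }

    public static void main(String[] args) {
        CountsBuffer buffer = new CountsBuffer();
        buffer.add(1);
        buffer.add(2);
        List<Integer> toSend = buffer.snapshot();
        // With the generated class on the classpath, the message would be built here, e.g.:
        // Counts message = Counts.newBuilder().addAllCounts(toSend).build();
        System.out.println(toSend); // prints [1, 2]
    }
}
```

This way no builder is ever shared between threads; each message is built from a private copy by the thread that sends it.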
Is it true that if I only use immutable data type, my Java program would be thread safe?
Any other factors will affect the thread safety?
Would appreciate it if you could provide an example. Thanks!
Thread safety is about protecting shared data, and immutable objects are protected because they are read-only. Well, apart from when you create them, but creating an object is thread-safe.
It's worth saying that designing a large application that ONLY uses immutable objects to achieve thread safety would be difficult.
It's a complicated subject, and I would recommend reading Java Concurrency in Practice, which is a very good place to start.
It is true. The problem is that it's a pretty serious limitation to place on your application to only use immutable data types. You can't have any persistent objects with state which exist across threads.
I don't understand why you'd want to do it, but that doesn't make it any less true.
Details and example: http://www.javapractices.com/topic/TopicAction.do?Id=29
If every single variable is immutable (never changed once assigned) you would indeed have a trivially thread-safe program.
Functional programming environments take advantage of this.
However, it is pretty difficult to do pure functional programming in a language not designed for it from the ground up.
A trivial example of something you can't do in a pure functional program is use a loop, as you can't increment a counter. You have to use recursive functions instead to achieve the same effect.
If you are just straying into the world of thread safety and concurrency, I'd heartily recommend the book Java Concurrency in Practice, by Goetz. It is written for Java, but actually the issues it talks about are relevant in other languages too, even if the solutions to those issues may be different.
Immutability allows for safety against certain things that can go wrong with multi-threaded cases. Specifically, it means that the properties of an object visible to one thread cannot be changed by another thread while that first thread is using it (since nothing can change it, then clearly another thread can't).
Of course, this only works as far as that object goes. If a mutable reference to the object is also shared, then some cases of cross-thread bugs can happen by something putting a new object there (but not all, since it may not matter if a thread works on an object that has already been replaced, but then again that may be crucial).
In all, immutability should be considered one of the ways that you can ensure thread-safety, but neither the sole way nor necessarily sufficient in itself.
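A small sketch of that distinction, with invented names (Point, PointHolder). The Point objects themselves are immutable and can be shared freely, but the reference that says which Point is current is mutable shared state and still needs protection: volatile for visibility, and a lock around the read-modify-write so that concurrent updates are not lost.

```java
// Immutable: final class, final fields, no setters. Safe to share between threads.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() { return x; }
    int y() { return y; }

    // "Modification" returns a new object and leaves this one untouched.
    Point withX(int newX) {
        return new Point(newX, y);
    }
}

class PointHolder {
    // The reference is mutable shared state even though each Point is immutable.
    private volatile Point current = new Point(0, 0);

    // Lock-free read: the returned Point can never change under the reader.
    Point get() {
        return current;
    }

    // Read-modify-write on the reference; synchronized so that two threads
    // cannot both start from the same Point and overwrite each other.
    synchronized void moveRight() {
        Point p = current;
        current = p.withX(p.x() + 1);
    }

    // Starts `threads` workers that each call moveRight `steps` times; returns the final x.
    static int runDemo(int threads, int steps) {
        PointHolder holder = new PointHolder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < steps; i++) {
                    holder.moveRight();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return holder.get().x();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```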
Although immutable objects are a help with thread safety, you may find "local variables" and "synchronized" more practical for real-world programming.
Any program where no mutable aspect of program state is accessed by more than one thread will be trivially thread-safe, as each thread may as well be its own separate program. Useful multi-threading, however, generally requires interaction between threads, which implies the existence of some mutable shared state.
The key to safe and efficient multi-threading is to incorporate mutability at the right "design level". Ideally, each aspect of program state should be representable by one immutably-rooted(*), mutable reference to an object whose observable state is immutable. Only one thread at a time may try to change the state represented by a particular mutable reference. Efficient multi-threading requires that the "mutable layer" in a program's state be low enough that different threads can use different parts of it. For example, if one has an immutable AllCustomers data structure and two threads simultaneously attempted to change different customers, each would generate a version of the AllCustomers data structure which included its own changes, but not that of the other thread. No good. If AllCustomers were a mutable array of CustomerState objects, however, it would be possible for one thread to be working on AllCustomers[4] while another was working on AllCustomers[9], without interference.
(*) The rooted path must exist when the aspect of state becomes relevant, and must not change while the access is relevant. For example, one could design an AddOnlyList<thing> which holds a thing[][] called Arr that was initialized to size 32. When the first thing is added, Arr[0] would be initialized, using CompareExchange, to an array of 16 thing. The next 15 things would go in that array. When the 17th thing is added, Arr[1] would be initialized using CompareExchange to an array of size 32 (which would hold the new item and the 31 items after it). When the 49th thing is added, Arr[2] would be initialized for 64 items. Note that while thing itself and the arrays contained thereby would not be totally immutable, only the very first access to any element would be a write, and once Arr[x][y] holds a reference to something, it would continue to do so as long as Arr exists.