Java Multi-Threading Beginner Questions


I am working on a scientific application that has readily separable parts that can proceed in parallel. So, I've written those parts to each run as independent threads, though not for what appears to be the standard reason for separating things into threads (i.e., not blocking some quit command or the like).
A few questions:
Does this actually buy me anything on standard multi-core desktops - i.e., will the threads actually run on the separate cores if I have a current JVM, or do I have to do something else?
I have a few objects which are read (though never written) by all the threads. Are there potential problems with that? Solutions to those problems?
For actual clusters, can you recommend frameworks to distribute the threads to the various nodes so that I don't have to manage that myself (well, if such exist)? CLARIFICATION: by this, I mean either something that automatically converts threads into tasks for individual nodes, or something that makes the entire cluster look like a single JVM (i.e., so it could send threads to whatever processors it can access), or whatever. Basically, implement the parallelization in a useful way on a cluster, given that I've built it into the algorithm, with minimal job husbandry on my part.
Bonus: Most of the evaluation consists of set comparisons (e.g., union, intersection, contains) with some mapping from keys to get the pertinent sets. I have some limited experience with FORTRAN, C, and C++ (semester of scientific computing for the first, and HS AP classes 10 years ago for the other two) - what sort of speed/ease of parallelization gains might I find if I tied my Java front-end to an algorithmic back-end in one of those languages, and what sort of headache might my level of experience find implementing those operations in those languages?

Yes, using independent threads will use multiple cores in a normal JVM, without you having to do any work.
If anything is only ever read, it should be fine to be read by multiple threads. If you can make the objects in question immutable (to guarantee they'll never be changed), that's even better.
I'm not sure what sort of clustering you're considering, but you might want to look at Hadoop. Note that distributed computing distributes tasks rather than threads (normally, anyway).

Multi-core Usage
Java runtimes conventionally schedule threads to run concurrently on all available processors and cores. I think it's possible to restrict this, but it would take extra work; by default, there is no restriction.
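As a minimal sketch of the point above, the following splits a computation into one task per available core and lets the JVM and OS schedule the worker threads. The class name ParallelSum and the method sumTo are made up for illustration; the workload (summing 1..n) is a stand-in for the real separable parts of an application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not from the original answers.
final class ParallelSum {
    // Sums 1..n by splitting the range into one chunk per worker thread.
    static long sumTo(long n) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long chunk = (n + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final long from = w * chunk + 1;
                final long to = Math.min(n, (w + 1) * chunk);
                Callable<Long> task = () -> {
                    long s = 0;
                    for (long i = from; i <= to; i++) s += i;
                    return s;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) total += part.get(); // wait for each chunk
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Nothing here pins threads to cores; the pool size is only a hint about how much parallelism is worth requesting.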
Immutable Objects
For read-only objects, declare their member fields as final, which will ensure that they are assigned when the object is created and never changed. If a field is not final, even if it is never changed after construction, there can be some "visibility" issues in a multi-threaded program. This could result in the assignments made by one thread never becoming visible to another.
Any mutable fields that are accessed by multiple threads should be declared volatile, be protected by synchronization, or use some other concurrency mechanism to ensure that changes are consistent and visible among threads.
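A small sketch of a read-only object along these lines, assuming the shared data is a set of strings (the class name ReadOnlyTags and its methods are invented for this example). The only field is final, the constructor takes a defensive copy, and set operations return new sets rather than touching shared state.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical example class, not from the original answers.
final class ReadOnlyTags {
    // final: the reference is safely visible to other threads once the constructor returns
    private final Set<String> tags;

    ReadOnlyTags(Set<String> source) {
        // Defensive copy so later changes to 'source' cannot leak into this object.
        this.tags = Collections.unmodifiableSet(new HashSet<>(source));
    }

    boolean contains(String tag) {
        return tags.contains(tag);
    }

    // Returns a fresh set; the shared state is never modified.
    Set<String> intersection(ReadOnlyTags other) {
        Set<String> result = new HashSet<>(tags);
        result.retainAll(other.tags);
        return result;
    }
}
```

Instances of such a class can be handed to any number of threads without further synchronization, as long as nothing mutable escapes from them.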
Distributed Computing
The most widely used framework for distributed processing of this nature in Java is called Hadoop. It uses a paradigm called map-reduce.
Native Code Integration
Integrating with other languages is unlikely to be worthwhile. Because of its adaptive bytecode-to-native compiler, Java is already extremely fast on a wide range of computing tasks. It would be wrong to assume that another language is faster without actual testing. Also, integrating with "native" code using JNI is extremely tedious, error-prone, and complicated; using simpler interfaces like JNA is very slow and would quickly erase any performance gains.

As some people have said, the answers are:
Threads on cores - Yes. Java has had support for native threads for a long time. Most OSes have provided kernel threads which automagically get scheduled to any CPUs you have (implementation performance may vary by OS).
The simple answer is it will be safe in general. The more complex answer is that you have to ensure that your Object is actually created & initialized before any threads can access it. This is solved one of two ways:
Let the class loader solve the problem for you using a Singleton (and lazy class loading):
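public class MyImmutableObject
{
    // The holder class is loaded, and the instance created, on the first call to getInstance().
    private static class MyImmutableObjectInstance {
        private static final MyImmutableObject instance = new MyImmutableObject();
    }

    // static, so that callers do not need an instance in order to obtain the instance
    public static MyImmutableObject getInstance() {
        return MyImmutableObjectInstance.instance;
    }
}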
Explicitly using acquire/release semantics to ensure a consistent memory model:
MyImmutableObject foo = null;
volatile boolean objectReady = false;
// initializer thread:
....
/// create & initialize object for use by multiple threads
foo = new MyImmutableObject();
foo.initialize();
// release barrier
objectReady = true;
// start worker threads
public void run() {
// acquire barrier
if (!objectReady)
throw new IllegalStateException("Memory model violation");
// start using immutable object foo
}
I don't recall off the top of my head how you can exploit the memory model of Java to perform the latter case. I believe, if I remember correctly, that a write to a volatile variable is equivalent to a release barrier, while a read from a volatile variable is equivalent to an acquire barrier. Also, the reason for making the boolean volatile as opposed to the object is that access of a volatile variable is more expensive due to the memory model constraints - thus, the boolean allows you to enforce the memory model & then the object access can be done much faster within the thread.
As mentioned, there's all sorts of RPC mechanisms. There's also RMI which is a native approach for running code on remote targets. There's also frameworks like Hadoop which offer a more complete solution which might be more appropriate.
For calling native code, it's pretty ugly - Sun really discourages use by making JNI an ugly complicated mess, but it is possible. I know that there was at least one commercial Java framework for loading & executing native dynamic libraries without needing to worry about JNI (not sure if there are any free or OSS projects).
Good luck.

Related

Java uses synchronisation... What does Haskell use?

So I am pretty new to Haskell and would like to know: if synchronisation is used to prevent corruption when multithreading in Java, how is this done in Haskell? I've only found useless or overly complicated responses on Google.

Your question is a bit ambiguous since one may use multithreading for either concurrency or parallelism, which are distinct problems with distinct solutions.
In both cases, you'll need to make sure your programs are compiled with SMP support and run using multiple RTS threads: see the GHC manual's section about concurrency.
Concurrency
As others have pointed out, synchronization will be a non-problem in the vast majority of your code, since you'll mostly be dealing with pure functions. This is true in any language if you religiously avoid mutable state unless it is properly wrapped behind a pure API. Concurrency is an area where Haskell shines because its semantics require purity. Types are used to describe impure operations instead, making it dead easy to spot code where some sort of synchronization might be needed.
Typically, your application's state will be backed by a transactional database which will handle synchronization and persistence for you. You will not need any additional synchronization at all if your concurrent application does not have additional state.
In other cases, Haskell has a handy Software Transactional Memory implementation. It allows you to write and compose code in an imperative-looking style, without explicit locking, while having atomicity and guarantees against deadlocks. It is the foolproof(tm) way to write concurrent code.
Lastly, there are some low-level primitives available in base: plain old mutable references with IORef, semaphores, and MVars which can be used as if they were variables protected by a mutex.
There are also channels in base, but beware: they are unbounded!
Parallelism
This is also an area where Haskell shines because of its non-strict semantics. Non-strictness allows you to write code that expresses your logic in a straightforward manner while not getting committed to a specific evaluation order.
As a consequence, you can describe a parallel evaluation strategy separately from the business logic. Writing parallel code is then just a matter of placing the right annotation in the right spot.
Here is an example that was/is used in production at Bdellium:
map outputParticipant parts `using` parListChunk 10 rdeepseq
-- business logic: map outputParticipant parts; evaluation strategy: parListChunk 10 rdeepseq
The code can be understood as follows: Parallel workers will fully evaluate the results of mapping the outputParticipant function to individual items in the parts list, distributing the work in chunks of 10 elements.

This answer pertains to functional languages in general: no synchronisation is needed. Functions in functional programming have no side effects. They accept a value and return a value, and there is no mutable state. Such functions are inherently thread safe.

Multi-thread state visibility in Java: is there a way to turn the JVM into the worst case scenario?

Suppose our code has two threads (A and B) that both hold a reference to the same instance of this class somewhere:
public class MyValueHolder {
private int value = 1;
// ... getter and setter
}
When Thread A does myValueHolder.setValue(7), there is no guarantee that Thread B will ever read that value: myValueHolder.getValue() could - in theory - keep returning 1 forever.
In practice however, the hardware will clear the second level cache sooner or later, so Thread B will read 7 sooner or later (usually sooner).
Is there any way to make the JVM emulate that worst case scenario for which it keeps returning 1 forever for Thread B? That would be very useful to test our multi-threaded code with our existing tests under those circumstances.

jcstress maintainer here. There are multiple ways to answer that question.
The easiest solution would be wrapping the getter in a loop and letting the JIT hoist it. This is allowed for non-volatile field reads, and simulates the visibility failure with a compiler optimization.
A more sophisticated trick involves getting a debug build of OpenJDK and using -XX:+StressLCM -XX:+StressGCM, effectively fuzzing the instruction scheduling. Chances are the load in question will float somewhere you can detect with the regular tests your product has.
I am not sure if there is practical hardware that holds the written value opaque to cache coherency for long enough, but it is fairly easy to build the test case with jcstress. Keep in mind that the optimization from the first approach (loop hoisting) can also happen, so we need to employ a trick to prevent that. I think something like this should work.

It would be great to have a Java compiler that would intentionally perform as many weird (but allowed) transformations as possible to be able to break thread-unsafe code more easily, like Csmith for C. Unfortunately, such a compiler does not exist (as far as I know).
In the meantime, you can try the jcstress library* and exercise your code on several architectures, if possible with weaker memory models (i.e. not x86) to try and break your code:
The Java Concurrency Stress tests (jcstress) is an experimental harness and a suite of tests to aid research in the correctness of concurrency support in the JVM, class libraries, and hardware.
But in the end, unfortunately, the only way to prove that a piece of code is 100% correct is code inspection (and I don't know of a static code analysis tool able to detect all race conditions).
*I have not used it and I am unclear which of jcstress and the java-concurrency-torture library is more up to date (I would suspect jcstress).

Not on a real machine, sadly. Testing multi-threaded code will remain difficult.
As you say, the hardware will clear the second level cache and the JVM has no control over that. The JLS only specifies what must happen, and this is a case where B might never see the updated value of value.
The only way to force this to happen on a real machine is to alter the code in such a way to void your testing strategy i.e. you end up testing different code.
However, you might be able to run this on a simulator that simulates hardware that doesn't clear the second level cache. Sounds like a lot of effort though!

I think you are referring to the principle called "false sharing", where different CPUs must synchronize their caches or else face the possibility that data such as you describe could become mismatched. There is a very good article on false sharing on Intel's website. Intel describes some useful tools in their article for diagnosing this problem. This is a relevant quote:
The primary means of avoiding false sharing is through code
inspection. Instances where threads access global or dynamically
allocated shared data structures are potential sources of false
sharing. Note that false sharing can be obscured by the fact that
threads may be accessing completely different global variables that
happen to be relatively close together in memory. Thread-local storage
or local variables can be ruled out as sources of false sharing.
Although methods described in the article are not what you have asked for (forcing worst-case behavior from the JVM), as already stated this isn't really possible. The methods described in this article are the best way I know to try to diagnose and avoid false sharing.
There are other resources addressing this problem around the web. For example, this article has a suggestion for a way to avoid false sharing in Java. I have not tried this method, so I cannot vouch for it, but I think the author's idea is sound. You might consider trying out his suggestion.

I have previously suggested a worst case behaving JVM for test purposes on the memory model list but the idea didn't seem popular.
So how can you get "worst case JVM behaviour" with existing technology, i.e. how can you test the scenario in the question and get it to fail EVERY time? You could try to find the setup with the weakest memory model possible, but that's unlikely to be perfect.
What I have often considered is using a distributed JVM, similar to how I believe Terracotta works under the covers, so that your application runs on multiple JVMs (either remote or local), with threads of the same application running in different instances. In this setup, inter-JVM thread communication takes place only at memory barriers, e.g. the synchronized keywords you are missing in buggy code (so it conforms to the Java Memory Model), and the placement is configured, i.e. you say that this thread class runs here. No code change is required for your tests, just configuration. Any well-ordered Java application should run out of the box, but this setup would be very intolerant of a badly ordered application (normally a problem, now an asset, because the memory model exhibits very weak but legal behavior). In the example above, if the code is loaded onto a cluster and the two threads run on different nodes, setValue has no effect visible to the other thread unless the code is changed to use synchronized, volatile, etc., at which point the code works as intended.
Now your test for the example above (configured correctly) would fail every time without correct "happens before ordering", which is potentially very useful for tests. The flaw in the plan is that, for complete coverage, you would potentially need a node per application thread (on the same machine or across a cluster) or multiple test runs. If you have thousands of threads, that could be prohibitive, though hopefully they would be pooled and scaled down for E2E test scenarios, or you could run it in a cloud. If nothing else, this kind of setup might be useful in demonstrating the issue.
inter thread communication across JVMs

The example you have given is described as Incorrectly Synchronized in http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4. I think this is always incorrect and will lead to bugs sooner or later. Most of the times later :-).
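For contrast with the incorrectly synchronized holder from the question, here is a sketch of one correctly synchronized variant, assuming a volatile field is acceptable for the use case. The class name VolatileValueHolder and the demo method are made up for illustration. Because the field is volatile, the write of 7 is guaranteed to become visible to the reader, so the spin loop terminates.

```java
// Hypothetical variant of the question's MyValueHolder, for illustration only.
final class VolatileValueHolder {
    // volatile: a write happens-before every subsequent read of the same field
    private volatile int value = 1;

    int getValue() { return value; }

    void setValue(int v) { value = v; }

    // Starts a reader that spins until it sees 7, then returns what the reader saw.
    static int demo() {
        VolatileValueHolder holder = new VolatileValueHolder();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (holder.getValue() != 7) {
                // busy-wait; with a non-volatile field this loop could legally spin forever
            }
            seen[0] = holder.getValue();
        });
        reader.start();
        holder.setValue(7);
        try {
            reader.join(); // join() makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }
}
```

Making the getter and setter synchronized would give the same visibility guarantee; volatile is simply the lighter option when each access is a single read or write.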
To find such incorrectly synchronized code blocks, I use the following algorithm:
Record the threads for all field modifications using instrumentation. If a field is modified by more than one thread without synchronization, I have found a data race.
I implemented this algorithm inside http://vmlens.com, which is a tool to find data races inside java programs.
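The bookkeeping behind the algorithm described above can be sketched roughly as follows. This is not how vmlens is implemented; the class WriteRecorder, its methods, and the idea that some instrumentation calls recordWrite on every unsynchronized field write are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; real tools hook in via bytecode instrumentation.
final class WriteRecorder {
    // field name -> ids of the threads that wrote it without synchronization
    private final Map<String, Set<Long>> writers = new ConcurrentHashMap<>();

    // Assumed to be called by instrumentation on every unsynchronized field write.
    void recordWrite(String field) {
        writers.computeIfAbsent(field, k -> ConcurrentHashMap.newKeySet())
               .add(Thread.currentThread().getId());
    }

    // More than one writing thread without synchronization suggests a data race.
    boolean hasPotentialRace(String field) {
        Set<Long> ids = writers.get(field);
        return ids != null && ids.size() > 1;
    }

    // Records one write from the calling thread and one from a second thread.
    static boolean demo() {
        WriteRecorder recorder = new WriteRecorder();
        recorder.recordWrite("value");
        Thread other = new Thread(() -> recorder.recordWrite("value"));
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return recorder.hasPotentialRace("value");
    }
}
```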

Here's a simple way: just comment out the code for setValue. You can uncomment it after testing. Since in many cases like this a mechanism is needed to fake failures, it would be a good idea to build a general mechanism for all such cases.

C++ (and possibly Java) how are objects locked for synchronization?

When objects are locked in languages like C++ and Java, where (on a low-level scale) is this actually performed? I don't think it's anything to do with the CPU/cache or RAM. My best guesstimate is that this occurs somewhere in the OS? Would it be within the same part of the OS which performs context switching?
I am referring to locking objects, synchronizing on method signatures (Java) etc.
It could be that the answer depends on which particular locking mechanism?

Locking involves a synchronisation primitive, typically a mutex. While naively speaking a mutex is just a boolean flag that says "locked" or "unlocked", the devil is in the detail: The mutex value has to be read, compared and set atomically, so that multiple threads trying for the same mutex don't corrupt its state.
But apart from that, instructions have to be ordered properly so that the effects of a read and write of the mutex variable are visible to the program in the correct order and that no thread inadvertently enters the critical section when it shouldn't because it failed to see the lock update in time.
There are two aspects to memory access ordering: One is done by the compiler, which may choose to reorder statements if that's deemed more efficient. This is relatively trivial to prevent, since the compiler knows when it must be careful. The far more difficult phenomenon is that the CPU itself, internally, may choose to reorder instructions, and it must be prevented from doing so when a mutex variable is being accessed for the purpose of locking. This requires hardware support (e.g. a "lock bit" which causes a pipeline flush and a bus lock).
Finally, if you have multiple physical CPUs, each CPU will have its own cache, and it becomes important that state updates are propagated to all CPU caches before any executing instructions make further progress. This again requires dedicated hardware support.
As you can see, synchronisation is a (potentially) expensive business that really gets in the way of concurrent processing. That, however, is simply the price you pay for having one single block of memory on which multiple independent contexts perform work.

There is no concept of object locking in C++. You will typically implement your own on top of OS-specific functions or use synchronization primitives provided by libraries (e.g. boost::scoped_lock). If you have access to C++11, you can use the locks provided by the threading library which has a similar interface to boost, take a look.
In Java the same is done for you by the JVM.

Every java.lang.Object has a monitor built into it. That's what is used to lock for the synchronized keyword. JDK 5 added the java.util.concurrent packages, which give you more fine-grained choices.
This has a nice explanation:
http://www.artima.com/insidejvm/ed2/threadsynch.html
I haven't written C++ in a long time, so I can't speak to how to do it in that language. It wasn't supported by the language when I last wrote it. I believe it was all 3rd party libraries or custom code.

It does depend on the particular locking mechanism, typically a semaphore, but you cannot be sure, since it is implementation dependent.

All architectures I know of use an atomic Compare And Swap to implement their synchronization primitives. See, for example, AbstractQueuedSynchronizer, which was used in some JDK versions to implement Semaphore and ReentrantLock.
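To make the compare-and-swap idea concrete, here is a deliberately naive spin lock built on AtomicBoolean.compareAndSet. It is a teaching sketch only (the SpinLock class and its demo are invented here) and is not how ReentrantLock or the built-in monitors are implemented; real locks also park waiting threads instead of spinning.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical teaching example, not a production lock.
final class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Atomically flip false -> true; retry while another thread holds the lock.
        while (!locked.compareAndSet(false, true)) {
            Thread.yield();
        }
    }

    void unlock() {
        locked.set(false);
    }

    // Four threads each add 10_000 to a plain int guarded only by the spin lock.
    static int demo() {
        SpinLock lock = new SpinLock();
        int[] counter = new int[1];
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }
}
```

The atomic read-compare-write in compareAndSet is the step that the question suspected lives in the OS; on common hardware it maps to a CPU instruction, with the OS involved only when a thread has to block.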

Why can't the JVM add synchronized/volatile/Lock at runtime?

Since all java applications are run eventually by the JVM, why can't the JVM wrap around single-threaded code into a multi-thread code at runtime depending on how many threads are running/accessing a part of the code.
The JVM sure is aware of the number of threads running and it sure knows which classes are Threads and which part of code can be accessed by multiple threads.
What are the reasons this cannot be implemented or what can make this complex?

Simply spraying synchronized/volatile/Lock on anything that's used by multiple threads does not result in correct multi-threaded behavior. How would the runtime know the correct granularity of locks, for example? How would it avoid deadlocks?
The early collections classes, e.g. Vector and Hashtable, were designed with a similarly naive view of concurrency. Everything was synchronized. It turns out that you could still get into trouble quite easily, however. For example, suppose you wanted to check that a Vector contained at least one element, and if so then you'd remove one. Each of the calls to the Vector would be synchronized, but another thread could execute between these calls, and so you could end up with race condition bugs. (This is what I was referring to when I mentioned granularity of locks, earlier.)
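The Vector example above can be sketched as follows. Vector's own methods synchronize on the Vector instance, so holding that same monitor across both calls makes the check and the removal one atomic step. The class VectorDrain and its methods are made up for illustration.

```java
import java.util.Vector;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class, not from the original answers.
final class VectorDrain {
    // Check-then-act made atomic by holding the Vector's monitor across both calls.
    static boolean removeFirstIfPresent(Vector<Integer> v) {
        synchronized (v) {
            if (v.isEmpty()) {
                return false;
            }
            v.remove(0); // without the outer lock, another thread could empty v first
            return true;
        }
    }

    // Four threads drain a Vector of 1000 elements; returns the number of removals.
    static int demo() {
        Vector<Integer> v = new Vector<>();
        for (int i = 0; i < 1000; i++) v.add(i);
        AtomicInteger removed = new AtomicInteger();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                while (removeFirstIfPresent(v)) removed.incrementAndGet();
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return removed.get();
    }
}
```

The runtime has no way to know that isEmpty() and remove(0) belong together, which is why this lock has to be chosen by the programmer.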

Not possible in general for the JVM
Automatically adding synchronization usually does not lead to a positive effect. Synchronization costs both performance and memory: performance, because the processor must check the underlying locks, and memory, because the locks must be stored somewhere. When the runtime adds locks everywhere, the program will effectively run single-threaded (because every method can only be accessed from one thread at a time), but now with higher costs for the CPU and more memory load (because of the lock handling).
The JVM can remove locks automatically
Usually the Java runtime does not have enough information to add locks in a clever way. But it does the opposite: with so-called "escape analysis" it can check whether a memory block never escapes a certain code block (and is never shared with another thread). If this is the case, several optimizations are applied. One of them is that the VM removes all synchronization for this block.
Database engines can do it
There are systems that have enough information to automatically apply locks: database management systems. The more sophisticated database engines use a technique called "multiversion concurrency control". With this technique, one needs locks only for writing data, not for reading data. So one needs fewer locks than with a traditional approach and more code can run in parallel. But this comes at a cost: sometimes the degree of parallelism becomes too high and the system ends up in an inconsistent state. The system then undoes some of the changes and repeats them at a later time.
Automatic locks with STM and Clojure
This approach can be brought to the JVM in a (to some degree) automatic way. It is then called "software transactional memory". This is very close to your idea of automatic locks and leaves enough room for parallelism to be useful. On the JVM, the language Clojure uses software transactional memory.
So while the JVM cannot add locks automatically in general, Clojure enables this to a certain degree. Try it and see how well it serves you.

I can think about the following reasons:
An application may use static variables, so two applications that partially share the classes they are using might interfere with each other by changing shared state.
What you actually want is implemented by a Java EE container that is running several applications simultaneously. It seems that you are suggesting a JSE container (I have no idea whether this term exists). Try to suggest it to Oracle. It could be a cool JSR!

Why do all Java Objects have wait() and notify() and does this cause a performance hit?

Every Java Object has the methods wait() and notify() (and additional variants). I have never used these and I suspect many others haven't. Why are these so fundamental that every object has to have them and is there a performance hit in having them (presumably some state is stored in them)?
EDIT to emphasize the question. If I have a List<Double> with 100,000 elements then every Double has these methods as it is extended from Object. But it seems unlikely that all of these have to know about the threads that manage the List.
EDIT excellent and useful answers. @Jon has a very good blog post which crystallised my gut feelings. I also agree completely with @Bob_Cross that you should show a performance problem before worrying about it. (Also, as the nth law of successful languages, if it had been a performance hit then Sun or someone would have fixed it.)

Well, it does mean that every object has to potentially have a monitor associated with it. The same monitor is used for synchronized. If you agree with the decision to be able to synchronize on any object, then wait() and notify() don't add any more per-object state. The JVM may allocate the actual monitor lazily (I know .NET does) but there has to be some storage space available to say which monitor is associated with the object. Admittedly it's possible that this is a very small amount (e.g. 3 bytes) which wouldn't actually save any memory anyway due to padding of the rest of the object overhead - you'd have to look at how each individual JVM handled memory to say for sure.
Note that just having extra methods doesn't affect performance (other than very slightly due to the code obviously being present somewhere). It's not like each object or even each type has its own copy of the code for wait() and notify(). Depending on how the vtables work, each type may end up with an extra vtable entry for each inherited method, but that's still only on a per-type basis, not a per-object basis. That's basically going to get lost in the noise compared with the bulk of the storage, which is for the actual objects themselves.
Personally, I feel that both .NET and Java made a mistake by associating a monitor with every object - I'd rather have explicit synchronization objects instead. I wrote a bit more on this in a blog post about redesigning java.lang.Object/System.Object.

Why are these so fundamental that
every object has to have them and is
there a performance hit in having them
(presumably some state is stored in
them)?
tl;dr: They are thread-safety methods and they have small costs relative to their value.
The fundamental realities that these methods support are that:
Java is always multi-threaded. Example: check out the list of Threads used by a process using jconsole or jvisualvm some time.
Correctness is more important than "performance." When I was grading projects (many years ago), I used to have to explain "getting to the wrong answer really fast is still wrong."
Fundamentally, these methods provide some of the hooks to manage per-Object monitors used in synchronization. Specifically, if I have synchronized(objectWithMonitor) in a particular method, I can use objectWithMonitor.wait() to yield that monitor (e.g., if I need another method to complete a computation before I can proceed). In that case, that will allow one other method that was blocked waiting for that monitor to proceed.
On the other hand, I can use objectWithMonitor.notifyAll() to let Threads that are waiting for the monitor know that I am going to be relinquishing the monitor soon. They can't actually proceed until I leave the synchronized block, though.
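The interplay of synchronized, wait() and notifyAll() described above might look like this in a one-slot handoff between two threads. The Handoff class is invented for this sketch; the field name objectWithMonitor follows the fragments in the text.

```java
// Hypothetical example class, for illustration only.
final class Handoff {
    private final Object objectWithMonitor = new Object();
    private Integer slot; // guarded by objectWithMonitor; null means empty

    void put(int value) {
        synchronized (objectWithMonitor) {
            slot = value;
            // Waiters are woken here, but can only proceed once we leave this block.
            objectWithMonitor.notifyAll();
        }
    }

    int take() throws InterruptedException {
        synchronized (objectWithMonitor) {
            while (slot == null) {          // loop guards against spurious wakeups
                objectWithMonitor.wait();   // releases the monitor while waiting
            }
            int value = slot;
            slot = null;
            return value;
        }
    }

    // One thread puts 42, the calling thread takes it.
    static int demo() {
        Handoff handoff = new Handoff();
        Thread producer = new Thread(() -> handoff.put(42));
        producer.start();
        try {
            return handoff.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }
}
```

Whether put or take reaches the monitor first does not matter: if the value is already there, take never waits; if not, wait() gives up the monitor so that put can get in.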
With respect to specific examples (e.g., long Lists of Doubles) where you might worry that there's a performance or memory hit on the monitoring mechanism, here are some points that you should likely consider:
First, prove it. If you think there is a major impact from a core Java mechanism such as multi-threaded correctness, there's an excellent chance that your intuition is false. Measure the impact first. If it's serious and you know that you'll never need to synchronize on an individual Double, consider using doubles instead.
If you aren't certain that you, your co-worker, a future maintenance coder (who might be yourself a year later), etc., will never ever ever need a fine granularity of threaded access to your data, there's an excellent chance that taking these monitors away would only make your code less flexible and maintainable.
Follow-up in response to the question on per-Object vs. explicit monitor objects:
Short answer: @JonSkeet: yes, removing the monitors would create problems: it would create friction. Keeping those monitors in Object reminds us that this is always a multithreaded system.
The built-in object monitors are not sophisticated but they are: easy to explain; work in a predictable fashion; and are clear in their purpose. synchronized(this) is a clear statement of intent. If we force novice coders to use the concurrency package exclusively, we introduce friction. What's in that package? What's a semaphore? Fork-join?
A novice coder can use the Object monitors to write decent model-view-controller code. synchronized, wait and notifyAll can be used to implement naive (in the sense of simple, accessible but perhaps not bleeding-edge performance) thread-safety. The canonical example would be one of these Doubles (posited by the OP) which can have one Thread set a value while the AWT thread gets the value to put it on a JLabel. In that case, there is no good reason to create an explicit additional Object just to have an external monitor.
At a slightly higher level of complexity, these same methods are useful as an external monitoring method. In the example above, I explicitly did that (see objectWithMonitor fragments above). Again, these methods are really handy for putting together relatively simple thread safety.
If you would like to be even more sophisticated, I think you should seriously think about reading Java Concurrency In Practice (if you haven't already). Read and write locks are very powerful without adding too much additional complexity.
Punchline: Using basic synchronization methods, you can exploit a large portion of the performance enabled by modern multi-core processors with thread-safety and without a lot of overhead.
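As a rough illustration of the read and write locks mentioned above, the following sketch guards a plain HashMap with a ReentrantReadWriteLock, so that many readers can proceed together while writers get exclusive access. The class name ReadMostlyCache and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical example class, for illustration only.
final class ReadMostlyCache {
    private final Map<String, Double> values = new HashMap<>(); // guarded by 'lock'
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Any number of threads may hold the read lock at the same time.
    Double get(String key) {
        lock.readLock().lock();
        try {
            return values.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: no readers or other writers while it is held.
    void put(String key, double value) {
        lock.writeLock().lock();
        try {
            values.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

This only pays off when reads clearly outnumber writes; for simple cases a synchronized block is easier to get right.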

All objects in Java have monitors associated with them. Synchronization primitives are useful in pretty much all multi-threaded code, and it's semantically very nice to synchronize on the object(s) you are accessing rather than on separate "Monitor" objects.
Java may allocate the Monitors associated with the objects as needed - as .NET does - and in any case the actual overhead for simply allocating (but not using) the lock would be quite small.
In short: it's really convenient to store Objects with their thread safety support bits, and there is very little performance impact.

These methods are around to implement inter-thread communication.
Check this article on the subject.
Rules for those methods, taken from that article:
wait( ) tells the calling thread to give up the monitor and go to sleep until some other thread enters the same monitor and calls notify( ).
notify( ) wakes up the first thread that called wait( ) on the same object.
notifyAll( ) wakes up all the threads that called wait( ) on the same object. The highest priority thread will run first.
Hope this helps...
