In the shared-address-space model there is a common address space shared between processes, represented as data structures in memory (such as a ConcurrentHashMap). This gives the advantage of very fast data sharing, as the shared objects are located on a single computer (let us suppose so for simplicity). Because processes may collide, locking mechanisms (mutexes) are needed to ensure mutual exclusion when accessing shared memory.
This scheme lacks scalability: increasing the number of processors can increase the traffic on shared memory geometrically, and a single computer cannot have more than, say, 8 processors.
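For illustration, here is a minimal sketch of that mutual-exclusion idea in Java: a shared map guarded by an intrinsic lock, so that only one thread mutates it at a time.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedCounterMap {
    private final Map<String, Integer> counts = new HashMap<>();

    // synchronized ensures only one thread mutates the shared map at a time
    public synchronized void increment(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    public synchronized int get(String key) {
        return counts.getOrDefault(key, 0);
    }
}
```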
In the message-passing model there is no notion of a global address space. Each process has its own private local memory, and processes communicate with each other by passing messages. Unlike the shared-address-space model, message passing scales in both processors and memory, although it requires common data to be replicated: an increase in processors will proportionally increase the memory needed for that data. No locking mechanisms are required in this case, though.
Reading "Thinking in Java" for inspiration I find only a talk about the shared-address-space model with synchronization principles. As my problem grows in complexity I'm going to try the message-passing paradigm, which as far as I'm not blind, is not presented in the book.
Could you please recommend native Java classes, or any proven external library, for working with the message-passing model, something like MPI in C++? Any link to such a source would be highly appreciated!
Akka is a commonly used actor framework for the JVM, available for both Java and Scala.
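As a quick taste, a minimal "hello" actor using Akka's classic Java API might look like the sketch below (exact imports and lifecycle calls vary across Akka versions):

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class AkkaExample {
    // An actor owns its state privately and reacts only to messages.
    static class Greeter extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, name ->
                    System.out.println("Hello, " + name))
                .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef greeter = system.actorOf(Props.create(Greeter.class), "greeter");
        greeter.tell("world", ActorRef.noSender()); // asynchronous, no shared state
        system.terminate();
    }
}
```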
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
http://hadoop.apache.org/
I would like to know your thoughts on my design, based on your experience.
I am designing a system with a very critical part:
I have components A, B, and C (on the same JVM) which need to "speak" with each other.
I see two ways of doing so:
the method-call way (each one holds the others' instances: injection, object references, etc.)
the messaging way (topic/queue)
I am aware of the cons of having a middleware messaging system (option 2).
BUT:
I am talking about latency considerations.
I need those messages to reach their targets with low latency (we're talking millisecond latency).
I would like to choose option 2 (the messaging way).
In your experience, how much will it affect my latency? Again, latency is a very big factor in this decision.
(Programming in Java; not sure which app container yet: Spring, JBoss, ...)
thanks,
ray.
Given that the messaging is in-memory, within the same JVM, most latency typically comes from a combination of contention (use of synchronized etc.), scheduling (how threads are woken up to do their jobs, and so forth) and GC. Those sources of latency tend to dwarf everything else.
It is possible to write fairly lightweight messaging systems that do not add much overhead. A good example is Akka, which is increasingly finding its way into low-latency financial systems. It is better known in the Scala realm, but it does have a Java API.
In conclusion, a messaging system can be implemented to meet sub-millisecond demands. However, make sure that it fits your needs first. Just because you can does not mean that you should. If you are working on a small system, then dependency injection/inversion of control may be all you need for a good design. However, if you are looking at messaging as a way to bring multiple CPU cores into the mix, or some such, then I recommend taking a look at Akka, even if only as a case study.
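To make that concrete, here is a minimal sketch of a lightweight in-JVM message channel built on nothing but java.util.concurrent, so the only latency sources left are the contention, scheduling and GC effects mentioned above:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class InJvmMessaging {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue acting as the "topic" between component A and B.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1024);

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String msg = channel.take(); // blocks until a message arrives
                    if (msg.equals("POISON")) break;
                    System.out.println("B received: " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        channel.put("hello from A"); // sub-millisecond handoff in practice
        channel.put("POISON");       // shut the consumer down
        consumer.join();
    }
}
```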
What are the basic principles by which two separate computers on the same network, running the same Java application, maintain the same state by syncing their heaps with each other?
I believe Terracotta does this, but I have no idea what pseudocode describing its core functions would look like.
I'm just looking to understand this technology.
Terracotta DSO works by manipulating the byte code of your classes (and the JDK's classes etc). The instructions on how and when to do this are part of the Terracotta configuration file.
The bytecode modification looks for certain bytecodes, such as a field read or write, or a monitor enter or exit. Whenever those instructions occur, code is added around that location to perform the appropriate action in the distributed store. For example, when a monitor is obtained due to synchronization, a distributed lock is obtained as well (whether it is a read or write lock depends on the configuration). If a field in a shared object is written, the distributed system must verify that a write lock is being held, and then the data value is sent to the clustered server, which stores it on disk or shares it over the network as appropriate.
Note that Terracotta does not share the entire heap, only the graph of objects indicated by the configuration. In general, there would be little point in sharing an entire heap. It is better instead for the application to describe the domain objects needed across the distributed application.
There are many optimizations employed to make the operations above efficient: only field deltas are sent over the wire and in a form much more efficient than Java serialization, many deltas can be bundled and sent in batches, locks are actually "checked out" to a particular client so that if the application data is partitioned across clients, most distributed locks are actually a local operation not involving a network call, etc.
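To give a rough feel for it, the pseudocode below sketches what such instrumentation conceptually injects around a synchronized field write. The DistributedLockManager and ClusterStore classes are illustrative stubs invented for this sketch, not actual Terracotta APIs:

```java
// Illustrative stubs only -- NOT real Terracotta classes.
class DistributedLockManager {
    static void lockWrite(Object o) { /* acquire cluster-wide write lock */ }
    static void unlock(Object o)    { /* release cluster-wide lock */ }
}

class ClusterStore {
    static void recordFieldWrite(Object owner, String field, Object value) {
        /* send only the field delta to the clustered server */
    }
}

class SharedCounter {
    private int value;

    void increment() {
        synchronized (this) {                       // original monitor enter...
            DistributedLockManager.lockWrite(this); // ...injected distributed lock
            try {
                value++;                            // original field write
                ClusterStore.recordFieldWrite(this, "value", value); // injected delta
            } finally {
                DistributedLockManager.unlock(this); // injected on monitor exit
            }
        }
    }
}
```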
Terracotta can indeed handle that if you tell it to - see the description of its DSO - Distributed Shared Objects.
It sounds cool, but I would prefer something like Ehcache (which again can be backed by Terracotta), which operates at a somewhat higher level.
One emerging technology that tackles this problem is Distributed Software Transactional Memory. You get strong data-consistency guarantees (i.e., 1-copy serializability) and a powerful concurrency-control mechanism: transactions.
AFAIK, there is no mature solution out there yet, but it is promising.
I would recommend that you investigate http://www.jboss.org/infinispan and see if it will fulfill your needs.
There are at least three well-known approaches to creating concurrent applications:
Multithreading with memory synchronization through locking (.NET, Java).
Software Transactional Memory, another approach to synchronization.
Asynchronous message passing (Erlang).
I would like to learn whether there are other approaches, and to discuss the various pros and cons of these approaches as applied to large distributed applications. My main focus is on simplifying the programmer's life.
For example, in my opinion, using multiple threads is easy when there are no dependencies between them, which is pretty rare. In all other cases, thread-synchronization code becomes cumbersome and hard to debug and reason about.
I'd strongly recommend looking at this presentation by Rich Hickey. It describes an approach to building high-performance, concurrent applications which, I would argue, is distinct from both lock-based and message-passing designs.
Basically it emphasises:
Lock free, multi-threaded concurrent applications
Immutable persistent data structures
Changes in state handled by Software Transactional Memory
It also talks about how these principles influenced the design of the Clojure language.
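There is no STM in the Java standard library, but a rough Java analogue of one piece of this approach (an atomically swapped reference to an immutable value, much like a Clojure atom) can be sketched with AtomicReference:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class AtomDemo {
    public static void main(String[] args) {
        // State is an immutable list; each "change" produces a new value.
        AtomicReference<List<String>> state = new AtomicReference<>(List.of());

        // Lock-free update: read old value, build a new immutable one, CAS it in.
        state.updateAndGet(old -> {
            List<String> next = new ArrayList<>(old);
            next.add("event-1");
            return List.copyOf(next);
        });

        System.out.println(state.get()); // [event-1]
    }
}
```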
Read Herb Sutter's Effective Concurrency column, and you too will be enlightened.
With the Java 5 concurrency API, doing concurrent programming in Java doesn't have to be cumbersome and difficult as long as you take advantage of the high-level utilities and use them correctly. I found the book, Java Concurrency in Practice by Brian Goetz, to be an excellent read about this subject. At my last job, I used the techniques from this book to make some image processing algorithms scale to multiple CPUs and to pipeline CPU and disk bound tasks. I found it to be a great experience and we got excellent results.
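As a hypothetical example of the kind of high-level utility that book promotes, here is how independent work (such as image tiles) might be fanned out across CPU cores with an ExecutorService:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTiles {
    public static void main(String[] args) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // Independent chunks of work (e.g. image tiles) submitted as Callables.
        List<Callable<Long>> tasks = List.of(
            () -> process(0), () -> process(1), () -> process(2));

        for (Future<Long> f : pool.invokeAll(tasks)) {
            System.out.println("checksum: " + f.get()); // blocks until that task is done
        }
        pool.shutdown();
    }

    // Stand-in for a CPU-bound task such as processing one image tile.
    static long process(int tile) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += (i ^ tile);
        return sum;
    }
}
```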
Or, if you are using C++, you could try OpenMP, which uses #pragma directives to parallelize loops, although I've never used it myself.
In Erlang and OTP in Action, the authors present four process communication paradigms:
Shared memory with locks. A construct (a lock) is used to restrict access to shared resources. Hardware support, in the form of special instructions, is often required from the memory system. Possible drawbacks of this approach include overhead, points of contention in the memory system, and difficulty of debugging, especially with a huge number of processes.
Software Transactional Memory. Memory is treated as a database, where transactions decide what to write and when. The main problems here are the possible contention and the number of failed transaction attempts.
Futures, promises and similar. The basic idea is that a future is the result of a computation that has been outsourced to a different process (potentially on a different CPU or machine) and that can be passed around like any other object. Problems can arise in the case of network failures.
Message passing. Synchronous or asynchronous, in the Erlang style.
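As a concrete example of the third paradigm, Java's CompletableFuture provides futures/promises out of the box; a small sketch:

```java
import java.util.concurrent.CompletableFuture;

public class FutureDemo {
    public static void main(String[] args) {
        // The computation runs on the common fork-join pool; the future
        // is a handle to its eventual result and can be passed around.
        CompletableFuture<Integer> price =
            CompletableFuture.supplyAsync(() -> expensiveLookup("ACME"));

        // Compose further work without blocking the caller.
        price.thenApply(p -> p * 2)
             .thenAccept(doubled -> System.out.println("doubled: " + doubled))
             .join(); // wait here only so the demo finishes cleanly
    }

    static int expensiveLookup(String symbol) {
        return 42; // stand-in for a remote call that might fail
    }
}
```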
AFAIK, memory in Java is heap-based: memory is allotted to objects dynamically, and there is no concept of shared memory.
If there is no concept of shared memory, then communication between Java programs should be time-consuming; in C, by contrast, inter-process communication via shared memory is quicker than other modes of communication.
Correct me if I'm wrong. Also, what is the quickest way for two Java programs to talk to each other?
A few ways:
RAM Drive
Apache APR
OpenHFT Chronicle Core
Details here and here with some performance measurements.
Since there is no official API to create a shared memory segment, you need to resort to a helper library/DLL and JNI to use shared memory to have two Java processes talk to each other.
In practice, this is rarely an issue, since Java supports threads, so you can have two "programs" run in the same Java VM. Those will share the same heap, so communication will be instantaneous. Plus, you can't get errors caused by problems with a shared memory segment.
Java Chronicle is worth looking at; both Chronicle-Queue and Chronicle-Map use shared memory.
These are some tests that I had done a while ago comparing various off-heap and on-heap options.
One thing to look at is using memory-mapped files, using Java NIO's FileChannel class or similar (see the map() method). We've used this very successfully to communicate (in our case one-way) between a Java process and a C native one on the same machine.
I'll admit I'm no filesystem expert (luckily we do have one on staff!), but the performance for us was absolutely blazingly fast: effectively you're treating a section of the page cache as a file and reading and writing to it directly without the overhead of system calls. I'm not sure about the guarantees and coherency; there are methods in Java to force changes to be written to the file, which implies that they are otherwise written to the actual underlying file somewhat lazily (how lazily, I'm not sure), meaning that some proportion of the time it's basically just a shared memory segment.
In theory, as I understand it, memory-mapped files CAN actually be backed by a shared memory segment (they're just file handles, I think) but I'm not aware of a way to do so in Java without JNI.
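For reference, the basic NIO mapping described above looks roughly like this (a sketch; the file name and mapping size are arbitrary):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFileDemo {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("shared.dat"),
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {

            // Map the first 4 KiB of the file into this process's address space.
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            buf.putInt(0, 12345); // write directly into the page cache
            buf.force();          // ask the OS to flush changes to the file
            System.out.println(buf.getInt(0));
        }
    }
}
```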
Shared memory is sometimes quick. Sometimes it's not: it hurts CPU caches, and synchronization is often a pain (and when it relies upon mutexes and such, can be a major performance penalty).
Barrelfish is an operating system that demonstrates that IPC using message passing is actually faster than shared memory as the number of cores increases (on conventional x86 architectures as well as the more exotic NUMA/NUCA machines you'd guess it was targeting).
So your assumption that shared memory is fast needs testing for your particular scenario and on your target hardware. It's not a generically sound assumption these days!
There are a couple of comparable technologies I can think of:
A few years back there was a technology called JavaSpaces, but it never really seemed to take hold, which is a shame if you ask me.
Nowadays there are the distributed cache technologies, things like Coherence and Tangosol.
Unfortunately, neither will have the outright speed of shared memory, but they do deal with the issues of concurrent access, etc.
The easiest way to do that is to have two processes instantiate the same memory-mapped file. In practice they will be sharing the same off-heap memory space. You can grab the address of this memory and use sun.misc.Unsafe to write and read primitives. It supports concurrency through the putXXXVolatile/getXXXVolatile methods. Take a look at CoralQueue, which offers IPC easily, as well as inter-thread communication inside the same JVM.
Disclaimer: I am one of the developers of CoralQueue.
Similar to Peter Lawrey's Java Chronicle, you can try Jocket.
It also uses a MappedByteBuffer, but does not persist any data and is meant to be used as a drop-in replacement for Socket/ServerSocket.
Round-trip latency for a 1 kB ping-pong is around half a microsecond.
MappedBus (http://github.com/caplogic/mappedbus) is a library I've published on GitHub which enables IPC between multiple (more than two) Java processes/JVMs via message passing.
The transport can be either a memory-mapped file or shared memory. To use it with shared memory, simply follow the examples on the GitHub page but point the readers/writers to a file under "/dev/shm/".
It's open source and the implementation is fully explained on the GitHub page.
The information provided by Cowan is correct. However, even shared memory won't always appear identical in multiple threads (and/or processes) at the same time. The key underlying reason is the Java memory model (which is built on the hardware memory model). See "Can multiple threads see writes on a direct mapped ByteBuffer in Java?" for a quite useful discussion of the subject.
I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach hundreds of MB per day, and often the limiting factor in application performance is file I/O. Currently, the application uses plain old java.io.FileOutputStream to write data to disk.
Recently, several developers have asserted that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core operations (i.e. write(byte[])), so the assertion seems tenuous without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory-mapped files always provide faster I/O compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer, accessed from a FileChannel, provide the same functionality as a native memory-mapped file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk I/O in a production application? That is, an application with continuous uptime and minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important: I could partially answer it myself by writing a "toy" application that perf-tests I/O using the various options described above, but by posting to SO I'm hoping for real-world anecdotes/data to chew on.
[EDIT] Clarification: each day of operation, the application creates multiple files ranging in size from 100 MB to 1 GB. In total, the application might write multiple GB of data per day.
Memory-mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (the usual caveat about any reasonable implementation applies).
As with other NIO direct-allocated buffers, mapped buffers are not normal memory and won't be GCed as efficiently. If you create many of them, you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry with long-running processes.
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application-specific, as you need an idea of the expected data-writing patterns. If data consistency is important, there will be trade-offs here.
If you are just writing new data out to disk from your application, memory-mapped I/O probably won't help much. I don't see any reason to invest time in a custom-coded native solution; it just seems like too much complexity for your application, from what you have described so far.
If you are sure you really need better I/O performance - or just output performance in your case - I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often more cost-effective from a business point of view than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over-optimizing software. You will introduce new types of problems into your application. You might run into memory issues/GC thrashing, which would lead to more maintenance and tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with FileOutputStream, with some possibly tuned buffering. After that I'd use the time-honored solution of throwing more hardware at it.
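As a sketch of that "tuned buffering" suggestion (the buffer size and file name are arbitrary choices for illustration):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class TunedWriter {
    public static void main(String[] args) throws Exception {
        // 1 MiB buffer: large sequential writes hit the disk in big chunks.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("app.log"), 1 << 20)) {
            for (int i = 0; i < 1_000_000; i++) {
                out.write(("line " + i + "\n").getBytes());
            }
        }
    }
}
```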
From my experience, memory-mapped files perform MUCH better than plain file access, in both real-time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performance is similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on a memory-mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data; the engine took care of all the mapping stuff behind the curtains. This was both fast and memory-efficient (at least compared with traditional approaches like the ones the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2 GB), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand, and the network layer picked the data directly from these views; as a consequence, memory usage was very low. On the receiver side there was no way to predict the order in which packets would be received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed to the file by the system) and replaced by a new view of the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I've been convinced that even the best fine-crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, it is important to know that memory mapping is a must when dealing with large data: the data is never allocated (and hence never consumes memory) but is dynamically mapped into the address space and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses memory optimally, and commits data whenever it needs to, behind the application's back, without impacting it.
Hope it helps.
As for point 3: if the machine crashes and there are any pages that were not flushed to disk, they are lost. Another thing is the waste of address space: mapping a file into memory consumes address space (and requires a contiguous area), and on 32-bit machines that is a bit limited. But since you've said about 100 MB, it should not be a problem. One more thing: expanding the size of the mmapped file requires some work.
By the way, this SO discussion can also give you some insights.
If you write fewer bytes, it will be faster. What if you filtered it through a GZIPOutputStream, or wrote your data into ZipFiles or JarFiles?
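A small sketch of that suggestion: wrapping the existing FileOutputStream in a GZIPOutputStream so fewer bytes hit the disk (file name is arbitrary):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedWriter {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream("data.log.gz")))) {
            // Text-heavy flat-file records typically compress very well.
            for (int i = 0; i < 100_000; i++) {
                out.write(("record " + i + "\n").getBytes());
            }
        } // close() finishes the gzip stream and flushes to disk
    }
}
```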
As mentioned above, use NIO (a.k.a. new IO). There's also a new, new IO coming out.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That would double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
I did a study comparing write performance to a raw ByteBuffer versus write performance to a MappedByteBuffer. Memory-mapped files are supported by the OS, and their write latencies are very good, as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower, and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue, for ultimate performance very close to a raw ByteBuffer.
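A condensed sketch of that asynchronous-logging idea: callers enqueue messages and return immediately, while a single background thread drains the queue and does the file I/O. (For brevity this uses a blocking queue rather than the lock-free, garbage-free queue from the study.)

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncLogger implements AutoCloseable {
    private static final String POISON = "\u0000SHUTDOWN";
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(65536);
    private final Thread writer;

    public AsyncLogger(String path) throws Exception {
        BufferedWriter out = new BufferedWriter(new FileWriter(path));
        writer = new Thread(() -> {
            try (out) {
                String msg;
                while (!(msg = queue.take()).equals(POISON)) {
                    out.write(msg);   // slow file I/O happens on this thread only
                    out.newLine();
                }
            } catch (Exception ignored) {
            }
        });
        writer.start();
    }

    public void log(String msg) throws InterruptedException {
        queue.put(msg); // the caller returns quickly; I/O happens off-thread
    }

    @Override
    public void close() throws Exception {
        queue.put(POISON); // signal the writer thread to drain and stop
        writer.join();
    }
}
```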