Using files for shared memory IPC - Java

In my application, there is one process which writes data to a file and then, in response to receiving a request, sends (some of) that data over the network to the requesting process. The basis of this question is to see whether we can speed up communication when both processes happen to be on the same host. (In my case, the processes are Java, but I think this discussion can apply more broadly.)
There are a few projects out there which use the MappedByteBuffers returned by Java's FileChannel.map() as a way to have shared memory IPC between JVMs on the same host (see Chronicle Queue, Aeron IPC, etc.).
One approach to speeding up same-host communication would be to have my application use one of those technologies to provide the request-response pathway for same-host communication, either in conjunction with the existing mechanism for writing to the data file, or by providing a unified means of both communication and writing to the file.
Another approach would be to allow the requesting process to have direct access to the data file.
I tend to favor the second approach - assuming it would be correct - as it would be easier to implement, and seems more efficient than copying/transmitting a copy of the data for each request (assuming we didn't replace the existing mechanism for writing to the file).
Essentially, I'd like to understand what exactly occurs when two processes have access to the same file and use it to communicate, specifically in Java (1.8) on Linux (3.10).
From my understanding, it seems like if two processes have the same file open at the same time, the "communication" between them will essentially be via "shared memory".
Note that this question is not concerned with the performance implications of using a MappedByteBuffer or not - it seems highly likely that using mapped buffers, with the reduction in copying and system calls, will reduce overhead compared to reading and writing the file, but that might require significant changes to the application.
Here is my understanding:
When Linux loads a file from disk, it copies the contents of that file to pages in memory. That region of memory is called the page cache. As far as I can tell, it does this regardless of which Java method (FileInputStream.read(), RandomAccessFile.read(), FileChannel.read(), FileChannel.map()) or native method is used to read the file (observed with "free" and by monitoring the "cache" value).
If another process attempts to load the same file (while it is still resident in the cache) the kernel detects this and doesn't need to reload the file. If the page cache gets full, pages will get evicted - dirty ones being written back out to the disk. (Pages also get written back out if there is an explicit flush to disk, and periodically, with a kernel thread).
Having a (large) file already in the cache is a significant performance boost, much more so than the differences based on which Java methods we use to open/read that file.
If a file is loaded using the mmap system call (C) or via FileChannel.map() (Java), essentially the file's pages (in the cache) are loaded directly into the process' address space. Using other methods to open a file, the file is loaded into pages not in the process' address space, and then the various methods to read/write that file copy some bytes from/to those pages into a buffer in the process' address space. There is an obvious performance benefit avoiding that copy, but my question isn't concerned with performance.
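For concreteness, here is a minimal Java sketch of the two access styles described above (the file name is hypothetical, and the mapping size is assumed to fit in an int):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ReadVsMap {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");  // hypothetical file
             FileChannel ch = raf.getChannel()) {

            // read(): the kernel copies bytes from page-cache pages into this
            // buffer, which lives in our process's address space.
            ByteBuffer copy = ByteBuffer.allocate(4096);
            ch.read(copy, 0);

            // map(): the page-cache pages themselves appear in our address
            // space; get() touches them directly, with no intermediate copy.
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte first = mapped.get(0);
        }
    }
}
```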
So, in summary, if I understand correctly - while mapping offers a performance advantage, it doesn't seem to offer any "shared memory" functionality that we don't already get just from the nature of Linux and the page cache.
So, please let me know where my understanding is off.
Thanks.

My question is, on Java (1.8) and Linux (3.10), are MappedByteBuffers really necessary for implementing shared memory IPC, or would any access to a common file provide the same functionality?
It depends on why you want to implement shared memory IPC.
You can clearly implement IPC without shared memory; e.g. over sockets. So, if you are not doing it for performance reasons, it is not necessary to do shared memory IPC at all!
So performance has to be at the root of any discussion.
Access using files via the Java classic io or nio APIs does not provide shared memory functionality or performance.
The main difference between regular file I/O or socket I/O and shared memory IPC is that the former requires the applications to explicitly make read and write syscalls to send and receive messages. This entails extra syscalls, and entails the kernel copying data. Furthermore, if there are multiple threads, you either need a separate "channel" between each thread pair or something to multiplex multiple "conversations" over a shared channel. The latter can lead to the shared channel becoming a concurrency bottleneck.
Note that these overheads are orthogonal to the Linux page cache.
By contrast, with IPC implemented using shared memory, there are no read and write syscalls, and no extra copy step. Each "channel" can simply use a separate area of the mapped buffer. A thread in one process writes data into the shared memory and it is almost immediately visible to the second process.
The caveat is that the processes need to 1) synchronize, and 2) implement memory barriers to ensure that the reader doesn't see stale data. But these can both be implemented without syscalls.
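To make that concrete, here is a minimal, illustrative sketch of such a channel with a sequence-number handshake. The file path and buffer layout are assumptions for illustration only, and note the caveat in the comments: plain MappedByteBuffer puts carry no cross-process memory-ordering guarantee in Java 8, which is why production libraries use Unsafe ordered/volatile writes for the publish step.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedIpc {
    // Hypothetical layout: int sequence at offset 0, int length at 8, payload at 12.
    static MappedByteBuffer map() throws Exception {
        FileChannel ch = FileChannel.open(Paths.get("/tmp/ipc-demo"),  // hypothetical path
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        return ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
    }

    // Writer: payload first, then bump the sequence, so a reader that sees the
    // new sequence also sees the payload. CAVEAT: plain puts give no cross-
    // process ordering guarantee in Java 8; real libraries (Chronicle, Aeron)
    // use Unsafe ordered/volatile writes for this publish step.
    static void publish(MappedByteBuffer buf, byte[] msg) {
        buf.putInt(8, msg.length);
        buf.position(12);
        buf.put(msg);
        buf.putInt(0, buf.getInt(0) + 1);
    }

    // Reader: spin (or sleep) until the sequence changes, then copy the payload out.
    static byte[] poll(MappedByteBuffer buf, int lastSeq) {
        while (buf.getInt(0) == lastSeq) { /* busy-wait */ }
        byte[] msg = new byte[buf.getInt(8)];
        buf.position(12);
        buf.get(msg);
        return msg;
    }
}
```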
In the wash-up, shared memory IPC using memory mapped files >>is<< faster than using conventional files or sockets, and that is why people do it.
You also implicitly asked if shared memory IPC can be implemented without memory mapped files.
A practical way would be to create a memory-mapped file for a file that lives in a memory-only file system; e.g. a "tmpfs" in Linux.
Technically, that is still a memory-mapped file. However, you don't incur overheads of flushing data to disk, and you avoid the potential security concern of private IPC data ending up on disk.
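Assuming a sketch like the one above, the only change needed is the path of the backing file; for example:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TmpfsMap {
    public static void main(String[] args) throws Exception {
        // Identical to an on-disk mapping, except the backing file lives in
        // tmpfs ("/dev/shm" on most Linux distros), so the pages are
        // memory-only and never flushed to a disk.
        try (FileChannel ch = FileChannel.open(Paths.get("/dev/shm/ipc-demo"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put(0, (byte) 1);  // visible to any other process mapping the same file
        }
    }
}
```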
You could in theory implement a shared segment between two processes by doing the following:
In the parent process, use mmap to create a segment with MAP_ANONYMOUS | MAP_SHARED.
Fork child processes. These will end up all sharing the segment with each other and the parent process.
However, implementing that for a Java process would be ... challenging. AFAIK, Java does not support this.
Reference:
What is the purpose of MAP_ANONYMOUS flag in mmap system call?

Essentially, I'm trying to understand what happens when two processes have the same file open at the same time, and whether one could use this to safely and performantly offer communication between two processes.
If you are using regular files using read and write operations (i.e. not memory mapping them) then the two processes do not share any memory.
User-space memory in the Java Buffer objects associated with the file is NOT shared across address spaces.
When a write syscall is made, data is copied from pages in one process's address space to pages in kernel space. (These could be pages in the page cache. That is OS specific.)
When a read syscall is made, data is copied from pages in kernel space to pages in the reading process's address space.
It has to be done that way. If the operating system shared the pages associated with the reader and writer processes' buffers behind their backs, then that would be a security / information-leakage hole:
The reader would be able to see data in the writer's address space that had not yet been written via write(...), and maybe never would be.
The writer would be able to see data that the reader (hypothetically) wrote into its read buffer.
It would not be possible to address the problem by clever use of memory protection, because the granularity of memory protection is a page, whereas the granularity of read(...) and write(...) is as little as a single byte.
Sure: you can safely use reading and writing files to transfer data between two processes. But you would need to define a protocol that allows the reader to know how much data the writer has written. And the reader knowing when the writer has written something could entail polling; e.g. to see if the file has been modified.
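A minimal sketch of such a protocol, assuming an append-only writer so that the file length doubles as the "committed bytes" marker (the file name is hypothetical):

```java
import java.io.RandomAccessFile;

public class TailReader {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("shared.dat", "r")) {  // hypothetical file
            long readSoFar = 0;
            while (true) {
                long committed = raf.length();  // the writer's progress marker
                if (committed > readSoFar) {
                    byte[] chunk = new byte[(int) (committed - readSoFar)];
                    raf.seek(readSoFar);
                    raf.readFully(chunk);
                    readSoFar = committed;
                    // ... process chunk ...
                } else {
                    Thread.sleep(1);  // polling interval
                }
            }
        }
    }
}
```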
If you look at this in terms of just the data copying in the communication "channel":
With memory-mapped files, you copy (serialize) the data from application heap objects to the mapped buffer, and a second time (deserialize) from the mapped buffer to application heap objects.
With ordinary files there are two additional copies: 1) from the writing process's (non-mapped) buffer to kernel-space pages (e.g. in the page cache), 2) from the kernel-space pages to the reading process's (non-mapped) buffer.
The article below explains what is going on with conventional read / write and memory mapping. (It is in the context of copying a file and "zero-copy", but you can ignore that.)
Reference:
Zero Copy I: User-Mode Perspective

Three points are worth mentioning: performance, concurrent access, and memory utilization.
You are correct in the assessment that mmap-based access will usually offer a performance advantage over file-based IO. In particular, the advantage is significant if the code performs lots of small IO at arbitrary points in the file.
Consider changing the N-th byte: with mmap it is just buffer[N] = buffer[N] + 1, while with file-based access you need (at least) four system calls plus error checking (see the sketch after this list):
seek() + error check
read() + error check
update value
seek() + error check
write() + error check
It is true that the number of actual IOs (to the disk) will most likely be the same.
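In Java terms, the contrast might look like this (a sketch; mapped and raf are assumed to be views over the same file):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;

public class NthByte {
    // mmap route: two plain memory accesses, no syscalls on the hot path.
    static void viaMmap(MappedByteBuffer mapped, int n) {
        mapped.put(n, (byte) (mapped.get(n) + 1));
    }

    // classic route: seek + read + seek + write, each crossing into the kernel.
    static void viaFile(RandomAccessFile raf, int n) throws IOException {
        raf.seek(n);
        int v = raf.read();
        raf.seek(n);
        raf.write(v + 1);
    }
}
```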
The second point worth noting is concurrent access. With file-based IO, you have to worry about potential concurrent access. You need to issue explicit locking (before the read) and unlocking (after the write) to prevent two processes from incorrectly accessing the value at the same time. With shared memory, atomic operations can eliminate the need for an additional lock. A sketch of the file-based route follows.
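For illustration, here is a sketch of the file-based route using an advisory FileLock around a read-modify-write (the file name is hypothetical; with shared memory, an atomic compare-and-swap on the mapped address would replace the lock):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LockedUpdate {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("shared.dat"),  // hypothetical file
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Exclusive advisory lock over the 4 bytes holding the counter.
            try (FileLock lock = ch.lock(0, 4, false)) {
                ByteBuffer b = ByteBuffer.allocate(4);
                ch.read(b, 0);
                b.flip();
                int v = b.getInt();
                b.clear();
                b.putInt(v + 1).flip();
                ch.write(b, 0);
            } // lock released here
        }
    }
}
```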
The third point is actual memory usage. For cases where the size of the shared objects is significant, using shared memory can allow a large number of processes to access the data without allocating additional memory. For systems constrained by memory, or systems that need to provide real-time performance, this could be the only way to access the data.

Related

Why java FileOutputStream's write() or flush() doesn't make NFS client really send data to NFS server?

My Java web application uses an NFS file system. I use a FileOutputStream to open the file, write multiple chunks, and then close it.
From the profiler stats I found that stream.write(byte[] payload, int begin, int length) and even stream.flush() take zero milliseconds. Only the call to stream.close() takes non-zero milliseconds.
It seems that Java FileOutputStream's write() or flush() doesn't really cause the NFS client to send data to the NFS server. Is there another Java class that will make the NFS client flush data in real time, or is there some NFS client tuning that needs to be done?
You are probably running into Unix client-side caching. There are lots of details here in the O'Reilly NFS book.
But in short:
Using the buffer cache and allowing async threads to cluster multiple buffers introduces some problems when several machines are reading from and writing to the same file. To prevent file inconsistency with multiple readers and writers of the same file, NFS institutes a flush-on-close policy:
All partially filled NFS data buffers for a file are written to the NFS server when the file is closed.
For NFS Version 3 clients, any writes that were done with the stable flag set to off are forced onto the server's stable storage via the commit operation.
NFS cache consistency uses an approach called close-to-open cache consistency - that is, you have to close the file before your server (and other clients) get a consistent up-to-date view of the file. You are seeing the downsides of this approach, which aims to minimize server hits.
Avoiding the cache is hard from Java. You'd need to set the O_DIRECT flag on the file open() if you're using Linux; see this answer for more: https://stackoverflow.com/a/16259319/5851520. Basically, it disables the client OS's cache for that file, though not the server's.
Unfortunately, the standard JDK doesn't expose O_DIRECT, as discussed here: Force JVM to do all IO without page cache (e.g. O_DIRECT) - essentially, use JNI yourself or use a nice 3rd-party lib. I've heard good things about JNA: https://github.com/java-native-access/jna ...
Alternatively, if you have control over the client mount point, you can use the sync mount option, as per NFS manual. It says:
If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost.
This could be what you're looking for.
Generally, Java's streams make no guarantees about the effects of flush() beyond the flushing of buffers in the Java classes involved.
To overcome that limitation, Java NIO's channels can be used; see e.g. https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#force(boolean). However, if "the file does not reside on a local device then no such guarantee is made." And Java cannot make such a guarantee, because the underlying remote file system or protocol may not be able to provide that function at all. However, you should be able to achieve (almost) the same level of synchronization with force() that you'd get from the native O_DIRECT access that @SusanW mentioned.
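A minimal sketch of the force() approach (the mount path is hypothetical, and per the javadoc the guarantee is best-effort for non-local files):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsForceWrite {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("/mnt/nfs/out.dat"),  // hypothetical mount
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("one chunk".getBytes()));
            ch.force(true);  // ask the OS to push data (and metadata) out now
        }
    }
}
```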

Impact of writing to multiple opened files

I am trying to optimize the logging system of an Android application which causes some unwanted latency. There are multiple files opened which log different parts and should be kept separate.
I am not very familiar with low-level filesystem design, and even less with the current flash and/or SSD memory used in mobile phones (as opposed to traditional HDDs). I assume that memory is organized in disk blocks (512 B, or 4096 B more recently) and that some form of contiguous, linked, or indexed allocation is used.
I am using BufferedOutputStreams with a buffer size of 256 B, but this value was chosen arbitrarily (this provides a good answer for buffer size).
Does writing in append mode to multiple opened files create additional overhead that can significantly decrease performance (from the allocation strategy, for example)? Is it greatly influenced by the buffered output's buffer size (in this particular case of multiple files)?
I am using Android, which tends to have a variety of filesystems, making it hard to understand how each influences appending to multiple opened files. The I/O functions of Java (or any other language) are probably very similar.
My search for this particular issue turned up empty, or maybe I need some domain-specific terms in my search that I am not familiar with.

How do you write zero copy in java? What are the main differences

I was reading about how you can use the java nio library to take advantage of file transfer/buffering at the O/S level which is called 'zero copy'.
What are the differences in how you create/write to files then?
Are there any drawbacks to using zero-copy?
Zero copy means that your program will not transfer data from kernel space to user space and back. This is faster.
A nice article can be found here:
https://developer.ibm.com/articles/j-zerocopy/
Zero copy is a technique where the application is no longer the 'middleman' in transferring data from a disk to the socket. Applications that use zero copy request that the kernel copy the data directly from the disk file to the socket, without going through the application, which improves performance and reduces context switches.
It all depends on what the application will do with the data it reads from disks. If it is a web application serving a lot of static content by reading files and relaying them over sockets, then zero copy is the way to go in order to get better performance. However, if the application is using the data locally (either crunching it in some way and then writing it back, or displaying it locally without persisting it back), you would not use zero copy.
This IBM DeveloperWorks article about zero copy is a good read.
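In Java, the canonical zero-copy call is FileChannel.transferTo(); a minimal sketch (the host, port, and file name are hypothetical):

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("big.dat").getChannel();  // hypothetical file
             SocketChannel sock = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long pos = 0, size = file.size();
            while (pos < size) {
                // transferTo() lets the kernel move page-cache data straight to
                // the socket (sendfile on Linux); it may send fewer bytes than asked.
                pos += file.transferTo(pos, size - pos, sock);
            }
        }
    }
}
```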
Other ways of doing file I/O in Java are via the Stream classes, based on the type of file you want to read/write. This involves both buffered and unbuffered streams, although buffered streams usually promise better performance, since they cause fewer I/O calls and hence fewer context switches.

Any concept of shared memory in Java

AFAIK, memory in Java is based on heap from which the memory is allotted to objects dynamically and there is no concept of shared memory.
If there is no concept of shared memory, then communication between Java programs should be time-consuming. In C, inter-process communication via shared memory is quicker than other modes of communication.
Correct me if I'm wrong. Also, what is the quickest way for two Java programs to talk to each other?
A few ways:
RAM Drive
Apache APR
OpenHFT Chronicle Core
Details here and here with some performance measurements.
Since there is no official API to create a shared memory segment, you need to resort to a helper library/DLL and JNI to use shared memory to have two Java processes talk to each other.
In practice, this is rarely an issue since Java supports threads, so you can have two "programs" run in the same Java VM. Those will share the same heap, so communication will be instantaneous. Plus you can't get errors because of problems with the shared memory segment.
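For example, two "programs" in one JVM can hand messages over a heap-resident queue; a minimal sketch:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SameVmIpc {
    public static void main(String[] args) throws Exception {
        // The queue lives on the shared heap, so handing a message to the
        // other "program" is just passing a reference - no copying, no kernel.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1024);
        Thread producer = new Thread(() -> channel.offer("hello"));
        Thread consumer = new Thread(() -> {
            try {
                System.out.println(channel.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```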
Java Chronicle is worth looking at; both Chronicle-Queue and Chronicle-Map use shared memory.
These are some tests that I had done a while ago comparing various off-heap and on-heap options.
One thing to look at is using memory-mapped files, using Java NIO's FileChannel class or similar (see the map() method). We've used this very successfully to communicate (in our case one-way) between a Java process and a C native one on the same machine.
I'll admit I'm no filesystem expert (luckily we do have one on staff!) but the performance for us is absolutely blazingly fast -- effectively you're treating a section of the page cache as a file and reading + writing to it directly without the overhead of system calls. I'm not sure about the guarantees and coherency -- there are methods in Java to force changes to be written to the file, which implies that they are (sometimes? typically? usually? normally? not sure) written to the actual underlying file (somewhat? very? extremely?) lazily, meaning that some proportion of the time it's basically just a shared memory segment.
In theory, as I understand it, memory-mapped files CAN actually be backed by a shared memory segment (they're just file handles, I think) but I'm not aware of a way to do so in Java without JNI.
Shared memory is sometimes quick. Sometimes it's not - it hurts CPU caches, and synchronization is often a pain (and, should it rely upon mutexes and such, can be a major performance penalty).
Barrelfish is an operating system that demonstrates that IPC using message passing is actually faster than shared memory as the number of cores increases (on conventional X86 architectures as well as the more exotic NUMA NUCA stuff you'd guess it was targeting).
So your assumption that shared memory is fast needs testing for your particular scenario and on your target hardware. It's not a generally sound assumption these days!
There are a couple of comparable technologies I can think of:
A few years back there was a technology called JavaSpaces but that never really seemed to take hold, a shame if you ask me.
Nowadays there are the distributed cache technologies, things like Coherence and Tangosol.
Unfortunately neither will have the outright speed of shared memory, but they do deal with the issues of concurrent access, etc.
The easiest way to do that is to have the two processes instantiate the same memory-mapped file. In practice they will be sharing the same off-heap memory space. You can grab the physical address of this memory and use sun.misc.Unsafe to write/read primitives. It supports concurrency through the putXXXVolatile/getXXXVolatile methods. Take a look at CoralQueue, which offers IPC easily, as well as inter-thread communication inside the same JVM.
Disclaimer: I am one of the developers of CoralQueue.
Similar to Peter Lawrey's Java Chronicle, you can try Jocket.
It also uses a MappedByteBuffer but does not persist any data and is meant to be used as a drop-in replacement to Socket / ServerSocket.
Roundtrip latency for a 1kB ping-pong is around a half-microsecond.
MappedBus (http://github.com/caplogic/mappedbus) is a library I've added on github which enables IPC between multiple (more than two) Java processes/JVMs by message passing.
The transport can be either a memory mapped file or shared memory. To use it with shared memory simply follow the examples on the github page but point the readers/writers to a file under "/dev/shm/".
It's open source and the implementation is fully explained on the github page.
The information provided by Cowan is correct. However, even shared memory won't always appear to be identical in multiple threads (and/or processes) at the same time. The key underlying reason is the Java memory model (which is built on the hardware memory model). See Can multiple threads see writes on a direct mapped ByteBuffer in Java? for a quite useful discussion of the subject.

Performance / stability of a Memory Mapped file - Native or MappedByteBuffer - vs. plain ol' FileOutputStream

I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach 100s MB per day, and often the limiting factor in application performance is file IO. Currently, the application uses a plain ol' java.io.FileOutputStream to write data to disk.
Recently, we've had several developers assert that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core methods (i.e. write(byte[])), so it appears a tenuous assumption without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory-mapped files always provide faster IO compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer accessed from a FileChannel provide the same functionality as a native memory-mapped-file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk IO in a production application? That is, applications that have continuous uptime with minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important - I could answer this question myself partially by writing a "toy" application that perf tests IO using the various options described above, but by posting to SO I'm hoping for real-world anecdotes / data to chew on.
[EDIT] Clarification - each day of operation, the application creates multiple files that range in size from 100MB to 1 gig. In total, the application might be writing out multiple gigs of data per day.
Memory mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (usual caveat about any reasonable implementation).
As with other NIO direct-allocated buffers, the buffers are not normal memory and won't get GCed as efficiently. If you create many of them, you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry with long-running processes.
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application specific as you would need an idea of the expected data writing patterns. If data consistency is important, there will be tradeoffs here.
If you are just writing out new data to disk from your application, memory mapped I/O probably won't help much. I don't see any reason you would want to invest time in some custom coded native solution. It just seems like too much complexity for your application, from what you have provided so far.
If you are sure you really need better I/O performance - or just O performance in your case - I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often more cost-effective, from a business point of view, than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over-optimizing software. You will introduce new types of problems to your application. You might run into memory issues / GC thrashing, which would lead to more maintenance/tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with the FileOutputStream with some possibly tuned buffering. After that I'd use the time honored solution of throwing more hardware at it.
From my experience, memory-mapped files perform MUCH better than plain file access, in both real-time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performance is similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on memory mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data, the engine took care of all the mapping stuff behind the curtains. This was both fast and memory efficient (at least compared with traditional approaches like those the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2 GB), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand, and the network layer directly picked the data from these views; as a consequence the memory usage was very low. On the receiver side, there was no way to predict the order in which packets would be received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed into the file by the system) and replaced by a new view on the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I'm convinced that even the best fine-crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, what is important to know is that memory mapping is a must when dealing with large data, because the data is never allocated (hence never consumes memory) but is dynamically mapped into the address space and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses memory optimally, and commits data whenever it needs to, behind the application's back, without impacting it.
Hope it helps.
As for point 3 - if the machine crashes and there are any pages that were not flushed to disk, then they are lost. Another thing is the waste of address space - mapping a file to memory consumes address space (and requires a contiguous area), and, well, on 32-bit machines it's a bit limited. But since you've said about 100 MB, it should not be a problem. And one more thing - expanding the size of the mmapped file requires some work.
By the way, this SO discussion can also give you some insights.
If you write fewer bytes, it will be faster. What if you filtered it through a GZIPOutputStream, or what if you wrote your data into ZipFiles or JarFiles?
As mentioned above, use NIO (a.k.a. new IO). There's also a new, new IO coming out.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That would double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
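A minimal sketch of the compression idea above (the file name is hypothetical); note that GZIPOutputStream only finishes the gzip stream on close():

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipWriter {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream("data.log.gz"))) {  // hypothetical file
            out.write("one log record\n".getBytes());
        } // close() writes the gzip trailer; until then the file is incomplete
    }
}
```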
I did a study where I compare the write performance to a raw ByteBuffer versus the write performance to a MappedByteBuffer. Memory-mapped files are supported by the OS and their write latencies are very good as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue for ultimate performance very close to a raw ByteBuffer.
