My Java application logs a fair amount of information to a logfile on disk. Some of this logged information is more important than the rest, but in rare cases the less important info is needed to explain to the end user why the code in production made a certain decision.
I was wondering whether it would be a good idea to log the less important information to a socket instead of the file on disk. Is a socket write significantly faster than a disk write?
Update: Basically, I wanted to log to a socket on the same subnet or even on the same machine, assuming that it would be faster than writing to disk. Another process (not part of my application) would then read from that socket at its convenience; I was thinking this would be logstash pulling from a socket. Async logging to disk using another thread is another alternative, but I wanted to consider the socket option first if it is an easy solution with minimal performance hit.
You have a few choices:
local storage is usually faster than network
you could use async logging to disk, so your process fires and forgets (which is fast!)
logstash can read from Unix domain sockets, if you are on *nix; these are usually faster than disk I/O (see the sketch below)
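If you want to try the Unix domain socket route from Java, here is a minimal sketch of the writing side. It requires Java 16+ for java.net.UnixDomainSocketAddress, and the socket path is made up; it would be whatever path your collector (e.g. logstash) is configured to listen on.

```java
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class UnixSocketLogWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical socket path; use whatever your collector listens on.
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of("/tmp/logstash.sock");
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            channel.connect(address);
            ByteBuffer line = ByteBuffer.wrap(
                    "debug: why the decision was taken...\n".getBytes(StandardCharsets.UTF_8));
            while (line.hasRemaining()) {
                channel.write(line);   // lands in a kernel buffer, not on disk
            }
        }
    }
}
```

Note that such a write only lands in a kernel buffer, so a slow reader will eventually make writes block or fail, which is exactly the buffering question discussed next.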
If you are writing somewhere fast and the data is then forwarded in a slower fashion (logstash shipping over the network to some Elastic instance), where does the buffering happen? Such a setup will build up a growing backlog of messages yet to be shipped if logging happens at a high rate for a prolonged period of time.
In the above scenarios buffering will happen (respectively):
direct sync write to disk: the final log file on disk is the buffer
async logging framework: the buffers eat into your heap or process memory (or, if outside the heap or in some kernel area, still into RAM)
unix domain sockets: buffered in the kernel space, so RAM again
With the last two options, things will get increasingly creaky in a sustained high-volume scenario.
Test and profile...
or just log to the local disk and rotate the files, deleting old ones.
A socket is not a destination. It's a transport. Your question "send data to a socket" should therefore be rephrased as "send data to the network", "send data to disk", or "send data to another process".
In all these cases, the socket itself is unlikely to be the bottleneck. The bottleneck will be the network, the disk, or application CPU usage, depending on where the data from the socket actually ends up. At the OS level, sockets are usually implemented as a zero-copy mechanism, which means the data is essentially passed to the other side as a pointer and is therefore highly efficient.
In my application, there is one process which writes data to a file, and then, in response to receiving a request, will send (some of) that data via the network to the requesting process. The basis of this question is to see if we can speed up communication when both processes happen to be on the same host. (In my case, the processes are Java, but I think this discussion can apply more broadly.)
There are a few projects out there which use the MappedByteBuffers returned by Java's FileChannel.map() as a way to have shared memory IPC between JVMs on the same host (see Chronicle Queue, Aeron IPC, etc.).
One approach to speeding up same-host communication would be to have my application use one of those technologies to provide the request-response pathway for same-host communication, either in conjunction with the existing mechanism for writing to the data file, or by providing a unified means of both communication and writing to the file.
Another approach would be to allow the requesting process to have direct access to the data file.
I tend to favor the second approach - assuming it would be correct - as it would be easier to implement, and seems more efficient than copying/transmitting a copy of the data for each request (assuming we didn't replace the existing mechanism for writing to the file).
Essentially, I'd like to understand what exactly occurs when two processes have access to the same file and use it to communicate, specifically on Java (1.8) and Linux (3.10).
From my understanding, it seems like if two processes have the same file open at the same time, the "communication" between them will essentially be via "shared memory".
Note that this question is not concerned with the performance implication of using a MappedByteBuffer or not - it seems highly likely that using mapped buffers, with the reduction in copying and system calls, will reduce overhead compared to reading and writing the file, but that might require significant changes to the application.
Here is my understanding:
When Linux loads a file from disk, it copies the contents of that file to pages in memory. That region of memory is called the page cache. As far as I can tell, it does this regardless of which Java method (FileInputStream.read(), RandomAccessFile.read(), FileChannel.read(), FileChannel.map()) or native method is used to read the file (observed with "free" and by monitoring the "cache" value).
If another process attempts to load the same file (while it is still resident in the cache), the kernel detects this and doesn't need to reload the file. If the page cache gets full, pages will get evicted, dirty ones being written back out to the disk. (Pages also get written back out if there is an explicit flush to disk, and periodically by a kernel thread.)
Having a (large) file already in the cache is a significant performance boost, much more so than the differences based on which Java methods we use to open/read that file.
If a file is loaded using the mmap system call (C) or via FileChannel.map() (Java), essentially the file's pages (in the cache) are loaded directly into the process's address space. Using other methods to open a file, the file is loaded into pages not in the process's address space, and then the various methods to read/write that file copy some bytes from/to those pages into a buffer in the process's address space. There is an obvious performance benefit in avoiding that copy, but my question isn't concerned with performance.
So in summary, if I understand correctly: while mapping offers a performance advantage, it doesn't seem to offer any "shared memory" functionality that we don't already get from the nature of Linux and the page cache.
So, please let me know where my understanding is off.
Thanks.
My question is, on Java (1.8) and Linux (3.10), are MappedByteBuffers really necessary for implementing shared memory IPC, or would any access to a common file provide the same functionality?
It depends on why you want to implement shared memory IPC.
You can clearly implement IPC without shared memory; e.g. over sockets. So, if you are not doing it for performance reasons, it is not necessary to do shared memory IPC at all!
So performance has to be at the root of any discussion.
Accessing files via the classic Java io or nio APIs does not provide shared memory functionality or performance.
The main difference between regular file I/O or socket I/O and shared memory IPC is that the former requires the applications to explicitly make read and write syscalls to send and receive messages. This entails extra syscalls, and it entails the kernel copying data. Furthermore, if there are multiple threads, you either need a separate "channel" between each thread pair or something to multiplex multiple "conversations" over a shared channel. The latter can lead to the shared channel becoming a concurrency bottleneck.
Note that these overheads are orthogonal to the Linux page cache.
By contrast, with IPC implemented using shared memory, there are no read and write syscalls, and no extra copy step. Each "channel" can simply use a separate area of the mapped buffer. A thread in one process writes data into the shared memory and it is almost immediately visible to the second process.
The caveat is that the processes need to 1) synchronize, and 2) implement memory barriers to ensure that the reader doesn't see stale data. But these can both be implemented without syscalls.
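As a rough illustration of the above (not a production pattern), the sketch below is the writer half of such an exchange. The file path and the layout (flag at offset 0, length at 4, payload at 8) are invented for the example; a reader process would map the same file and poll the flag.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Writer half of a minimal shared-memory exchange. A reader process maps the same
// file and polls the flag at offset 0. Layout: flag at 0, length at 4, payload at 8.
public class SharedMemoryWriter {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("/dev/shm/ipc-demo");   // tmpfs keeps the data off the disk entirely
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer shared = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            byte[] payload = "hello from the writer".getBytes();
            shared.putInt(4, payload.length);       // length field
            shared.position(8);
            shared.put(payload);                    // payload
            shared.putInt(0, 1);                    // "ready" flag, written last
            // Caveat from the answer above: plain puts give no cross-process ordering
            // guarantee; a real implementation would use acquire/release semantics
            // (e.g. via VarHandle) or a lock so the reader never sees the flag
            // before the payload.
        }
    }
}
```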
In the wash-up, shared memory IPC using memory mapped files >>is<< faster than using conventional files or sockets, and that is why people do it.
You also implicitly asked if shared memory IPC can be implemented without memory mapped files.
A practical way would be to create a memory-mapped file for a file that lives in a memory-only file system; e.g. a "tmpfs" in Linux.
Technically, that is still a memory-mapped file. However, you don't incur overheads of flushing data to disk, and you avoid the potential security concern of private IPC data ending up on disk.
You could in theory implement a shared segment between two processes by doing the following:
In the parent process, use mmap to create a segment with MAP_ANONYMOUS | MAP_SHARED.
Fork child processes. These will end up all sharing the segment with each other and the parent process.
However, implementing that for a Java process would be ... challenging. AFAIK, Java does not support this.
Reference:
What is the purpose of MAP_ANONYMOUS flag in mmap system call?
Essentially, I'm trying to understand what happens when two processes have the same file open at the same time, and whether one could use this to safely and performantly offer communication between two processes.
If you are accessing regular files using read and write operations (i.e. not memory mapping them), then the two processes do not share any memory.
User-space memory in the Java Buffer objects associated with the file is NOT shared across address spaces.
When a write syscall is made, data is copied from pages in one process's address space to pages in kernel space. (These could be pages in the page cache. That is OS specific.)
When a read syscall is made, data is copied from pages in kernel space to pages in the reading process's address space.
It has to be done that way. If the operating system shared the pages associated with the reader's and writer's buffers behind their backs, then that would be a security / information leakage hole:
The reader would be able to see data in the writer's address space that had not yet been written via write(...), and maybe never would be.
The writer would be able to see data that the reader (hypothetically) wrote into its read buffer.
It would not be possible to address the problem by clever use of memory protection, because the granularity of memory protection is a page, whereas the granularity of read(...) and write(...) can be as little as a single byte.
Sure: you can safely use reading and writing files to transfer data between two processes. But you would need to define a protocol that allows the reader to know how much data the writer has written. And the reader knowing when the writer has written something could entail polling; e.g. to see if the file has been modified.
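For illustration, here is a minimal sketch of such a polling reader, assuming an append-only writer so that the file length alone tells the reader how much data is valid (the path and polling interval are arbitrary):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical reader side of a file-based "channel": the writer only ever appends,
// so the file length alone tells the reader how much data is valid to read.
public class PollingFileReader {
    public static void main(String[] args) throws IOException, InterruptedException {
        long readSoFar = 0;
        try (RandomAccessFile file = new RandomAccessFile("/tmp/shared-channel.dat", "r")) {
            while (true) {
                long available = file.length();
                if (available > readSoFar) {
                    byte[] chunk = new byte[(int) (available - readSoFar)];
                    file.seek(readSoFar);
                    file.readFully(chunk);
                    readSoFar = available;
                    System.out.write(chunk);    // hand the new bytes to the application
                    System.out.flush();
                }
                Thread.sleep(50);               // polling interval: latency vs CPU trade-off
            }
        }
    }
}
```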
If you look at this just in terms of the data copying in the communication "channel":
With memory-mapped files, you copy (serialize) the data from application heap objects to the mapped buffer, and a second time (deserialize) from the mapped buffer back to application heap objects.
With ordinary files there are two additional copies: 1) from the writing process's (non-mapped) buffer to kernel-space pages (e.g. in the page cache), and 2) from the kernel-space pages to the reading process's (non-mapped) buffer.
The article below explains what is going on with conventional read / write and memory mapping. (It is in the context of copying a file and "zero-copy", but you can ignore that.)
Reference:
Zero Copy I: User-Mode Perspective
Three points are worth mentioning: performance, concurrent changes, and memory utilization.
You are correct in your assessment that mmap-based access will usually offer a performance advantage over file-based IO. In particular, the performance advantage is significant if the code performs a lot of small IO at arbitrary points in the file.
Consider changing the N-th byte: with mmap it is buffer[N] = buffer[N] + 1, while with file-based access you need (at least) 4 system calls plus error checking:
seek() + error check
read() + error check
update value
seek() + error check
write() + error check
It is true that the number of actual IO operations (to the disk) will most likely be the same.
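To make the comparison concrete in Java terms, here is a rough sketch of both styles of updating the N-th byte (the file name and offsets are arbitrary; a MappedByteBuffer stands in for the buffer[N] example above):

```java
import java.io.EOFException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class NthByteUpdate {
    // mmap style: once the file is mapped, the update is an in-memory operation.
    static void incrementMapped(MappedByteBuffer buf, int n) {
        buf.put(n, (byte) (buf.get(n) + 1));
    }

    // classic file I/O style: seek + read + seek + write, each one a syscall to error-check.
    static void incrementClassic(RandomAccessFile raf, long n) throws Exception {
        raf.seek(n);
        int value = raf.read();
        if (value < 0) {
            throw new EOFException("offset past end of file");
        }
        raf.seek(n);
        raf.write(value + 1);
    }

    public static void main(String[] args) throws Exception {
        // Assumes data.bin already exists and is at least a few hundred bytes long.
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, raf.length());
            incrementMapped(buf, 42);
            incrementClassic(raf, 43);
        }
    }
}
```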
The second point worth noting is concurrent access. With file-based IO, you have to worry about potential concurrent access. You will need explicit locking (before the read) and unlocking (after the write) to prevent two processes from incorrectly accessing the value at the same time. With shared memory, atomic operations can eliminate the need for an additional lock.
The third point is actual memory usage. For cases where the size of the shared objects is significant, using shared memory allows a large number of processes to access the data without allocating additional memory. For systems constrained by memory, or systems that need to provide real-time performance, this may be the only practical way to access the data.
My Java web application uses an NFS file system. I use FileOutputStream to open a file, write multiple chunks, and then close it.
From the profiler stats I found that stream.write(byte[] payload, int begin, int length) and even stream.flush() take zero milliseconds. Only the call to stream.close() takes non-zero milliseconds.
It seems that Java FileOutputStream's write() or flush() doesn't really cause the NFS client to send data to the NFS server. Is there any other Java class that will make the NFS client flush data in real time, or is there some NFS client tuning that needs to be done?
You are probably running into Unix client-side caching. There are lots of details here in the O'Reilly NFS book.
But in short:
Using the buffer cache and allowing async threads to cluster multiple buffers introduces some problems when several machines are reading from and writing to the same file. To prevent file inconsistency with multiple readers and writers of the same file, NFS institutes a flush-on-close policy:
All partially filled NFS data buffers for a file are written to the NFS server when the file is closed.
For NFS Version 3 clients, any writes that were done with the stable flag set to off are forced onto the server's stable storage via the commit operation.
NFS cache consistency uses an approach called close-to-open cache consistency - that is, you have to close the file before your server (and other clients) get a consistent up-to-date view of the file. You are seeing the downsides of this approach, which aims to minimize server hits.
Avoiding the cache is hard from Java. You'd need to set the O_DIRECT flag on the file's open() if you're using Linux; see this answer for more: https://stackoverflow.com/a/16259319/5851520. Basically, it disables the client's OS cache for that file, though not the server's.
Unfortunately, the standard JDK doesn't expose O_DIRECT, as discussed here: Force JVM to do all IO without page cache (e.g. O_DIRECT). Essentially, use JNI yourself or use a nice 3rd-party lib. I've heard good things about JNA: https://github.com/java-native-access/jna ...
Alternatively, if you have control over the client mount point, you can use the sync mount option, as per the NFS manual. It says:
If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost.
This could be what you're looking for.
Generally, Java's streams make no guarantee about the effects of flush() apart from flushing the buffers in the Java classes involved.
To overcome that limitation, Java NIO's channels can be used; see e.g. https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#force(boolean). However, if "the file does not reside on a local device then no such guarantee is made", and Java cannot make such a guarantee, because the underlying remote file system or protocol may not be able to provide that function at all. Still, you should be able to achieve (almost) the same level of synchronization with force() as with the native O_DIRECT access that @SusanW mentioned.
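As a rough sketch of what that looks like (the mount path is hypothetical, and as noted, on NFS force() can only be as strong as the underlying file system allows):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForcedNfsWrite {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("/mnt/nfs-share/data.log");    // hypothetical NFS-mounted path
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer chunk = ByteBuffer.wrap("one chunk of data\n".getBytes(StandardCharsets.UTF_8));
            while (chunk.hasRemaining()) {
                channel.write(chunk);
            }
            // Ask for the content (and metadata, since the argument is true) to be pushed
            // to the storage device; on a remote file system this is only best-effort.
            channel.force(true);
        }
    }
}
```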
I am interested in implementing the following simple flow:
A client sends a simple message to a server process, which the server stores. Since the message does not have any hierarchical structure, IMO the best approach is to save it in a file instead of an RDB.
But I want to figure out how to optimize this, since as I see it there are 2 choices:
1. The server sends a 200 OK to the client and then stores the message, so the client does not notice any delay.
2. The server saves the message and then sends the 200 OK, but then the client notices the overhead of the file I/O.
I prefer the performance of (1), but this could lead to the client thinking all went OK when actually the message was never saved (for various error cases).
So I was thinking if I could use NIO and memory-mapped files.
But I was wondering: is this a good candidate for using memory-mapped files? Would using a memory-mapped file guarantee that, e.g., if the process crashed, the message would be saved?
In my mind the flow would involve creating/opening and closing many files, so is this a good candidate for memory-mapping files?
The server saves the message and then sends the 200 OK, but then the client notices the overhead of the file I/O.
I suggest you test this. I doubt a human will notice a 10 millisecond delay, and I expect you should do better than that for smaller messages.
So I was thinking if I could use NIO and memory-mapped files.
I use memory mapping as it can reduce the overhead per write by up to 5 microseconds. Is this important to you? If not, I would stick with the simplest approach.
Would using a memory-mapped file guarantee that, e.g., if the process crashed, the message would be saved?
As long as the OS doesn't crash, yes.
In my mind the flow would involve creating/opening and closing many files, so is this a good candidate for memory-mapping files?
Opening and closing files is likely to be far more expensive than writing the data (by an order of magnitude). I would suggest keeping such operations to a minimum.
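A minimal sketch of that idea: open one append-mode channel up front and reuse it for every message, forcing each write before acknowledging (the class and path names are invented for the example):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical store: the channel is opened once and reused; each message is
// appended and forced before the 200 OK goes back to the client.
public class MessageStore implements AutoCloseable {
    private final FileChannel channel;

    public MessageStore(Path path) throws IOException {
        channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void store(String message) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((message + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(false);   // survives a process crash; an OS crash can still lose data
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```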
You might find this library of mine interesting: https://github.com/peter-lawrey/Java-Chronicle. It allows you to persist messages on the order of single-digit microseconds for text and sub-microsecond for a small binary message.
Recently I was playing with Java sockets and NIO for writing a server. However, it is still not really clear to me why Java NIO would be superior to standard sockets. When writing a server using either of these technologies, in most cases it comes down to having a dispatcher thread that accepts connections and passes them on to worker threads.
I have read that in a threaded model we need a dedicated thread per connection, but we can still create a thread pool of a fixed size and reuse threads to handle different connections (so that the cost of creating and tearing down threads is reduced).
But with Java NIO it looks similar. We have one thread that accepts requests and some worker thread(s) processing data when it is received.
An example I found where Java NIO would be better is a server that maintains many non-busy connections, like a chat client or an HTTP server, but I can't really understand why.
There are several distinct reasons.
Using multiplexed I/O with a Selector can save you a lot of threads, which saves you a lot of thread stacks, which saves you a lot of memory. On the other hand, it moves scheduling from the operating system into your program, so it can cost you a bit of CPU, and it will also cost you a lot of programming complication. Given that select() was designed when the alternative was more processes, not more threads, it is in fact debatable whether the extra complication is really worth it, as opposed to using threads and spending the programming money saved on more memory.
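For reference, here is a minimal sketch of that multiplexed model: a single thread owns a Selector and services every connection from it (the port and echo behaviour are just for illustration):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// One thread, one Selector, many connections: accepts new clients and echoes
// back whatever they send, all without a thread per connection.
public class SelectorEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));          // port chosen for illustration
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                              // block until something is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {        // peer closed the connection
                        client.close();
                    } else {
                        buffer.flip();
                        client.write(buffer);               // echo back what was read
                    }
                }
            }
        }
    }
}
```

The trade-off described above is visible here: one thread and one stack, but the dispatch logic the OS scheduler would otherwise do for you is now your code.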
MappedByteBuffers are a slightly faster way of reading files than either java.io or using java.nio.channels with ByteBuffers.
If you are just copying from one channel to another, using 'direct' buffers saves you from having to copy data from the native JNI space into the JVM space and back again; or using the FileChannel.transferTo() method can save you from copying data from kernel space into user space.
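A short sketch of the transferTo() approach, which asks the kernel to move the bytes between the two channels without staging them in a user-space buffer (paths are supplied by the caller):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Copies one file to another with transferTo(), letting the kernel move the bytes
// without staging them in a user-space buffer.
public class ChannelCopy {
    public static void copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = in.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done
                position += in.transferTo(position, size - position, out);
            }
        }
    }
}
```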
Even though NIO supports the dispatcher model, NIO sockets are blocking by default, and when you use them as such they can be faster than either plain IO or non-blocking NIO for a small number (< 100) of connections. I also find blocking NIO simpler to work with than non-blocking NIO.
I use non-blocking NIO when I want to use busy waiting. This allows me to have a thread which never gives up the CPU, but this is only useful in rare cases, i.e. where latency is critical.
From my benchmarks, the real strength (besides the threading model) is that it consumes less memory bandwidth (kernel <=> Java). E.g. if you open several UDP NIO multicast channels and have high traffic, you will notice that beyond a certain number of processes, each new process lowers the throughput of all running UDP receivers. With the traditional socket API I can start 3 receiving processes at full throughput. If I start a 4th, I hit a limit and the received data/second drops on all the running processes. With NIO I can start about 6 processes before this effect kicks in.
I think this is mostly because NIO kind of directly bridges to native or kernel memory, while the old socket copies buffers to the VM process space.
This is important in grid computing and high-load server apps (10 GBit networks or InfiniBand).
My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk. My quick fix is just to reduce logging, which is easy enough to do with respect to my QA requirements. Of course, this isn't vertically scalable which will be a problem soon enough.
I thought about just increasing the thread count but I'm guessing the bottleneck is on file contention and this could be rather bad if it's the wrong thing to do.
I have a lot of ideas but really no idea which are fruitful.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct? How can I determine this? Could decreasing the thread count help?
How do I profile the right # of threads to be writing to disk? Is this a function of number of write requests, number of bytes written per second, number of bytes per write op, what else?
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
So my question is: how to profile to determine the right number of threads that can safely write to a common file? What variables determine this - number of write operations, number of bytes written per second, number of bytes per write request, any OS or hard disk information.
Also is there any way to make the log file more liberal to be written to? We timestamp everything so I'm okay with a minority of out-of-order lines if it reduces blocking.
My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk.
Typically in these situations, I schedule a thread just for logging. Most logging classes (such as PrintStream) are synchronized and write/flush each line of output. By moving to a central logging thread and using some sort of BlockingQueue to queue up log messages to be written, you can make use of a BufferedWriter or some such to limit the individual IO requests. The default buffer size is 8k but you should increase that size. You'll need to make sure that you properly close the stream when your application shuts down.
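A minimal sketch of that pattern, with invented class names: worker threads enqueue lines and a single logging thread drains the queue through one large BufferedWriter.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Worker threads call log(); a single background thread drains the queue through
// one large BufferedWriter, so the disk sees few, large writes instead of many small ones.
public class AsyncFileLogger implements AutoCloseable {
    private static final String POISON = new String("poison"); // unique sentinel, compared by reference
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writerThread;

    public AsyncFileLogger(String path) {
        writerThread = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter(path, true), 64 * 1024)) {
                while (true) {
                    String line = queue.take();
                    if (line == POISON) {
                        break;                // try-with-resources flushes and closes the file
                    }
                    out.write(line);
                    out.newLine();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, "async-logger");
        writerThread.start();
    }

    public void log(String line) {
        queue.offer(line);
    }

    @Override
    public void close() throws InterruptedException {
        queue.offer(POISON);
        writerThread.join();
    }
}
```

The buffering is the point: lines leave the process in 64k chunks rather than as one IO request per line.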
With a buffered writer, you could additionally write through a GZIPOutputStream which would significantly lower your IO requirements if your log messages repeat a lot.
That said, if you are outputting too much debugging information, you may be SOL and need to either decrease your logging bandwidth or increase your disk IO speed. Once you've optimized your application, the next step is moving to an SSD on your log server to handle the load. You could also try distributing the log messages to multiple servers to be persisted, but a local SSD would most likely be faster.
To simulate the benefits of an SSD, a local RAM disk should give you a pretty good idea of the increased IO bandwidth.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct?
If all your threads are blocked in IO, then yes, increasing the thread count will not help.
How do I profile the right # of threads to be writing to disk?
Tough question. You are going to have to do some test runs. See the throughput of your application with 10 threads, with 20, etc. You are trying to maximize the overall number of transactions processed in a given time. Make sure your test runs execute for a couple of minutes for best results. But it is important to realize that a single thread can easily swamp a disk or network IO stream if it is spewing too much output.
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
No. See my buffered writer thread above. This is not about file locking, which (I assume) is not happening; it is about the number of IO requests per second.
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
If you are IO bound, then the IO is slowing you down, so yes, it is "overworked". Moving to an SSD or RAM disk is an easy test to see if your application runs faster.