Direct mapped buffers always stay outside the JVM heap. Channels, on the other hand, also seem to take part in the I/O operation. I'm just wondering whether the corresponding channel for a direct mapped buffer also stays outside the JVM heap?
My other question is about whether a Channel is necessary at all, in terms of memory operation efficiency. I understand that
A channel represents an open connection to an entity such as a hardware device
Yet, when writing to a file through a direct mapped buffer, is the content written twice? The content is first written to the buffer and then to the channel. Would this be less efficient than writing "directly" to the I/O device?
I'm just wondering whether the corresponding channel for a direct mapped buffer also stays outside the JVM heap?
The question doesn't make sense. A Channel isn't a piece of memory; it is an interface to an operating system file descriptor (FD).
when writing to a file through a direct mapped buffer, is the content written twice? The content is first written to the buffer and then to the channel. Would this be less efficient than writing "directly" to the I/O device?
No. The MappedByteBuffer is independent of the channel it came from. For example, it isn't closed when the channel is closed.
Are you perhaps looking for direct byte buffers? They do exist, and you write to them via channels, but I/O via them happens once, not twice.
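To illustrate the point that a MappedByteBuffer is independent of the channel it came from, here is a minimal sketch. The class and method names (MappedBufferDemo, writeThroughMapping) are made up for this example. It maps a file, closes the channel, and only then writes through the mapping; the data is written once, into the mapped region.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedBufferDemo {

    /** Maps the file, closes the channel, then writes through the mapping (hypothetical helper). */
    static String writeThroughMapping(Path file, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        MappedByteBuffer mapped;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping in READ_WRITE mode extends the file to the requested size if needed.
            mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
        } // the channel is closed here; the mapping remains valid

        mapped.put(bytes); // a plain memory write into the mapped region
        mapped.force();    // ask the OS to write the changed pages back to the file

        // Read the file back through an unrelated API to check what ended up in it.
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-demo", ".bin");
        System.out.println(writeThroughMapping(file, "hello"));
    }
}
```

Note that some platforms (Windows in particular) may refuse to delete a file while a mapping of it is still alive, which is why the sketch leaves the temporary file in place.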
I think I figured out the answer.
Writing through a direct mapped buffer first writes to the buffer, which is not an I/O operation. The real I/O that updates the file happens afterwards. The reason for the two writes is that, since we have DMA, we don't want to perform the I/O piece by piece. Doing all the I/O at once is more efficient.
The starting address obtained through JNI is not the correct address for accessing that piece of memory. Instead, the address of the data is held in a hidden field of java.nio.Buffer. One way to get at this field is through sun.misc.Unsafe, by making the field accessible.
How does a buffer actually optimize the process of reading/writing?
Every time we read a byte, we access the file. I read that a buffer reduces the number of accesses to the file. The question is how? In the Buffered section of the picture, when we load bytes from the file into the buffer, we access the file just like in the Unbuffered section of the picture, so where is the optimization?
I mean ... the buffer must access the file every time it reads a byte, so even if the data in the buffer is read faster, this will not improve the performance of reading. What am I missing?
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state drives, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system adds buffering of its own, or the hardware truly transfers single bytes, performing 100 operating system calls that each transfer a single byte is still significantly slower than performing a single operating system call that transfers 100 bytes at once.
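The difference in the number of calls can be made visible with a small sketch. It assumes an in-memory array as a stand-in for the file, and CountingInputStream is a helper invented for this example; each counted call stands for what would be an operating system call on a real FileInputStream. It counts calls and does not measure time.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {

    /** Hypothetical helper: counts how often the wrapped source is asked for data. */
    static final class CountingInputStream extends FilterInputStream {
        int calls;

        CountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Consumes the whole source one byte at a time; returns the number of calls that reached the source. */
    static int callsWhenReadingByteByByte(int size, boolean buffered) throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        while (in.read() != -1) {
            // the program still consumes single bytes in both cases
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered: " + callsWhenReadingByteByByte(10_000, false));
        System.out.println("buffered:   " + callsWhenReadingByteByByte(10_000, true));
    }
}
```

In the unbuffered case every single-byte read reaches the source, while the buffered variant only goes to the source when its internal array runs empty.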
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
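As a rough sketch of what "application managed buffers" means in practice, the following reads a file through a channel in chunks. The class and method names (ChannelReadDemo, countBytes) and the 8192-byte buffer size are arbitrary choices for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

final class ChannelReadDemo {

    /** Reads the file chunk by chunk into a buffer owned by the caller and counts the bytes. */
    static long countBytes(Path file) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(8192); // the application owns and reuses this buffer
        long total = 0;
        try (ReadableByteChannel channel = Files.newByteChannel(file)) {
            while (channel.read(buffer) != -1) { // each call may fill up to 8192 bytes
                buffer.flip();                   // switch from filling to draining
                total += buffer.remaining();     // "process" the chunk; here we only count it
                buffer.clear();                  // make the buffer ready for the next read
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("channel-demo", ".bin");
        Files.write(file, new byte[20_000]);
        System.out.println(countBytes(file));
        Files.deleteIfExists(file);
    }
}
```

There is no single-byte read method on the channel at all; the chunk size is decided by the buffer the program passes in.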
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.
As your picture shows, with buffering the file contents are held in memory, whereas without buffering the file is only read as the program consumes the stream.
A File is only a representation of a path. Here is what the File Javadoc says:
An abstract representation of file and directory pathnames.
Meanwhile, a buffer such as ByteBuffer takes content from the file and holds it in memory, on the heap or outside it depending on the buffer type (direct or non-direct). From the ByteBuffer Javadoc:
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
It actually depends on the situation: if the file is accessed repeatedly, buffered access is faster than unbuffered. But if the file is larger than main memory and is accessed only once, unbuffered access seems to be the better solution.
Basically, for reading: if you request 1 byte, the buffer will read, say, 1000 bytes and return you the first one. For the next 999 single-byte reads it will not read anything from the file but will use its internal buffer in RAM. Only after you have read all 1000 bytes will it read another 1000 bytes from the actual file.
The same goes for writing, but in reverse. If you write 1 byte, it will be buffered, and only once you have written 1000 bytes may they actually be written to the file.
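The write side can be sketched the same way. CountingOutputStream is a made-up sink that stands in for the file and counts how many write calls actually reach it; the buffer size of 1000 mirrors the number used above and is not a recommendation.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class WriteCallCounter {

    /** Hypothetical sink standing in for a file: counts write calls and bytes received. */
    static final class CountingOutputStream extends OutputStream {
        int calls;
        long bytes;

        @Override public void write(int b) {
            calls++;
            bytes++;
        }

        @Override public void write(byte[] b, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /** Writes n single bytes, optionally through a buffer, and returns the sink for inspection. */
    static CountingOutputStream writeSingleBytes(int n, int bufferSize) throws IOException {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream out = bufferSize > 0 ? new BufferedOutputStream(sink, bufferSize) : sink;
        for (int i = 0; i < n; i++) {
            out.write(i); // the program still writes one byte at a time
        }
        out.flush();      // push whatever is still sitting in the buffer to the sink
        return sink;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + writeSingleBytes(2500, 0).calls);
        System.out.println("buffered calls:   " + writeSingleBytes(2500, 1000).calls);
    }
}
```

Both variants deliver the same bytes to the sink; only the number of calls that reach it differs. The final flush() matters: without it, the last partial buffer would stay in memory.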
Note that choosing the buffer size changes the performance quite a bit, see e.g. https://stackoverflow.com/a/237495/2442804 for further details, respecting file system block size, available RAM, etc.
According to the Stack Overflow question, an InputStream can be read multiple times with the mark() and reset() methods provided by InputStream, or by using PushbackInputStream.
In all these cases the content of the stream is stored in a byte array (i.e., the original content of the file is stored in main memory) and reused multiple times.
What happens when the size of the file exceeds the memory size? I think this may lead to an OutOfMemoryError.
Is there any better way to read the stream content multiple times without storing the stream content locally (i.e., in main memory)?
Please help me understand this. Thanks in advance.
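For reference, this is roughly what the mark()/reset() approach looks like. It is a minimal sketch with invented names (MarkResetDemo, readTwice), and it uses InputStream.readNBytes(int), which assumes Java 11 or later. The readlimit passed to mark() bounds how many bytes have to be kept in memory for the reset to work.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class MarkResetDemo {

    /** Reads the first n bytes twice by marking, reading, and resetting. */
    static String readTwice(InputStream raw, int n) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(n);                        // remember this position; keep up to n bytes in memory
        byte[] first = in.readNBytes(n);   // Java 11+
        in.reset();                        // jump back to the marked position
        byte[] second = in.readNBytes(n);
        return new String(first, StandardCharsets.US_ASCII)
                + "|" + new String(second, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        InputStream source =
                new ByteArrayInputStream("abcdef".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readTwice(source, 3));
    }
}
```

Re-reading a whole stream this way requires a readlimit as large as the stream, which is exactly the memory problem raised here.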
It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
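A minimal sketch of the "store it in a file" option, assuming the stream fits on local disk. The names SpoolDemo, spoolToTempFile, and checksum are invented for this example. The stream is copied once to a temporary file, which can then be re-opened and re-read as often as needed while only a small chunk is held in memory at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.CRC32;

final class SpoolDemo {

    /** Copies a one-shot stream to a temporary file so it can be re-read later. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** One example "pass" over the data: re-opens the file and reads it in 8 KiB chunks. */
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                crc.update(chunk, 0, n);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        Path spooled = spoolToTempFile(new ByteArrayInputStream(new byte[100_000]));
        System.out.println(checksum(spooled) == checksum(spooled)); // two independent passes
        Files.deleteIfExists(spooled);
    }
}
```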
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of the stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
Consider a scenario where a browser is uploading a large file, but the server is busy and not able to read that stream for some time. Where is that data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell a sender that is sending data too fast to wait: flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.
I'm currently using Java sockets in a client-server application with OutputStream and not BufferedOutputStream (and the same for input streams).
The client and server exchange serialized objects (writeObject() method).
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
And when do I have to flush, or should I not call flush() at all?
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
Actually, it probably doesn't make sense [1].
The object stream implementation internally wraps the stream it has been given with a private class called BlockDataOutputStream that does buffering. If you wrap the stream yourself, you will have two levels of buffering ... which is likely to make performance worse [2].
And when do I have to flush, or should I not call flush() at all?
Yes, flushing is probably necessary. But there is no universal answer as to when to do it.
On the one hand, if you flush too often, you generate extra network traffic.
On the other hand, if you don't flush when it is needed, the server can stall waiting for an object that the client has written but not flushed.
You need to find the compromise between these two syndromes ... and that depends on your application's client/server interaction patterns; e.g. whether the message patterns are synchronous (e.g. message/response) or asynchronous (e.g. message streaming).
[1] To be certain on this, you would need to do some forensic testing to 1) measure the system performance, and 2) determine what syscalls are made and when network packets are sent. For a general answer, you would need to repeat this for a number of use-cases. I'd also recommend looking at the Java library code yourself to confirm my (brief) reading.
[2] Probably only a little bit worse, but a well designed benchmark would pick up a small performance difference.
UPDATE
After writing the above, I found this Q&A - Performance issue using Javas Object streams with Sockets - which seems to suggest that using BufferedInputStream / BufferedOutputStream helps. However, I'm not certain whether the performance improvement that was reported is 1) real (i.e. not a warmup artefact) and 2) due to the buffering. It could be just due to adding the flush() call. (Why: because the flush could cause the network stack to push the data sooner.)
I think these links might help you:
What is the purpose of flush() in Java streams?
The flush method flushes the output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
How java.io.Buffer* stream differs from normal streams?
Internally a buffer array is used and instead of reading bytes individually from the underlying input stream enough bytes are read to fill the buffer. This generally results in faster performance as less reads are required on the underlying input stream.
http://www.oracle.com/technetwork/articles/javase/perftuning-137844.html
As a means of starting the discussion, here are some basic rules on how to speed up I/O: 1. Avoid accessing the disk. 2. Avoid accessing the underlying operating system. 3. Avoid method calls. 4. Avoid processing bytes and characters individually.
So using buffered streams usually speeds up the I/O process, as fewer read() calls are made in the background.
Which of the two would be the best choice and in which circumstance?
Clearly there is no sense in using a file channel for a very small file. Besides that, what are the pros and cons of the two means of input/output?
Thanks a lot in advance.
FileChannel has many features missing in java.io: it is interruptible, it can move its position within the file, it can lock a file, etc. It can also be faster than old IO, especially when it uses direct byte buffers. Here is an explanation from the ByteBuffer API:
A byte buffer is either direct or non-direct. Given a direct byte buffer, the Java virtual machine will make a best effort to perform native I/O operations directly upon it. That is, it will attempt to avoid copying the buffer's content to (or from) an intermediate buffer before (or after) each invocation of one of the underlying operating system's native I/O operations.
If you need none of the above features, go with streams; you'll get shorter code.
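Two of the features mentioned above, repositioning within the file and I/O on direct byte buffers, can be sketched as follows. The class and method names (FileChannelDemo, writeThenReadAt) are made up for illustration, and the file is assumed to exist already.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FileChannelDemo {

    /** Writes text at the start of the file, then reads `length` bytes back from `position`. */
    static String writeThenReadAt(Path file, String text, long position, int length)
            throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            ByteBuffer out = ByteBuffer.allocateDirect(bytes.length); // direct buffer, outside the heap
            out.put(bytes);
            out.flip();
            while (out.hasRemaining()) {
                channel.write(out);          // a write may be partial, so loop until drained
            }

            channel.position(position);      // move within the file; plain streams cannot do this
            ByteBuffer in = ByteBuffer.allocateDirect(length);
            while (in.hasRemaining() && channel.read(in) != -1) {
                // keep reading until the buffer is full or the end of the file is reached
            }
            in.flip();
            byte[] result = new byte[in.remaining()];
            in.get(result);
            return new String(result, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("channel-features", ".txt");
        System.out.println(writeThenReadAt(file, "hello world", 6, 5));
        Files.deleteIfExists(file);
    }
}
```

Whether the direct buffer is actually faster than a heap buffer here depends on the platform and the transfer size; the sketch only shows the API shape.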
Is it more efficient to flush the OutputStream after each individual invocation of ObjectOutputStream#writeObject rather than flushing the stream after a sequence of object writes? (Example: write object and flush 4 times, or write 4 times and then just flush once?)
How does ObjectOutputStream work internally?
Is it somehow better to send four Object[5] (flushing each one) than one Object[20], for example?
It is not better. In fact it is probably worse, from a performance perspective. Each of those flushes will force the OS-level TCP/IP stack to send the data "right now". If you just do one flush at the end, you should save on system calls, and on network traffic.
If you haven't done this already, inserting a BufferedOutputStream between the Socket OutputStream and the ObjectOutputStream will make a much bigger difference to performance. This allows the serialized data to accumulate in memory before being written to the socket stream. This potentially saves many system calls and could improve performance by orders of magnitude ... depending on the actual objects being sent.
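One way to check this claim for a given setup is to count the write calls that reach the underlying stream. The sketch below is an assumption-laden stand-in: CountingOutputStream is an invented in-memory sink used in place of the socket's OutputStream, so it counts calls rather than measuring real network behaviour or speed.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

final class ObjectStreamWriteCounter {

    /** Hypothetical stand-in for the socket stream: counts the write calls it receives. */
    static final class CountingOutputStream extends OutputStream {
        int calls;

        @Override public void write(int b) {
            calls++;
        }

        @Override public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    /** Writes `objects` small objects, flushes once, and returns the number of calls that hit the sink. */
    static int underlyingWrites(boolean buffered, int objects) throws IOException {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream target = buffered ? new BufferedOutputStream(sink, 8192) : sink;
        ObjectOutputStream oos = new ObjectOutputStream(target);
        for (int i = 0; i < objects; i++) {
            oos.writeObject(Integer.valueOf(i));
        }
        oos.flush(); // a single flush after the whole sequence
        return sink.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without BufferedOutputStream: " + underlyingWrites(false, 4));
        System.out.println("with BufferedOutputStream:    " + underlyingWrites(true, 4));
    }
}
```

On a real socket each counted call would correspond to at least one write on the socket stream, so a lower count suggests fewer system calls; an actual benchmark is still needed to confirm a speed difference.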
(The representation of four Object[5] objects is larger than one Object[20] object, and that results in a performance hit in the first case. However, this is marginal at most, and tiny compared with the flushing and buffering issues.)
How does this stream work internally?
That is too general a question to answer sensibly. I suggest that you read up on serialization starting with the documents on this page.
No, it shouldn't matter, unless you have reason to believe the net link is likely to go down, and partial data is useful. Otherwise it just sounds like a way to make the code more complex for no reason.
If you look at the one and only public constructor of ObjectOutputStream, you note that it requires an underlying OutputStream for its instantiation.
When and how you flush your ObjectOutputStream is entirely dependent on the type of stream you are using. (And in considering all this, do keep in mind that not all extensions of OutputStream are guaranteed to respect your request to flush; it is entirely implementation dependent, as spelled out in the 'contract' in the Javadocs.)
But certainly we can reason about it and even pull up the code and see what is actually done.
IFF the underlying OutputStream must utilize the OS services for devices (such as the disk, or the network interface in the case of sockets), then the behavior of flush() is entirely OS dependent. For example, you may grab the output stream of a socket and then instantiate an ObjectOutputStream to write serialized objects to the net. The TCP/IP implementation of the host OS is then in charge.
What is more efficient?
Well, if your object stream is wrapping a ByteArrayOutputStream, you are potentially looking at a series of reallocations and System.arraycopy() calls. I say potentially, since the ByteArrayOutputStream implementation doubles the size of its byte array on each (internal) resize, and it is very unlikely that writing n (small) objects and flushing each time will result in n reallocations. (Where n is assumed to be a reasonably small number.)
But if you are wrapping a network stream, you must keep in mind that network writes are very expensive. It makes much more sense, if your protocol allows it, to chunk your writes (to fill the send buffer) and just flush once.