Memory mapped file java NIO

Memory mapped file java NIO - java

I understand how to create a memory mapped file, but my question is let's say that in the following line:
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
Where i set SIZE to be 2MB for example, does this means that it will only load 2MB of the file or will it read further in the file and update the buffer as i consume bytes from it?

Where i set SIZE to be 2MB for example, does this means that it will only load 2MB of the file or will it read further in the file and update the buffer as i consume bytes from it?
It will only load the portion of the file specified in your buffer initialization. If you want it to read further you'll need to have some sort of read loop. While I would not go as far as saying this is tricky, if one isn't 100% familiar with the java.io and java.nio APIs involved then the chances of stuffing it up are high. (E.g.: not flipping the buffer; buffer/file edge case mistakes).
If you are looking for an easy approach to accessing this file in a ByteBuffer, consider using a MappedByteBuffer.
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
The nice thing about a using an MBB in this context is that it won't necessarily actually load the entire buffer into memory, but rather only the parts you are accessing.

The size of the buffer is the size you pass in. It will not grow or shrink.
The javadoc says:
Maps a region of this channel's file directly into memory.
...
size - The size of the region to be mapped; must be non-negative and no greater than Integer.MAX_VALUE
EDIT:
Depending on what you mean by "updated with new data", the answer is yes.
The view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program. The view provided by an instance of this class may or may not, however, be consistent with the views seen by other concurrently-running programs due to caching performed by the underlying operating system and delays induced by network-filesystem protocols. This is true regardless of the language in which these other programs are written, and whether they are running on the same machine or on some other machine. The exact nature of any such inconsistencies are system-dependent and are therefore unspecified.
So, other systems may do caching, but when those caches are flushed or otherwise up-to-date, they will agree with the view presented by the FileChannel.
You can also use explicit calls to the position method and other methods to change what is presented by the view.
Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa. Changing the file's length via the file channel will change the length seen via the originating object, and vice versa. Changing the file's content by writing bytes will change the content seen by the originating object, and vice versa.

Related

Buffered vs Unbuffered. How actually buffer work?

How actually a buffer optimize the process of reading/writing?
Every time when we read a byte we access the file. I read that a buffer reduces the number of accesses the file. The question is how?. In the Buffered section of picture, when we load bytes from the file to the buffer we access the file just like in Unbuffered section of picture so where is the optimization?
I mean ... the buffer must access the file every time when reads a byte so
even if the data in the buffer is read faster this will not improve performance in the process of reading. What am I missing?

The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one bye of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even if the underlying system injects a buffering on its own or when the hardware truly transfers single bytes, performing 100 operating system calls to transfer a single byte on each still is significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
So we could say, buffering is not improving the performance, it is the natural way of transferring and processing data. Rather, not buffering is degrading the performance by requiring a translation from the natural bulk data operations to single item operations.

As stated in your picture, buffered file contents are saved in memory and unbuffered file is not read directly unless it is streamed to program.
File is only representation on path only. Here is from File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, buffered stream like ByteBuffer takes content (depends on buffer type, direct or indirect) from file and allocate it into memory as heap.
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
Actually depends on the condition, if the file is accessed repeatedly, then buffered is a faster solution rather than unbuffered. But if the file is larger than main memory and it is accessed once, unbuffered seems to be better solution.

Basically for reading if you request 1 byte the buffer will read 1000 bytes and return you the first byte, for next 999 reads for 1 byte it will not read anything from the file but use its internal buffer in RAM. Only after you read all the 1000 bytes it will actually read another 1000 bytes from the actual file.
Same thing for writing but in reverse. If you write 1 byte it will be buffered and only if you have written 1000 bytes they may be written to the file.
Note that choosing the buffer size changes the performance quite a bit, see e.g. https://stackoverflow.com/a/237495/2442804 for further details, respecting file system block size, available RAM, etc.

FileChannel read behaviour [duplicate]

For example I have a file whose content is：
abcdefg
then i use the following code to read 'defg'.
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Because there's adequate data in the file so can I suppose so? Can I assume that the method returns a number less than limit of the given buffer only when there aren't enough bytes in the file?

Can I assume that the method returns a number less than limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But, I can also imagine situations where returning a half empty buffer might make sense. For example, when reading from a locally-mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.

No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies for FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode, the only guarantee is that at least one byte will be read.

What is the insertion order if I have two memory-mapped buffers mapped to the same file?

My question is whether the OS will respect the insertion order (i.e. last written, last to disk) or the order will be unpredictable. For example:
byte[] s1 = "Testing1!".getBytes();
byte[] s2 = "Testing2!".getBytes();
byte[] s3 = "Testing3!".getBytes();
RandomAccessFile raf = new RandomAccessFile("test.txt", "rw");
FileChannel fc = raf.getChannel();
MappedByteBuffer mbb1 = fc.map(MapMode.READ_WRITE, 0, 1024 * 1024);
mbb1.put(s1);
MappedByteBuffer mbb2 = fc.map(MapMode.READ_WRITE, mbb1.position(), 1024 * 1024);
mbb2.put(s2);
MappedByteBuffer mbb3 = fc.map(MapMode.READ_WRITE, mbb1.position() + mbb2.position(), 1024 * 1024);
mbb3.put(s3);
mbb1.put(s1); // overwrite mbb2
mbb1.put(s1); // overwrite mbb3
mbb1.force(); // go to file
mbb3.force(); // can this ever overwrite mbb1 in the file?
mbb2.force(); // can this ever overwrite mbb1 in the file?
Is it always last written, last in or am I missing something here?

I haven't tested any of this, so I don't know.
But, frankly, there's no guarantee on any of this ordering.
You have the mbb.force() method, but that's not the only way to write to the device, rather it just ensures that it has been written.
The VM can flush the page back to the device whenever it feels like it, using whatever schedule it deems fit, which is, naturally, extremely platform dependent (the behavior on Linux may be different than the behavior on Windows, it may even vary from Linux to Linux or Windows to Windows).
Seems to be that you should be coordinating internally to ensure that you only have one read/write buffer mapped to a specific area of a file, and manage the conflicts and overlap that way rather than relying on the operating system.
Edit: "changes done by multiple memory-mapped buffers are guaranteed to be consistent"
Simply, this means that the underlying VM, once a physical page is mapped in to a process, that mapping is shared across all of the assorted mappings performed. The thread issues is simply due to CPU memory cacheing and other issues.
So, this guarantees that all of the mappings will see the same data within an overlapping buffer. But it does not address when the buffers will actually get written to the device. Those points are still germane.
Overall it sounds like you won't have an issue if you handle any multithreading aspects correctly, and be aware that what you see in your underlying buffer may "change beneath your feet".

Java compress/decompress large files (>1gb)

I have made an application in android that lets the user compress and decompress files and I used the package java.util.zip. Everything is okay. the speed, files are totally compressed and decompressed together with the directories. The only problem is that the application is not able to compress/decompress large files (greater than 1gb).
I believe the problem is the size of my buffer. Other codes that I've seen, the value of their buffer is 1024 or 2048 or 8192 but my value of my buffer is base on the size of the chosen file (just to make it flexible). But once the user chose a large file (with a size of >8 digits), that's were the error comes out. I searched over the net and also here in this site but I can't find an answer. my problem is similar to this:
To Compress a big file in a ZIP with Java
Thanks for the future help! :)
EDIT:
Thanks for the comments and answers. It really helped a lot. I thought BUFFER in compressing/decompressing in java means the size of file so in my program, I made the buffer size flexible (buffer size = file size). Will someone please explain how buffer works so I can understand why is it okay that BUFFER has a fixed value. Also for me to figure it out why others people is telling that it is much better if the buffer size is 8k or else. Thanks a lot! :)

If you size the buffer to the size of the file, then it means that you will have OutOfMemoryError whenever the file size is too big for memory available.
Use a normal buffer size and let it do it's work - buffering the data in a streaming fashion, one chunk at a time, rather than all in one go.
For explanation, see for example the documentation of BufferedOutputStream:
The class implements a buffered output stream. By setting up such an
output stream, an application can write bytes to the underlying output
stream without necessarily causing a call to the underlying system for
each byte written.
So using a buffer is more efficient than non-buffered writing.
And from the write method:
Ordinarily this method stores bytes from the given array into this
stream's buffer, flushing the buffer to the underlying output stream
as needed. If the requested length is at least as large as this
stream's buffer, however, then this method will flush the buffer and
write the bytes directly to the underlying output stream.
Each write causes the in-memory buffer to fill up, until the buffer is full. When the buffer is full, it is flushed and cleared. If you use a very large buffer, you will cause a large amount of data to be stored in memory before flushing. If your buffer is the same size as the input file, then you are saying you need to read the whole content into memory before flushing it. Using the default buffer size is usually just fine. There will be more physical writes (flushes); you avoid exploding memory.
By allowing you to specify a specific buffer size, the API is letting you choose the right balance between memory consumption and i/o to suit your application. If you tune your application for performance, you might end up tweaking buffer size. But the default size will be reasonable for many situations.

It sounds like it would help to simply set a maximum size for the buffer, something like:
//After calculating the buffer size bufSize:
bufSize = Math.min(bufSize, MAXSIZE);

Would FileChannel.read read less bytes than specified if there's enough data?

For example I have a file whose content is：
abcdefg
then i use the following code to read 'defg'.
ByteBuffer bb = ByteBuffer.allocate(4);
int read = channel.read(bb, 3);
assert(read == 4);
Because there's adequate data in the file so can I suppose so? Can I assume that the method returns a number less than limit of the given buffer only when there aren't enough bytes in the file?

Can I assume that the method returns a number less than limit of the given buffer only when there aren't enough bytes in the file?
The Javadoc says:
a read might not fill the buffer
and gives some examples, and
returns the number of bytes read, possibly zero, or -1 if the channel has reached end-of-stream.
This is NOT sufficient to allow you to make that assumption.
In practice, you are likely to always get a full buffer when reading from a file, modulo the end of file scenario. And that makes sense from an OS implementation perspective, given the overheads of making a system call.
But, I can also imagine situations where returning a half empty buffer might make sense. For example, when reading from a locally-mounted remote file system over a slow network link, there is some advantage in returning a partially filled buffer so that the application can start processing the data. Some future OS may implement the read system call to do that in this scenario. If assume that you will always get a full buffer, you may get a surprise when your application is run on the (hypothetical) new platform.
Another issue is that there are some kinds of stream where you will definitely get partially filled buffers. Socket streams, pipes and console streams are obvious examples. If you code your application assuming file stream behavior, you could get a nasty surprise when someone runs it against another kind of stream ... and fails.

No, in general you cannot assume that the number of bytes read will be equal to the number of bytes requested, even if there are bytes left to be read in the file.
If you are reading from a local file, chances are that the number of bytes requested will actually be read, but this is by no means guaranteed (and won't likely be the case if you're reading a file over the network).
See the documentation for the ReadableByteChannel.read(ByteBuffer) method (which applies for FileChannel.read(ByteBuffer) as well). Assuming that the channel is in blocking mode, the only guarantee is that at least one byte will be read.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.