I use MappedByteBuffers to achieve thread safety between readers and writers of a file via volatile variables: the writer updates its position and readers read the writer's position. (This is a file upload system and the incoming file is a stream, if that matters.) There are more tricks, obviously (sparse files, power-of-two mapping growth), but it all boils down to that.
I can't find a faster way to write to a file while concurrently reading it without caching the whole file in memory (which I cannot do due to its sheer size).
Is there any other method of I/O that guarantees, within the same process, that written bytes are visible to readers? MappedByteBuffer makes its guarantees indirectly, via the Java Memory Model, and I'd expect any other solution to do the same (read: not platform-specific, and more).
Is this the fastest way? Am I missing something in the docs?
I did some tests quite a few years ago on what was then decent hardware, and MappedByteBuffer was about 20% faster than any other I/O technique. It does have the disadvantage for writing that you need to know the file size in advance.
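A minimal sketch of the scheme the question describes, assuming a single writer thread and a mapping of fixed capacity. The class name PublishedMapping and its methods are made up for illustration. The visibility argument rests on the volatile field, as the question says, and not on any documented MappedByteBuffer guarantee.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: one writer appends, any number of readers follow.
final class PublishedMapping {
    private final MappedByteBuffer map;
    // Number of bytes the writer has finished writing. The volatile write
    // is what publishes the earlier puts to threads that later read it.
    private volatile int written;

    PublishedMapping(Path file, int capacity) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            map = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    // Single writer thread only. Throws IndexOutOfBoundsException past capacity.
    void append(byte[] chunk) {
        int start = written;
        for (int i = 0; i < chunk.length; i++) {
            map.put(start + i, chunk[i]);    // absolute put: no shared position
        }
        written = start + chunk.length;      // publish only after the bytes are in place
    }

    // Any reader thread. Returns the number of bytes copied into dest (0 if none yet).
    int read(int from, byte[] dest) {
        int limit = written;                 // volatile read comes first
        int n = Math.max(0, Math.min(dest.length, limit - from));
        for (int i = 0; i < n; i++) {
            dest[i] = map.get(from + i);     // absolute get
        }
        return n;
    }
}
```

Readers keep their own offset and call read(offset, dest) in a loop. Only absolute get/put are used, so the buffer's single position is never touched by more than one thread.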
Which of the two would be the best choice and in which circumstance?
Clearly there is no sense in using a file channel for a very small file. Besides that, what are the pros and cons of the two input/output mechanisms?
Thanks a lot in advance.
FileChannel has many features missing in java.io: it is interruptible, it can move its position within the file, it can lock a file, etc. It can also be faster than the old I/O, especially when it uses direct byte buffers. Here is an explanation from the ByteBuffer API:
A byte buffer is either direct or non-direct. Given a direct byte buffer, the Java virtual machine will make a best effort to perform native I/O operations directly upon it. That is, it will attempt to avoid copying the buffer's content to (or from) an intermediate buffer before (or after) each invocation of one of the underlying operating system's native I/O operations.
If you need none of the above features, go with streams; you'll get shorter code.
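As a rough illustration of the FileChannel plus direct buffer combination mentioned above, here is a sketch of a file copy through one reusable direct ByteBuffer. The class name ChannelCopy and the 64 KiB buffer size are arbitrary choices for the example, not recommendations from the answer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelCopy {
    // Copies src to dst through a single direct buffer; returns bytes copied.
    static long copy(Path src, Path dst) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        long total = 0;
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (in.read(buf) != -1) {
                buf.flip();                    // switch from filling to draining
                while (buf.hasRemaining()) {
                    total += out.write(buf);   // a write may be partial, so loop
                }
                buf.clear();                   // ready to be filled again
            }
        }
        return total;
    }
}
```

For a plain sequential copy like this, a stream-based version is shorter, which is the trade-off the answer points at.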
My application requires concurrent access to a data file using memory mapping. My goal is to make it scalable in a shared memory system. After studying the source code of the memory-mapped file library implementation, I still cannot figure out:
Is it legal to read from a MappedByteBuffer in multiple threads? Does one get block another get at the OS (*nix) level?
If a thread puts into a MappedByteBuffer, is the content immediately visible to another thread calling get?
Thank you.
To clarify a point: The threads are using a single instance of MappedByteBuffer, not multiple instances.
Buffers are not thread safe and their access should be controlled by appropriate synchronisation; see the Thread Safety section in http://docs.oracle.com/javase/6/docs/api/java/nio/Buffer.html . ByteBuffer is a subclass of the Buffer class and therefore has the same thread safety issue.
Trying to make the use of memory-mapped files scalable in a shared memory system looks highly suspicious to me. Memory-mapped files are used for performance. Once you step into shared systems, performance should get a low priority. Not that you should aim for a slow system, but you will have so many other problems that simply making it work should be your first (and only?) priority at the beginning. I won't be surprised if in the end you need to replace concurrent access to a data file via memory mapping with something else.
For some ideas, such as the use of an Exchanger, see "Can multiple threads see writes on a direct mapped ByteBuffer in Java?" and "Options to make Java's ByteBuffer thread safe".
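One common way to sidestep the shared position for read-only access is to give each thread its own view via duplicate(), which shares the content but has an independent position and limit. The sketch below assumes nobody writes to the mapping while the readers run. SharedMapReaders and parallelSum are illustrative names only.

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedMapReaders {
    // Sums all bytes (as unsigned values) of the file using several reader threads.
    static long parallelSum(Path file, int threads) throws Exception {
        MappedByteBuffer map;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
        int size = map.capacity();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                int from = (int) ((long) size * t / threads);
                int to = (int) ((long) size * (t + 1) / threads);
                // Views are created on this thread, before the workers start,
                // so the original buffer's position is never touched concurrently.
                ByteBuffer view = map.duplicate();
                view.limit(to);
                view.position(from);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    while (view.hasRemaining()) {
                        sum += view.get() & 0xFF;   // relative get on a private view
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

This only addresses concurrent reads. Making writes from one thread visible to another still needs some form of synchronisation, as the answer above says.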
I don't understand what the Buffer classes are for. Aren't they for buffering? I would expect a single buffer object to allow reading and writing simultaneously and independently. It does not: a buffer has only one position, shared by reads and writes. This means that if I write something into the buffer with a relative put(), I can't read anything sensible with a relative get(). And if I interleave put() and get() calls, I get nonsense.
So are there any usage patterns (samples) for buffers that make it evident how they are better than conventional arrays?
ByteBuffers are used for reading and writing data; you can get/put many primitive types and control the endianness. They can wrap direct (off-heap) memory and memory-mapped files (also off-heap).
They can be used for performance, as they can access a long or double natively without assembling bytes together. Direct byte buffers can read/write data without an additional copy into "Java" memory. Memory-mapped files can be extended to the size of your disk space, allowing you to use lots of memory without impacting your GC times.
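To make the usage pattern concrete, here is a small sketch of the two points above: primitive access with a chosen byte order, and the write, flip(), read cycle that the single position is designed around. BufferBasics and its method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BufferBasics {
    // Encodes an int with the given byte order and returns the 4 raw bytes.
    static byte[] encode(int value, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(order);
        buf.putInt(value);
        return buf.array();
    }

    // Write-then-read on one buffer: flip() is the switch between the two modes.
    static double writeThenRead(long id, double price) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putLong(id).putDouble(price);   // relative puts: position is now 16
        buf.flip();                         // limit = 16, position = 0
        long readId = buf.getLong();        // relative gets walk the same bytes
        double readPrice = buf.getDouble();
        buf.clear();                        // position = 0, limit = capacity: writable again
        return readId == id ? readPrice : Double.NaN;
    }
}
```

For example, encode(0x01020304, ByteOrder.BIG_ENDIAN) yields the bytes 1, 2, 3, 4, and the little-endian variant yields 4, 3, 2, 1.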
I have a program that generates a lot of data and puts it in a queue to write, but the problem is that it's generating data faster than I'm currently writing it (causing it to max out memory and start to slow down). Order does not matter as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process (but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file, or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? I'm not exactly sure of the best approach and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) ) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does character-to-byte encoding under the hood, and it does it in the same thread that handles the writing. That may mean that time spent encoding delays writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For Latin text and UTF-8 encoding, this will usually be true.
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster, either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend: SQS vs EBS vs local disk, etc.), or by ganging several IO subsystems together in parallel somehow.
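A sketch of that change, under a few assumptions that are not in the question: the queue is a bounded BlockingQueue (so producers block when the writer falls behind, which also caps memory use), UTF-8 is the target encoding, and a sentinel array named POISON marks the end of the data. All names here are hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;

final class EncodedQueueWriter {
    // Hypothetical end-of-data marker; compared by identity and never written.
    static final byte[] POISON = new byte[0];

    // Producer side: the encoding cost is paid on the producing thread.
    static void offerLine(BlockingQueue<byte[]> queue, String line) throws InterruptedException {
        queue.put((line + "\n").getBytes(StandardCharsets.UTF_8));  // blocks if the queue is full
    }

    // Consumer side: the single writer thread only moves bytes.
    static long drainTo(BlockingQueue<byte[]> queue, Path file)
            throws IOException, InterruptedException {
        long bytes = 0;
        try (OutputStream out = new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND),
                1 << 16)) {
            for (byte[] chunk = queue.take(); chunk != POISON; chunk = queue.take()) {
                out.write(chunk);
                bytes += chunk.length;
            }
        }
        return bytes;
    }
}
```

Producers would call offerLine from their own threads, and whoever knows that production has finished puts POISON on the queue so that drainTo returns.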
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another approach is:
Do you empty your queue? Does solutions.poll() reduce your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing with a 10 MB buffer and see if that helps.
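For reference, the buffer size is the second constructor argument of BufferedWriter, and it is counted in chars, not bytes. A small sketch, with BigBufferWriter and writeAll as made-up names and UTF-8 assumed as the encoding:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BigBufferWriter {
    // Writes each line followed by the platform line separator.
    static void writeAll(Path file, Iterable<String> lines, int bufferChars) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8),
                bufferChars)) {                 // e.g. 10 * 1024 * 1024 chars
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        }
    }
}
```

Whether a buffer that large helps depends on the workload, so it is worth measuring against the default.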
Reading Thinking in Java, 4th ed., I've got some doubts about I/O performance:
I've read that it's better to "wrap" InputStream objects in a BufferedInputStream, but I can't see what difference it makes. Aren't file operations, for example, already buffered? What are the advantages of buffered file writes?
The system's IO buffering is on a different level than the Buffered*putStream.
Each call on FileOutputStream.write(...) induces a native method call (which is typically more costly than a java-internal call), and then a context switch to the OS' kernel to do the actual writing. Even if the kernel (or the file system driver or the harddisk controller or the harddisk itself) is doing more buffering, these costs will occur.
By wrapping a BufferedOutputStream around this, we call the native write method much less often, thus allowing much higher throughput.
(The same is valid for other types of IO, of course, I just used FileOutputStream as an example.)
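The effect can be made visible without touching a disk by counting how often the underlying stream is invoked. In this sketch WriteCallCounter stands in for the expensive native layer; it is a made-up class for illustration, not a measurement of real file I/O.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Stand-in for the costly layer: it only counts how often it is called.
final class WriteCallCounter extends OutputStream {
    int calls = 0;

    @Override
    public void write(int b) {
        calls++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        calls++;
    }

    // Writes n single bytes straight to the counter.
    static int unbuffered(int n) {
        WriteCallCounter sink = new WriteCallCounter();
        for (int i = 0; i < n; i++) {
            sink.write(i);
        }
        return sink.calls;
    }

    // Writes n single bytes through a BufferedOutputStream of the given size.
    static int buffered(int n, int bufferSize) throws IOException {
        WriteCallCounter sink = new WriteCallCounter();
        try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int i = 0; i < n; i++) {
                out.write(i);
            }
        }
        return sink.calls;
    }
}
```

With 10,000 single-byte writes, unbuffered(10000) reaches the sink 10,000 times, while buffered(10000, 8192) reaches it only a handful of times, once per filled buffer plus the final flush on close.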
Aren't file operations, for example, already buffered?
Maybe, maybe not, depending on the OS, the HD used, the way of access (e.g. reading big consecutive blocks vs randomly accessing small blocks all over the place), etc. In the worst case, adding a BufferedInputStream probably won't harm performance noticeably. In the best case, it can improve it by orders of magnitude (replacing many little file accesses by one big read/write).
An InputStream will only fetch as much data as you ask for, so if you read 1000 characters one character at a time, that will turn out to be 1000 separate disk accesses, which will become pretty slow.
A BufferedInputStream, however, will request data from the InputStream in larger chunks, thus reducing the need for separate disk accesses.
The same goes for output: instead of writing every character separately, there are fewer physical disk writes with a BufferedOutputStream.