I need to perform a simple grep and other manipulations on large files in Java. I am not that familiar with the Java NIO utilities, but I am assuming that is what I need to use. What resources or helpful tips do you have for reading/writing large files? Also, I am working on an SWT application and need to display parts of that data within a text area in the GUI.
java.io.RandomAccessFile uses a long for its file-pointer offset, so it should be able to cope. However, you should read a chunk at a time, otherwise the overheads will be high. FileInputStream works similarly.
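For instance, a minimal sketch of reading one chunk at an offset; the file name, offset, and contents here are made up for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkRead {
    public static void main(String[] args) throws IOException {
        // Create a small sample file for the demonstration.
        Path path = Files.createTempFile("chunk-demo", ".txt");
        Files.write(path, "hello, large file world".getBytes(StandardCharsets.UTF_8));

        byte[] chunk = new byte[5];
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            raf.seek(7);                // long offset: works past the 2 GB mark
            int read = raf.read(chunk); // read a whole chunk, not byte-by-byte
            System.out.println(new String(chunk, 0, read, StandardCharsets.UTF_8));
        }
        Files.delete(path);
    }
}
```

The same seek-then-read pattern works at any offset a long can express, which is what makes it usable on multi-gigabyte files.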
Java NIO shouldn't be too difficult. You don't need to mess around with Selectors or similar. In fact, prior to JDK7 you can't select with files. However, avoid mapping files. There is no unmap, so if you try to do it lots you'll run out of address space on 32-bit systems, or run into other problems (NIO does attempt to call GC, but it's a bit of a hack).
If all you are doing is reading the entire file a chunk at a time, with no special processing, then nio and java.io.RandomAccessFile are probably overkill. Just read and process the content of the file a block at a time. Ensure that you use a BufferedInputStream or BufferedReader.
If you have to read the entire file to do what you are doing, and you read only one file at a time, then you will gain little benefit from nio.
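For example, a minimal grep-style scan with a buffered reader; the file contents and search pattern are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SimpleGrep {
    // Scan a file line by line, counting matching lines (a minimal grep).
    static long countMatches(Path file, String needle) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return r.lines().filter(line -> line.contains(needle)).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("grep-demo", ".log");
        Files.write(path, List.of("INFO start", "ERROR disk full", "INFO done"));
        System.out.println(countMatches(path, "ERROR"));
        Files.delete(path);
    }
}
```

Files.newBufferedReader gives you the buffering for free, so the disk is hit in large blocks even though the code processes one line at a time.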
Maybe a little bit off topic: have a look at Apache VFS. It was originally meant to be a library that hides the ftp/http/file/whatever system behind a file-system facade from your application's point of view. I mention it here because I have positive experience with accessing large files (via FTP) for searching, reading, copying, etc. (large in this context meaning > 15 MB) with this library.
Related
I'm testing data structure performance with very large data.
As a temporary workaround (see here) I want to write memory to disk.
I want to test with very big datasets - how can I make it so that when the java VM runs out of memory it writes some of it to disk?
Since we're talking about temporary fixes here, you could always increase your page file (swap file on most Linux distros) if you need a little extra space.
Here's a link from Microsoft:
http://windows.microsoft.com/en-us/windows-vista/change-the-size-of-virtual-memory
Linux:
http://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/
Now let me say that this isn't a good long-term fix, but I understand that sometimes developers just need to make it work. If this is something that will ever see a production environment, you may want to look at a tool like Hadoop. It allows you to distribute your data processing over multiple JVMs, and it was built for exactly the kind of "big data" application you're describing.
Maybe you can use a stream, perhaps a buffered one; I think that would be the best choice for testing such a structure. If you can read from disk using a stream that creates no additional objects (only those that are necessary), then all of the JVM's memory remains available for your structure. But maybe you can describe your problem in more detail?
I was reading about how you can use the java nio library to take advantage of file transfer/buffering at the O/S level which is called 'zero copy'.
What are the differences in how you create/write to files then?
Are there any drawbacks to using zero-copy?
Zero copy means that your program will not transfer the data from kernel space to user space and back again, which is faster.
A nice article can be found here:
https://developer.ibm.com/articles/j-zerocopy/
Zero copy is a technique where the application is no longer the 'middleman' in transferring data from a disk to the socket. Applications that use zero copy request that the kernel copy the data directly from the disk file to the socket, without going through the application, which improves performance and reduces context switches.
It all depends on what the application will do with the data it reads from disks. If it is a web application serving a lot of static content by reading files and relaying them over sockets, then zero copy is the way to go in order to get better performance. However, if the application is using the data locally (either crunching it in some way and then writing it back, or displaying it locally without persisting it back), you would not use zero copy.
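A minimal sketch of the mechanism using FileChannel.transferTo, here file-to-file so it stays self-contained (with a SocketChannel as the target you get the classic static-content case; the file names and payload are illustrative):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".bin");
        Path dst = Files.createTempFile("zc-dst", ".bin");
        Files.write(src, "static content served without a user-space copy"
                .getBytes(StandardCharsets.UTF_8));

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            while (pos < size) { // transferTo may move fewer bytes than requested
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(Files.readString(dst));
    }
}
```

The loop matters: transferTo is allowed to transfer fewer bytes than asked, so production code should never assume one call moves the whole file.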
This IBM DeveloperWorks article about zero copy is a good read.
Other ways of doing file I/O in Java are via the Stream classes, chosen based on the type of file you want to read/write. This includes both buffered and unbuffered streams, although buffered streams usually promise better performance, since they cause fewer I/O seek cycles and hence fewer context switches.
I am writing a media transcoding server in which I need to move files around the filesystem, and so far I have been in a dilemma over whether Java's renameTo can be replaced by something that would give me better performance. I was considering exec("mv file1 file2"), but that would be my last resort.
Anyone has had similar experiences or can help me find a solution?
First of all, renameTo is likely just wrapping a system call.
Secondly, moving a file does not involve copying any data from the file itself (at least, in unix). All that happens is that the link from the old directory is removed, and a link from the new directory is added. I don't think you're going to find any performance improvements here.
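A quick way to see this, sketched with java.nio.file.Files.move (the modern replacement for renameTo, which also reports errors properly; the paths and payload are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("move-demo");
        Path src = Files.writeString(dir.resolve("a.mp4"), "media");
        Path dst = dir.resolve("b.mp4");

        // Within one file system this is a rename: only the directory entries
        // change, no file data is copied, so file size does not affect speed.
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);

        System.out.println(Files.exists(src) + " " + Files.readString(dst));
    }
}
```

Note that moving across file systems is a different story: there the data really does get copied, whether you use renameTo, Files.move, or mv.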
I don't think that using the default file methods carries a (mentionable) performance penalty, as most of these JVM-to-OS functions already wrap native calls.
The only case where an exec would be needed is if you wanted to do something with different rights than the program or use a special tool to copy/move the file. (e.g. smart-move when ntfs-junctions are involved)
If rename is a significant performance bottleneck, then you need to improve your hardware, as this is your main constraint. The software is a trivial portion of the time spent, and optimising it will make little difference.
What is your disk configuration? How is it optimised for writes?
I'm using a GZIPInputStream in my program, and I know that the performance would be helped if I could get Java running my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly calculation intensive these days, you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files in two different areas of the HD).
More generally (assuming this is a question of someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what are the units of work that you want to do and how to synchronize between them. Java (with the help of the OS) will generally take as many cores as is available to it, and will also swap threads on the same core if there are more threads than cores (which is typically the case).
PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data: http://www.zlib.net/pigz/ It's not Java yet; any takers? Of course the world needs it in Java.
Sometimes the compression or decompression is a big CPU consumer, though it helps keep the I/O from being the bottleneck.
See also Dataseries (C++) from HP Labs. PIGZ only parallelizes the compression, while Dataseries breaks the output into large compressed blocks, which are decompressible in parallel. It also has a number of other features.
Wrap your GZIP streams in Buffered streams, this should give you a significant performance increase.
OutputStream out = new BufferedOutputStream(
    new GZIPOutputStream(
        new FileOutputStream(myFile)
    )
);
And likewise for the input stream. Using the buffered input/output streams reduces the number of disk reads.
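The input side spelled out as a round trip, so both directions are visible in one place (the file name and payload are made up):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("gzip-demo", ".gz");

        // Buffered output side, as above.
        try (OutputStream out = new BufferedOutputStream(
                new GZIPOutputStream(new FileOutputStream(f)))) {
            out.write("compressed payload".getBytes(StandardCharsets.UTF_8));
        }

        // Input side: the BufferedInputStream batches the raw disk reads
        // before the bytes reach the decompressor.
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(f))),
                StandardCharsets.UTF_8)) {
            for (int c; (c = in.read()) != -1; ) {
                sb.append((char) c);
            }
        }
        System.out.println(sb);
    }
}
```

Even though the consumer here reads one character at a time (the worst case), the buffer keeps the number of actual disk reads small.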
I'm not seeing any answer addressing the other processing of your program.
If you're just unzipping a file, you'd be better off simply using the command line gunzip tool; but likely there's some processing happening with the files you're pulling out of that stream.
If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should be happening in a separate thread from the unzipping.
You could manually start a Thread on each large String or other block of data; but since Java 1.6 or so you'd be better off with one of the fancy new classes in java.util.concurrent, such as a ThreadPoolExecutor.
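A minimal sketch of handing chunks to a pool via the executor framework (the chunk contents are placeholders, and c::length stands in for whatever real per-chunk processing you do):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ChunkWorkers {
    public static void main(String[] args) throws Exception {
        // Hypothetical decompressed chunks; in real use these come off the stream.
        List<String> chunks = List.of("alpha", "beta", "gamma");

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // Each chunk becomes one task submitted to the pool.
            List<Callable<Integer>> tasks = chunks.stream()
                    .map(c -> (Callable<Integer>) c::length)
                    .collect(Collectors.toList());
            int total = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get(); // 5 + 4 + 5
            }
            System.out.println(total);
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool is sized to the machine's core count, so the processing (not the unzipping) is what gets spread across CPUs.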
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; i.e. work with a buffer of, say, 10 MiB (10485760 bytes; binary, not decimal), fill it in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each of the pieces that run in their own threads. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
Compression and decompression using gzip is a serialized process. To use multiple threads you would have to write a custom program to break the input file into many streams, and then a custom program to decompress and join them back together. Either way, IO is going to be a bottleneck WAY before CPU usage is.
Run multiple VMs. Each VM is a process and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
I think it is a mistake to assume that multithreading IO is always evil. You probably need to profile your particular case to be sure, because:
Recent operating systems use the currently free memory for the cache, and your files may actually not be on the hard drive when you are reading them.
Recent hard drives like SSD have much faster access times so changing the reading location is much less an issue.
The question is too general to assume we are reading from a single hard drive.
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the extreme case, one can read all the files into memory and decompress them there in parallel, which is faster and loses nothing to multithreaded I/O; however, something less extreme may also work well.
You also do not need to do anything special to use multiple available cores on JRE. Different threads will normally use different cores as managed by the operating system.
You can't parallelize the standard GZipInputStream, it is single threaded, but you can pipeline decompression and processing of the decompressed stream into different threads, i.e. set up the GZipInputStream as a producer and whatever processes it as a consumer, and connect them with a bounded blocking queue.
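A sketch of that producer/consumer setup (the queue capacity and chunk size are arbitrary choices, and the gzipped input is built in memory to keep the example self-contained):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class Pipeline {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws Exception {
        // Build a small gzipped input in memory (stand-in for a large file).
        ByteArrayOutputStream gzipped = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(gzipped)) {
            gz.write("pipelined decompression".getBytes(StandardCharsets.UTF_8));
        }

        // Bounded queue: applies backpressure if the consumer falls behind.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        // Producer: the GZIPInputStream decompresses chunks and hands them off.
        Thread producer = new Thread(() -> {
            try (InputStream in = new GZIPInputStream(
                    new ByteArrayInputStream(gzipped.toByteArray()))) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    queue.put(Arrays.copyOf(buf, n));
                }
                queue.put(POISON);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer: process decompressed chunks on this thread.
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        for (byte[] chunk; (chunk = queue.take()) != POISON; ) {
            result.write(chunk);
        }
        producer.join();
        System.out.println(result.toString(StandardCharsets.UTF_8));
    }
}
```

Decompression and processing now overlap: while the consumer works on one chunk, the producer is already inflating the next.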
I support a legacy Java application that uses flat files (plain text) for persistence. Due to the nature of the application, the size of these files can reach 100s MB per day, and often the limiting factor in application performance is file IO. Currently, the application uses a plain ol' java.io.FileOutputStream to write data to disk.
Recently, we've had several developers assert that using memory-mapped files, implemented in native code (C/C++) and accessed via JNI, would provide greater performance. However, FileOutputStream already uses native methods for its core methods (i.e. write(byte[])), so it appears a tenuous assumption without hard data or at least anecdotal evidence.
I have several questions on this:
1. Is this assertion really true? Will memory-mapped files always provide faster IO compared to Java's FileOutputStream?
2. Does the class MappedByteBuffer accessed from a FileChannel provide the same functionality as a native memory-mapped file library accessed via JNI? What is MappedByteBuffer lacking that might lead you to use a JNI solution?
3. What are the risks of using memory-mapped files for disk IO in a production application? That is, applications that have continuous uptime with minimal reboots (once a month, max). Real-life anecdotes from production applications (Java or otherwise) preferred.
Question #3 is important - I could answer this question myself partially by writing a "toy" application that perf tests IO using the various options described above, but by posting to SO I'm hoping for real-world anecdotes / data to chew on.
[EDIT] Clarification - each day of operation, the application creates multiple files that range in size from 100MB to 1 gig. In total, the application might be writing out multiple gigs of data per day.
Memory mapped I/O will not make your disks run faster(!). For linear access it seems a bit pointless.
A NIO mapped buffer is the real thing (usual caveat about any reasonable implementation).
As with other NIO direct-allocated buffers, these buffers are not normal memory and won't get GCed as efficiently. If you create many of them, you may find that you run out of memory/address space without running out of Java heap. This is obviously a worry for long-running processes.
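A minimal sketch of writing through a mapped buffer (the file and payload are illustrative):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedWrite {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mmap-demo", ".dat");
        byte[] data = "written through the page cache".getBytes(StandardCharsets.UTF_8);

        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping with READ_WRITE grows the file to the mapped size;
            // puts go to the OS page cache rather than through write() calls.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buf.put(data);
            buf.force(); // flush the dirty pages out to disk
        }
        System.out.println(Files.readString(path));
    }
}
```

Note that closing the channel does not unmap the buffer; the mapping lingers until the buffer is garbage-collected, which is exactly the lifecycle problem described above.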
You might be able to speed things up a bit by examining how your data is being buffered during writes. This tends to be application specific as you would need an idea of the expected data writing patterns. If data consistency is important, there will be tradeoffs here.
If you are just writing out new data to disk from your application, memory mapped I/O probably won't help much. I don't see any reason you would want to invest time in some custom coded native solution. It just seems like too much complexity for your application, from what you have provided so far.
If you are sure you really need better I/O performance (or just O performance, in your case), I would look into a hardware solution such as a tuned disk array. Throwing more hardware at the problem is often more cost-effective from a business point of view than spending time optimizing software. It is also usually quicker to implement and more reliable.
In general, there are a lot of pitfalls in over-optimization of software. You will introduce new types of problems into your application. You might run into memory issues / GC thrashing, which would lead to more maintenance/tuning. The worst part is that many of these issues will be hard to test before going into production.
If it were my app, I would probably stick with the FileOutputStream with some possibly tuned buffering. After that I'd use the time honored solution of throwing more hardware at it.
From my experience, memory mapped files perform MUCH better than plain file access in both real time and persistence use cases. I've worked primarily with C++ on Windows, but Linux performances are similar, and you're planning to use JNI anyway, so I think it applies to your problem.
For an example of a persistence engine built on memory mapped file, see Metakit. I've used it in an application where objects were simple views over memory-mapped data, the engine took care of all the mapping stuff behind the curtains. This was both fast and memory efficient (at least compared with traditional approaches like those the previous version used), and we got commit/rollback transactions for free.
In another project I had to write multicast network applications. The data was sent in randomized order to minimize the impact of consecutive packet loss (combined with FEC and blocking schemes). Moreover, the data could well exceed the address space (video files were larger than 2 GB), so memory allocation was out of the question. On the server side, file sections were memory-mapped on demand and the network layer directly picked the data from these views; as a consequence, memory usage was very low. On the receiver side, there was no way to predict the order in which packets would be received, so it had to maintain a limited number of active views on the target file, and data was copied directly into these views. When a packet had to be put in an unmapped area, the oldest view was unmapped (and eventually flushed to the file by the system) and replaced by a new view on the destination area. Performance was outstanding, notably because the system did a great job of committing data as a background task, and real-time constraints were easily met.
Since then I've been convinced that even the best finely crafted software scheme cannot beat the system's default I/O policy with memory-mapped files, because the system knows more than user-space applications about when and how data must be written. Also, it is important to know that memory mapping is a must when dealing with large data, because the data is never allocated (hence never consuming memory) but dynamically mapped into the address space and managed by the system's virtual memory manager, which is always faster than the heap. So the system always uses memory optimally, and commits data whenever it needs to, behind the application's back and without impacting it.
Hope it helps.
As for point 3: if the machine crashes and there are any pages that were not flushed to disk, then they are lost. Another thing is the waste of address space: mapping a file to memory consumes address space (and requires a contiguous area), and on 32-bit machines that's a bit limited. But since you've said about 100 MB, it should not be a problem. And one more thing: expanding the size of the mmapped file requires some work.
By the way, this SO discussion can also give you some insights.
If you write fewer bytes, it will be faster. What if you filtered it through a GZIPOutputStream, or wrote your data into ZipFiles or JarFiles?
As mentioned above, use NIO (a.k.a. new IO). There's also NIO.2 (the "new new IO") coming out in JDK 7.
The proper use of a RAID hard drive solution would help you, but that would be a pain.
I really like the idea of compressing the data. Go for the GZIPOutputStream, dude! That would double your throughput if the CPU can keep up. It is likely that you can take advantage of the now-standard dual-core machines, eh?
-Stosh
I did a study comparing write performance to a raw ByteBuffer versus write performance to a MappedByteBuffer. Memory-mapped files are supported by the OS, and their write latencies are very good, as you can see in my benchmark numbers. Performing synchronous writes through a FileChannel is approximately 20 times slower, and that's why people do asynchronous logging all the time. In my study I also give an example of how to implement asynchronous logging through a lock-free and garbage-free queue, for ultimate performance very close to a raw ByteBuffer.