How would you change a single byte in a file? - java

What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses byte array manipulation throughout, but it is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100 MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way that works and scales, but it feels quite hacky.
If you're a Java I/O guru and you had to contend with very large files (200-500 MB), how might you approach this?
Thanks!

I'd use RandomAccessFile, seek to the position I wanted to change and write the change.

If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
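For what it's worth, here is a minimal sketch of that approach, assuming Java 7+ for try-with-resources; the file name and offset are made up for illustration:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ChangeOneByte {
        public static void main(String[] args) throws IOException {
            // "rw" opens the file for reading and writing without truncating it
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw")) {
                raf.seek(1234L);   // jump to the byte we want to change
                raf.write(0x7F);   // overwrite exactly one byte in place
            }
        }
    }

Nothing else in the file is read or copied, so it works the same for a 500 MB file as for a 5 KB one.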

How can I work round the fact that java ByteBuffer size cannot change

I read binary data from a file in chunks into a ByteBuffer and then process the data in the ByteBuffer. I like using ByteBuffer because I can make use of the order() method to correctly read out shorts and integers.
At a later time I have to write new/modified data to the file. The idea was to use a ByteBuffer again and make use of its put methods. However, I don't easily know the size required in advance; I could work it out, but it would mean parsing the data twice. Alternatively I could use ByteArrayOutputStream, but then I have to write methods to deal with writing big-endian/little-endian integers, et cetera.
What do you recommend, is there a third option ?
I can think of two (other) approaches:
If the problem is that you can't predict the file size in advance, then don't. Instead use File.length() or Files.size(Path) to find the file size.
If you can process the file in chunks, then read a chunk, process it, and write it to a new file. Depending on your processing, you may even be able to update the ByteBuffer in-place. If not, then (presumably) the size of an output chunk is proportional to the size of an input chunk.
What do I recommend?
I'm not prepared to recommend anything without more details of the problem you are trying to solve; i.e. the nature of the processing.
If your use case guarantees that the file won't change during the processing, then it might be a good idea to use a MappedByteBuffer.
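A minimal sketch of the MappedByteBuffer idea, assuming the whole file fits in a single mapping; the file name, byte order and offsets are illustrative only:

    import java.io.RandomAccessFile;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapAndPatch {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("records.bin", "rw");
                 FileChannel ch = raf.getChannel()) {
                // Map the whole file read/write; the OS pages data in and out for us.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
                buf.order(ByteOrder.LITTLE_ENDIAN);   // same order() convenience as a normal ByteBuffer

                short version = buf.getShort(0);      // absolute get, honours the byte order
                int count = buf.getInt(4);
                buf.putInt(4, count + 1);             // changes go straight back to the file
                System.out.println("version=" + version + " count=" + count);
            }
        }
    }

You never have to size an output buffer yourself: the mapping is exactly as big as the file, and writes through it update the file in place.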

NegativeArraySizeException ANTLRv4

I have a 10 GB file and I need to parse it in Java, but the following error arises when I attempt to do this.
java.lang.NegativeArraySizeException
at java.util.Arrays.copyOf(Arrays.java:2894)
at org.antlr.v4.runtime.ANTLRInputStream.load(ANTLRInputStream.java:123)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:86)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:82)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:90)
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
It looks like ANTLR v4 has a pervasive hard-wired limitation that the input stream size is less than 2^31 characters. Removing this limitation would not be a small task.
Take a look at the source code for the ANTLRInputStream class.
As you can see, it attempts to hold the entire stream contents in a single char[]. That ain't going to work ... for huge input files. But simply fixing that by buffering the data in a larger data structure isn't going to be the answer either. If you look further down the file, there are a number of other methods that use int as the type for indexing the stream. They would need to be changed to use long ... and the changes will ripple out.
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
Two approaches spring to mind:
Create your own version of ANTLR that supports large input files. This is a non-trivial project. I expect that the 32 bit assumption reaches into the code that ANTLR generates, etc.
Split your input files into smaller files before you attempt to parse them. Whether this is viable depends on the input syntax.
My recommendation would be the 2nd alternative. The problem with "supporting" huge input files (by in-memory buffering) is that it is going to be inefficient and memory wasteful ... and it ultimately doesn't scale.
You could also create an issue here, or ask on antlr-discussion.
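If the input syntax does allow cutting on line (or record) boundaries, a rough sketch of the second alternative might look like this; the chunk size and file names are arbitrary, and you would split on whatever boundary your grammar can tolerate:

    import java.io.*;

    public class SplitInput {
        public static void main(String[] args) throws IOException {
            final long maxChunkChars = 100_000_000L;   // stay well under ANTLR's 2^31 limit
            int part = 0;
            long written = 0;
            try (BufferedReader in = new BufferedReader(new FileReader("huge-input.txt"))) {
                BufferedWriter out = new BufferedWriter(new FileWriter("part-" + part + ".txt"));
                for (String line; (line = in.readLine()) != null; ) {
                    if (written + line.length() + 1 > maxChunkChars) {
                        out.close();                    // start a new, smaller file
                        out = new BufferedWriter(new FileWriter("part-" + (++part) + ".txt"));
                        written = 0;
                    }
                    out.write(line);
                    out.newLine();
                    written += line.length() + 1;
                }
                out.close();
            }
        }
    }

Each part-N.txt can then be fed to the generated parser separately.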
I never stumbled upon this error, but I guess your array gets too big and its index overflows (i.e., the integer wraps around and becomes negative). Use another data structure and, most importantly, don't load all of the file at once; use lazy loading instead, that is, load only those parts that are being accessed.
I hope this will help: http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You might want to use some kind of buffer to read big files.
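A minimal sketch of that buffered, piece-at-a-time style of reading; the chunk size and file name are arbitrary:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ChunkedRead {
        public static void main(String[] args) throws IOException {
            char[] chunk = new char[64 * 1024];            // process 64 KB of characters at a time
            try (BufferedReader reader = new BufferedReader(new FileReader("big-file.txt"))) {
                int n;
                while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                    process(chunk, n);                     // only this chunk is ever in memory
                }
            }
        }

        private static void process(char[] data, int length) {
            // placeholder for whatever per-chunk work you need to do
        }
    }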

Java - Read File in Byte Array

I want to read a file of 1.5 GB into an array. As this takes a long time, I want to switch to some other option. Can anybody help me?
If I preprocess the byte file into some database (or maybe in some other way), can I make it faster?
Is there any other way to make it faster?
Actually, I have to process more than fifty 1.5 GB files, so such an operation is quite expensive for me.
It depends on what you want to do.
If you only wanted to access a few random bytes, then reading into an array isn't good - a MappedByteBuffer would be better.
If you want to read all the data and sequentially process it a small portion at a time then you could stream it.
If you need to do computations that do random access of the whole dataset, particularly if you need to repeatedly read elements, then loading into an array might be sensible (but a ByteBuffer is still a candidate).
Can you show some example code or explain further?
How fast is your disk subsystem?
If you can read 40 MB per second, reading 1500 MB should take about 40 seconds. If you want to go faster than this, you need a faster disk subsystem. If you are reading from a local drive and it's taking minutes, you have a tuning problem, and there is not much you can do in Java to fix it because Java is not the problem.
You can use a memory mapped file instead, but this will only speed up the access if you don't need all the data. If you need it all, you are limited by the speed of your hardware.
Using BufferedInputStream or InputStream is probably as fast as you can get (faster than RandomAccessFile). The largest int value is 2,147,483,647, so your array of 1,610,612,736 bytes is getting somewhat close to the maximum size of an array.
I'd recommend you just access the file using BufferedInputStream for best speed, using skip() and read() to get the data you want. Maybe have a class that implements those, is aware of its position, and takes care of the seeking for you when you send it an offset to read from; I believe you close and reopen the input stream to put it back at the beginning.
And... you may not want to save the bytes in an array at all, and instead just access them on demand from the file. That might help if loading time is your killer.
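A rough sketch of that position-aware wrapper idea; the class and method names are invented for illustration:

    import java.io.*;

    // Hypothetical helper: knows its current position and "seeks" by skipping
    // forward, reopening the stream when asked to go backwards.
    class SeekableStream implements Closeable {
        private final File file;
        private BufferedInputStream in;
        private long pos;

        SeekableStream(File file) throws IOException {
            this.file = file;
            this.in = new BufferedInputStream(new FileInputStream(file));
        }

        int readAt(long offset, byte[] dst) throws IOException {
            if (offset < pos) {                 // can't rewind a stream: close and reopen
                in.close();
                in = new BufferedInputStream(new FileInputStream(file));
                pos = 0;
            }
            while (pos < offset) {
                long skipped = in.skip(offset - pos);
                if (skipped <= 0) break;        // end of file, or skip() made no progress
                pos += skipped;
            }
            int n = in.read(dst);
            if (n > 0) pos += n;
            return n;
        }

        public void close() throws IOException { in.close(); }
    }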

Fast way to compress binary data?

I have some binary data (pixel values) in a int[] (or a byte[] if you prefer) that I want to write to disk in an Android app. I only want to use a small amount of processing time but want as much compression as I can for this. What are my options?
In many cases the array will contain lots of consecutive zeros so something simple and fast like RLE compression would probably work well. I can't see any Android API functions for this though. If I have to loop over the array in Java, this will be very slow as there is no JIT on most Android devices. I could use the NDK but I'd rather avoid this if I can.
DeflaterOutputStream takes ~25 ms to compress 1 MB in Java. It's a native method, so a JIT should not make much difference.
Do you have a requirement which says 0.2s or 0.5s is too slow?
Can you do it in a background thread so the user doesn't notice how long it takes?
GZIP is based on Deflater + CRC32, so it is likely to be much the same or slightly slower.
Deflater has several modes. The DEFAULT_STRATEGY is fastest in Java, but a simpler compression such as HUFFMAN_ONLY might be faster for you.
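A small sketch of trying both strategies; the buffer contents are made up, and you would want to time this on a real device rather than trusting desktop numbers:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class CompressPixels {
        static byte[] compress(byte[] pixels, int strategy) throws IOException {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setStrategy(strategy);                 // e.g. Deflater.HUFFMAN_ONLY
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DeflaterOutputStream dos = new DeflaterOutputStream(bos, deflater);
            dos.write(pixels);
            dos.close();                                    // finishes the deflate stream
            deflater.end();
            return bos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] data = new byte[1024 * 1024];            // 1 MB of (mostly zero) pixel data
            long t0 = System.nanoTime();
            byte[] huffman = compress(data, Deflater.HUFFMAN_ONLY);
            byte[] dflt = compress(data, Deflater.DEFAULT_STRATEGY);
            System.out.printf("huffman=%d bytes, default=%d bytes, %d ms%n",
                    huffman.length, dflt.length, (System.nanoTime() - t0) / 1_000_000);
        }
    }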
Android has Java's DeflaterOutputStream. Would that work?
Chain a GZIPOutputStream (http://download.oracle.com/javase/1.4.2/docs/api/java/util/zip/GZIPOutputStream.html) onto a FileOutputStream and pass the byte array to it. Then, when you need to read the data back in, do the reverse: chain a GZIPInputStream (http://download.oracle.com/javase/1.4.2/docs/api/java/util/zip/GZIPInputStream.html) onto a FileInputStream.
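A minimal sketch of that chaining, with an arbitrary file name:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipRoundTrip {
        static void save(byte[] data, File file) throws IOException {
            try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(file))) {
                out.write(data);                 // compressed as it is written
            }
        }

        static byte[] load(File file) throws IOException {
            try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(file));
                 ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    bos.write(buf, 0, n);        // decompressed on the way back out
                }
                return bos.toByteArray();
            }
        }
    }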
Depending on the size of the file you're saving, you will see some compression; GZIP is good like that. If you're not seeing much of a trade-off, just write the data uncompressed using a buffered writer (that should be the fastest). Also, if you do gzip it, using a buffered writer/reader could speed it up a bit.
I've had to solve basically the same problem on another platform and my solution was to use a modified LZW compression. First, do some difference filtering (similar to PNG) on the 32bpp image. This will turn most of the image to black if there are large areas of common color. Then use a generic GIF compression algorithm treating the filtered image as if it's 8bpp. You'll get decent compression and it works very quickly. This will need to run in native code (NDK). It's really quite easy to get native code working on Android.
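Just to illustrate the difference-filtering step described above (the LZW/GIF pass itself would live in native code); this treats each 32bpp pixel as a whole int, which keeps the filter reversible and turns flat-colored regions into runs of zeros:

    // Horizontal difference filter on 32bpp pixels: large areas of a common
    // color become runs of zeros, which compress very well afterwards.
    public class DiffFilter {
        static int[] filter(int[] pixels, int width, int height) {
            int[] out = new int[pixels.length];
            for (int y = 0; y < height; y++) {
                int row = y * width;
                out[row] = pixels[row];                          // first pixel kept as-is
                for (int x = 1; x < width; x++) {
                    out[row + x] = pixels[row + x] - pixels[row + x - 1];
                }
            }
            return out;
        }
    }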
Random thought: if it's image data, try saving it as PNG. Standard Java has it, I'm sure Android will too, and it's probably optimized with native code. It has pretty good compression and it's lossless.
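If the int[] really is ARGB pixel data, a sketch of the PNG route on Android might look like this; it assumes the android.graphics.Bitmap API and an output path of your choosing:

    import android.graphics.Bitmap;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class PngSaver {
        // pixels are ARGB values, width * height long
        static void saveAsPng(int[] pixels, int width, int height, String path) throws IOException {
            Bitmap bmp = Bitmap.createBitmap(pixels, width, height, Bitmap.Config.ARGB_8888);
            FileOutputStream out = new FileOutputStream(path);
            try {
                // quality is ignored for PNG (it is lossless); compression runs in native code
                bmp.compress(Bitmap.CompressFormat.PNG, 100, out);
            } finally {
                out.close();
                bmp.recycle();
            }
        }
    }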

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of memory-mapped I/O, aka direct byte buffers. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.)
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
BTW: How much data are you dealing with, in GB? If it is more than 3-4 GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine and OS will take you to 1 TB or 128 TB of mappable data.
If you are thinking about performance, then get to know Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a Google search for "wide finder" java).
The Wide Finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have more than one copy of the data if you need to keep the original around.
It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again you can do it faster on subsequent passes.
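One way such an index could look, assuming line-oriented records; the class and method names are illustrative:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;

    public class LineIndex {
        // Record the byte offset at which each line starts, so later passes
        // can seek straight to record N instead of re-reading everything before it.
        static List<Long> buildIndex(File file) throws IOException {
            List<Long> offsets = new ArrayList<Long>();
            try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
                long pos = 0;
                offsets.add(0L);
                for (int b; (b = in.read()) != -1; ) {
                    pos++;
                    if (b == '\n') offsets.add(pos);   // next line starts just after the newline
                }
            }
            return offsets;
        }
    }

A later pass can then use RandomAccessFile.seek(offsets.get(n)) to jump straight to the n-th record.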
To answer your questions in order:
Should I load everything into memory all at once?
Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, just do some kind of buffered read through them one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders and the like are simplest, although you could look deeper into FileChannel and friends to use memory-mapped I/O and go through windows of the data at a time (see the sketch after this list).
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
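A sketch of that windowed memory-mapped approach; the window size and file name are arbitrary, and the per-byte work is a stand-in for your real processing:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class WindowedScan {
        public static void main(String[] args) throws Exception {
            final long window = 256L * 1024 * 1024;          // map 256 MB of the file at a time
            long checksum = 0;
            try (RandomAccessFile raf = new RandomAccessFile("numbers.dat", "r");
                 FileChannel ch = raf.getChannel()) {
                long size = ch.size();
                for (long pos = 0; pos < size; pos += window) {
                    long len = Math.min(window, size - pos);
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    while (buf.hasRemaining()) {
                        checksum += buf.get() & 0xFF;        // stand-in for real per-window processing
                    }
                }
            }
            System.out.println("checksum=" + checksum);
        }
    }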
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
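A rough sketch of that caching-wrapper idea; the class and field names are invented:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Knows where in the file its data lives; loads it on first use and can be
    // dropped under memory pressure and re-read later.
    class LazyRecord {
        private final String path;
        private final long offset;
        private final int length;
        private byte[] cached;                    // null until loaded, may be cleared again

        LazyRecord(String path, long offset, int length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        byte[] data() throws IOException {
            if (cached == null) {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(offset);
                    byte[] buf = new byte[length];
                    raf.readFully(buf);
                    cached = buf;
                }
            }
            return cached;
        }

        void evict() { cached = null; }           // safe: data() will reload from the file
    }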
You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.
I strongly recommend leveraging regular expressions and looking into the "new" I/O (nio) package for faster input. Then it should go as quickly as you can realistically expect gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.
