Java - Read File in Byte Array

I want to read a file of 1.5 GB into an array. As this takes a long time, I want to switch to some other option. Can anybody help me:
If I preprocess the file into some database (or maybe in some other way), can I make it faster?
Is there any other way to make it faster?
I actually have to process more than 50 of these 1.5 GB files, so such an operation is quite expensive for me.

It depends on what you want to do.
If you only want to access a few random bytes, then reading the whole file into an array isn't a good idea - a MappedByteBuffer would be better.
If you want to read all the data and sequentially process it a small portion at a time then you could stream it.
If you need to do computations that do random access of the whole dataset, particularly if you need to repeatedly read elements, then loading into an array might be sensible (but a ByteBuffer is still a candidate).
Can you show some example code or explain further?
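In the meantime, here is a minimal sketch of the memory-mapped approach (the file name and offset are placeholders; a single mapping is limited to 2 GB, which a 1.5 GB file still fits into):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("big.dat", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; the OS pages data in lazily, so nothing is copied up front.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println(buf.get(123_456_789)); // random access without reading the whole file
        }
    }
}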

How fast is your disk subsystem?
If you can read 40 MB per second, reading 1500 MB should take about 40 seconds. If you want to go faster than this, you need a faster disk subsystem. If you are reading from a local drive and it's taking minutes, you have a tuning problem, and there is not much you can do in Java to fix this because Java is not the problem.
You can use a memory mapped file instead, but this will only speed up the access if you don't need all the data. If you need it all, you are limited by the speed of your hardware.

Using BufferedInputStream or InputStream is probably as fast as you can get (faster than RandomAccessFile). The largest int value is 2,147,483,647, which is also (roughly) the maximum length of an array, so your 1,610,612,736-byte (1.5 GB) array is getting somewhat close to that limit.
I'd recommend you just access the file using BufferedInputStream for best speed, skip() and read() to get the data you want. Maybe have a class that implements those, is aware of its position, and takes care of the seeking for you when you send it an offset to read from. I believe you close and reopen the input stream to put it back at the beginning.
And... you may not want to save them in an array and just access them on need from the file. That might help if loading time is your killer.
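For example, here is a rough sketch of such a position-aware wrapper (the class name and error handling are just one way to do it, not a definitive implementation):

import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Tracks its own position, skips forward on demand, and reopens the stream
// when asked to go backwards (an InputStream cannot seek back).
class SeekableReader implements Closeable {
    private final String fileName;
    private InputStream in;
    private long position;

    SeekableReader(String fileName) throws IOException {
        this.fileName = fileName;
        this.in = new BufferedInputStream(new FileInputStream(fileName));
    }

    int readAt(long offset, byte[] dest) throws IOException {
        if (offset < position) {                  // can't seek backwards: close and reopen
            in.close();
            in = new BufferedInputStream(new FileInputStream(fileName));
            position = 0;
        }
        while (position < offset) {               // skip forward to the requested offset
            long skipped = in.skip(offset - position);
            if (skipped <= 0) break;               // simplified: end of file or skip stalled
            position += skipped;
        }
        int read = in.read(dest);
        if (read > 0) position += read;
        return read;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}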

Related

How can I work round the fact that java ByteBuffer size cannot change

I read binary data from a file in chunk into a ByteBuffer and then process the data in the ByteBuffer. I like using ByteBuffer because I can make use of the order() method to correctly read out Shorts and Integers.
At a later time I have to write new/modified data to the file. The idea was to use a ByteBuffer again and make use of its put methods. However, I don't easily know the size required in advance; I could work it out, but it would mean parsing the data twice. Alternatively, I could use ByteArrayOutputStream, but then I have to write methods to deal with writing big-endian/little-endian integers et cetera.
What do you recommend? Is there a third option?
I can think of two (other) approaches:
If the problem is that you can't predict the file size in advance, then don't. Instead use File.length() or Files.size(Path) to find the file size.
If you can process the file in chunks, then read a chunk, process it, and write it to a new file. Depending on your processing, you may even be able to update the ByteBuffer in-place. If not, then (presumably) the size of an output chunk is proportional to the size of an input chunk.
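As an illustration of the chunked approach, here is a minimal sketch (the file names, the 64 KB chunk size, and the byte order are assumptions - adjust them to your format, and note that a real implementation must handle values that straddle chunk boundaries):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedCopy {
    public static void main(String[] args) throws Exception {
        try (FileChannel in = FileChannel.open(Paths.get("input.bin"), StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get("output.bin"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer chunk = ByteBuffer.allocate(64 * 1024).order(ByteOrder.LITTLE_ENDIAN);
            while (in.read(chunk) != -1) {
                chunk.flip();
                // ... inspect or update the chunk in place here, e.g. with chunk.getShort(index) ...
                while (chunk.hasRemaining()) {
                    out.write(chunk);      // write the (possibly updated) chunk to the new file
                }
                chunk.clear();
            }
        }
    }
}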
What do I recommend?
I'm not prepared to recommend anything without more details of the problem you are trying to solve; i.e. the nature of the processing.
If your use case guarantees that the file won't change during the processing, then it might be a good idea to use MappedByteBuffer.

Write files into randomly non-contiguous hard disk positions

I need to do some performance tests for my program, and I need to evaluate the worst-case IO access time to some data stored in files. I think the best way to evaluate this is to store the data at random, non-contiguous positions on the hard disk, in order to avoid contiguous data access and caching improvements. I think the only way to do this is to use some low-level OS commands, like dd in UNIX, where you can specify the sector where you write the data, but if I'm not mistaken, this is an insecure method. Does anyone know a good alternative?
PS: Any solution for any OS will work; the only requirement is that I have to run the tests over different data sizes, accessing the data through a Java program.
I think that the only way to do this is using some low-level OS commands
No... RandomAccessFile has .seek():
final RandomAccessFile f = new RandomAccessFile(someFile, "rw");
f.seek(someRandomLong);
f.write(...);
Now, it is of course up to you to ensure that writes don't collide with one another.
Another solution is to map the file in memory and set the buffer's position to some random position before writing.
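A minimal sketch of that memory-mapped alternative (the file name, mapping size, and position are placeholders):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWrite {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Paths.get("test.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a region of the file read-write; the file is grown to the mapping size if needed.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024 * 1024);
            map.position(123_456);               // jump to some random position within the mapping
            map.put("some data".getBytes());     // the OS flushes the mapped pages to disk
        }
    }
}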

What is maximum file size that can be handled by the JVM?

I wanted to know the maximum file size that can be read by Java code.
I want to handle a file size of 100 MB. Is this possible?
If so, what initial JVM settings do I have to use?
Please also recommend some best practices for handling files, e.g. using ObjectInputStream, FilterInputStream, etc., or using a byte array to store the file contents.
What's the biggest number you can write? That's the maximum size.
The total size of file is irrelevant if you read it in chunks; there's no rule in the world that would state that you have to read in your 100 megabyte file in one go, you can read it in, say, 10 megabyte blocks instead. What really matters is how you use that incoming data and whether you need to store the product of the raw data entirely (for example, if the data is a 3D model of a building, how do you internally need to represent it) or only the relevant parts (such as finding first ten matches to some clause from a huge text file).
Since there's a lot of possible ways to handle the data, there's no all-covering blanket answer to your question.
The only maximum I know of is the reporting maximum of length() - which is a long. That maximum is 2^63 - 1, or very, very large.
Java will not hold the entire file in memory at one time. If you want to hold part of the file in memory, you should use one of the "Buffered" classes (the name of the class starts with Buffered). These classes buffer part of the file for you, based on the buffer size you set.
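For example (the file name and the 64 KB buffer size here are just placeholders):
InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 64 * 1024);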
The exact classes you should use depend on the data in the file. If you are more specific, we might be able to help you figure out which classes to use.
(One humble note: Seriously, 100mb? That's pretty small.)
Theoretically there is no maximum file size that can be read, but the most you can read into a single buffer is Integer.MAX_VALUE chars, because you can't create a char array bigger than Integer.MAX_VALUE:
char[] buffer = new char[Integer.MAX_VALUE]; // theoretical maximum buffer (most JVMs cap array length slightly below this, and you need enough heap)
BufferedReader b = new BufferedReader(new FileReader(new File("filename")));
b.read(buffer);
There is no specific maximum file size supported by Java, it all depends on what OS you're running on. 100 megabytes wouldn't be too much of a problem, even on a 32-bit OS.
You didn't say whether you wanted to read the entire file into memory at once. You may find that you only need to process the file a part at a time. For example, a text file might be processed a line at a time, so there would be no need to load the whole file. Just read a line at a time and process each one.
If you want to read the whole file into one block of memory, then you may need to change the default heap size allocated for your JVM. Many JVMs have a default of 128 MB, which probably isn't enough to load your entire file and still have enough room to do other useful things. Check the documentation for your JVM to find out how to increase the heap size allocation.
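For example, with the standard java launcher the maximum heap is raised with the -Xmx flag (the class name here is a placeholder):
java -Xmx1g MyFileProcessor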
As long as you have more than 100 MB free you should be able to load the entire file into memory at once, though you probably won't need to.
BTW: In terms of what the letters mean:
M = mega, i.e. 1 million for disk or 1024^2 for memory.
B = bytes (8 bits).
b = bit, e.g. 100 Mb/s.
m = milli, e.g. ms - milliseconds.
100 milli-bits only makes sense for compressed data, but I assumed this is not what you are talking about.

File processing in java

I have a file of size 2 GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the most efficient and fastest way of doing this using the Java IO API and threads, without running into memory issues? The max heap size for the JVM is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
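For illustration, here is a minimal sketch of that grep-style pass, assuming a text file with one record per line; the file names and the matches() check are hypothetical stand-ins for your filter criteria:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class FilterStudents {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("students.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {   // only one line is held in memory at a time
                if (matches(line)) {
                    out.write(line);
                    out.newLine();                     // output order follows the input order
                }
            }
        }
    }

    // Hypothetical filter criterion; replace with whatever attribute check you need.
    private static boolean matches(String record) {
        return record.contains("COMPUTER_SCIENCE");
    }
}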
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2-3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean a database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. This will help you map the bigger file into a smaller amount of memory. It will act like virtual memory, and as far as performance is concerned, mapped files are faster than stream write/read.

How would you change a single byte in a file?

What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses all byte array manipulation, but this is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way which works, and scales, but it feels quite hacky.
If you're a java io guru, and you had to contend with very large files (200-500MB), how might you approach this?
Thanks!
I'd use RandomAccessFile, seek to the position I wanted to change and write the change.
If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
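A minimal sketch of that approach (the file name, offset, and byte value are placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChangeByte {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("big.dat", "rw")) {
            raf.seek(123_456_789L);   // jump straight to the byte to change
            raf.write(0x42);          // overwrite that single byte; the rest of the file is untouched
        }
    }
}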
