Reading a certain number of bytes into a ByteBuffer - Java

I have a binary file of 10 MB. I need to read it in chunks of varying size (e.g. 300 bytes, 273 bytes). For reading I use FileChannel and ByteBuffer. Right now, on each read iteration I allocate a new ByteBuffer of the size I need to read.
Is it possible to allocate a ByteBuffer only once (let's say 200 KB) and read into it (300 bytes, 273 bytes, etc.)? I will never read more than 200 KB at once, and the entire file must be read.
UPD
public void readFile(FileChannel fc, int amountOfBytesToRead) throws IOException {
    ByteBuffer bb = ByteBuffer.allocate(amountOfBytesToRead);
    fc.read(bb);
    bb.flip();
    // do something with the bytes
    bb = null;
}
I cannot read the whole file at once due to memory constraints; that's why I'm reading in chunks. Efficiency is also very important (which is why I don't want my current approach of allocating on every read). Thanks.

Declare several ByteBuffers of the sizes you need and use a scattering read: read(ByteBuffer[] dsts, ...).
Or forget about NIO and use DataInputStream.readFully(). If you put a BufferedInputStream underneath, you won't suffer any performance loss: it may even be faster.
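For example, a minimal sketch of the readFully() approach that reuses one array across variable-sized reads (the method shape and chunk sizes here are illustrative, taken from the question):
void readChunks(File f) throws IOException {
    byte[] buf = new byte[200 * 1024];              // allocated once, reused for every chunk
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(f)))) {
        in.readFully(buf, 0, 300);                  // first chunk lands in buf[0..300)
        // process buf[0..300)
        in.readFully(buf, 0, 273);                  // next chunk overwrites buf[0..273)
        // process buf[0..273)
    }
}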

Related

How to write a fixed number of bytes from a ByteBuffer into a file in Java

I allocate a big ByteBuffer and read some information into it (not using all the space associated with the buffer).
Then I want to write exactly the used bytes into a file after flip().
What should I do?
ByteBuffer buffer = ByteBuffer.allocate(1024);
buffer.putInt(1);
buffer.putInt(2);
buffer.flip();
For example, after the above code, do I just write fc.write(buffer)?
How do I tell the channel that I only need 8 bytes written to fc?
Thanks!
Yes, you can.
See FileChannel.
You need not tell the channel the number of bytes to be written; the FileChannel takes care of that itself.
It will write the bytes from position 0 up to limit().
Do this after you flip the buffer and you should be fine.
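A minimal sketch of that, assuming fc is an open, writable FileChannel:
ByteBuffer buffer = ByteBuffer.allocate(1024);
buffer.putInt(1);             // position = 4
buffer.putInt(2);             // position = 8
buffer.flip();                // position = 0, limit = 8
fc.write(buffer);             // writes exactly limit() - position() = 8 bytes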

What is the fastest way to load a big 2D int array from a file?

I'm loading a 2D array from a file; it's 15,000,000 * 3 ints big (it will be 40,000,000 * 3 eventually). Right now I use dataInputStream.readInt() to read the ints sequentially. It takes ~15 seconds. Can I make it significantly (at least 3x) faster, or is this about as fast as I can get?
Yes, you can. From a benchmark of 13 different ways of reading files: if you have to pick the fastest approach, it would be one of these:
- FileChannel with a MappedByteBuffer and array reads.
- FileChannel with a direct ByteBuffer and array reads.
- FileChannel with a wrapped array ByteBuffer and direct array access.
For the best Java read performance, there are four things to remember:
- Minimize I/O operations by reading an array at a time, not a byte at a time. An 8 KB array is a good size (which is why it's the default for BufferedInputStream).
- Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at the bytes in the array.
- Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
- Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
Map your file into memory!
Java 7 code:
FileChannel channel = FileChannel.open(Paths.get("/path/to/file"),
        StandardOpenOption.READ);
ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY,
        0, channel.size());
// use buf
See the FileChannel.map() documentation for more details.
If you use Java 6, you'll have to go through RandomAccessFile:
RandomAccessFile file = new RandomAccessFile("/path/to/file", "r");
FileChannel channel = file.getChannel();
// same thing to obtain buf
You can even use .asIntBuffer() on the buffer if you want. And you can read only what you actually need to read, when you need to read it. And it does not impact your heap.
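For example, a sketch of pulling the 15,000,000 * 3 table out through an IntBuffer view (buf is the mapped buffer from above; the row count is assumed known):
IntBuffer ints = buf.asIntBuffer();   // default big-endian order matches DataInputStream.readInt()
int rows = 15_000_000;
int[][] table = new int[rows][3];
for (int i = 0; i < rows; i++) {
    table[i][0] = ints.get();
    table[i][1] = ints.get();
    table[i][2] = ints.get();
}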

readFully() comes at the risk of choking?

I noticed that when I use readFully() on a file instead of read(byte[]), processing time is reduced greatly. However, it occurred to me that readFully may be a double-edged sword. If I accidentally try to read in a huge, multi-gigabyte file, it could choke?
Here is a function I am using to generate an SHA-256 checksum:
public static byte[] createChecksum(File log, String type) throws Exception {
    DataInputStream fis = new DataInputStream(new FileInputStream(log));
    Long len = log.length();
    byte[] buffer = new byte[len.intValue()];
    fis.readFully(buffer); // TODO: readFully may come at the risk of
                           // choking on a huge file.
    fis.close();
    MessageDigest complete = MessageDigest.getInstance(type);
    complete.update(buffer);
    return complete.digest();
}
If I were to instead use:
DataInputStream fis = new DataInputStream(new BufferedInputStream(new FileInputStream(log)));
Would that alleviate this risk? Or... is the best option (in situations where you can't guarantee data size) to always control the amount of bytes read in and use a loop till all bytes are read?
(Come to think of it, since the MessageDigest API takes in the full byte array at once, I'm not sure how to obtain a checksum without stuffing all the data in at once, but I suppose that is another question for another thread.)
You should just allocate a decently-sized buffer (65536 bytes, perhaps) and loop, reading 64 KB at a time and calling complete.update() inside the loop to append to the digester. Be careful with the last block so you only process the number of bytes actually read (probably less than 64 KB).
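A sketch of that loop (the buffer size and method shape are illustrative, not the poster's exact code):
public static byte[] createChecksum(File log, String type) throws Exception {
    MessageDigest complete = MessageDigest.getInstance(type);
    try (InputStream in = new FileInputStream(log)) {
        byte[] buffer = new byte[65536];
        int read;
        while ((read = in.read(buffer)) != -1) {
            complete.update(buffer, 0, read);   // digest only the bytes actually read
        }
    }
    return complete.digest();
}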
Reading the file will take as long as it takes, whether you use readFully() or not.
Whether you can actually allocate gigabyte-sized byte arrays is another question. There is no need to use readFully() at all when downloading files. It's for use in wire protocols where, say, the next 12 bytes are an identifier followed by another 60 bytes of address information, and you don't want to have to keep writing read loops.
readFully() isn't going to choke if the file is multiple gigabytes, but allocating that byte buffer will. You'll get an out-of-memory exception before you ever get to the call to readFully().
You need to use the method of updating the hash with chunks of the file repeatedly, rather than updating it all at once with the entire file.
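If you prefer, java.security.DigestInputStream bundles that chunked-update pattern: it updates a digest as a side effect of each read. A minimal sketch (log is the File from the question):
MessageDigest md = MessageDigest.getInstance("SHA-256");
try (DigestInputStream dis = new DigestInputStream(new FileInputStream(log), md)) {
    byte[] buf = new byte[65536];
    while (dis.read(buf) != -1) {
        // digest is updated automatically; nothing to do per chunk
    }
}
byte[] checksum = md.digest();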

How to determine the buffer size for BufferedOutputStream's write method

I am trying to copy a file using the following code:
1:
int data = 0;
byte[] buffer = new byte[4096];
while ((data = bufferedInputStream.read()) != -1) {
    bufferedOutputStream.write(data);
}
2:
byte[] buffer = new byte[4096];
while (bufferedInputStream.read(buffer) != -1) {
    bufferedOutputStream.write(buffer);
}
The actual size of the file is 3892028 bytes (on Windows). The file is uploaded by the user through Struts2 file upload, and the uploaded file size is exactly the same as on Windows. When I try to copy the uploaded file from the temporary folder, the copied file varies in size, and the time taken also varies (though that is negligible). Please find the readings below.
Without buffer (Code 1):     time taken 77, source 3892028, copy 3891200
Buffer size 1024 (Code 2):   time taken 17, source 3892028, copy 3891200
Buffer size 4096 (Code 2):   time taken 18, source 3892028, copy 3891200
Buffer size 10240 (Code 2):  time taken 14, source 3892028, copy 3901440
Buffer size 102400 (Code 2): time taken 9,  source 3892028, copy 3993600
If I increase the buffer size further, the time taken increases again, though negligibly. So my questions are:
Why does the file size change?
Are there any subtle consequences of this size variation?
What is the best way to accomplish this functionality (copying a file)?
I don't know what is going on underneath. Thanks for any suggestions.
Edit: I do have flush() and close() calls.
Note: I have trimmed my code to make it simpler.
The problem is that BufferedInputStream.read(byte[]) reads as much as it can into the buffer. So if the stream has only 1 byte left, only the first byte of the array will be filled. However, BufferedOutputStream.write(byte[]) writes all the given bytes to the stream, meaning it will still write the full 4096 bytes: 1 byte from the current iteration and 4095 stale bytes left over from previous iterations.
What you need to do is save the number of bytes that were read, and then write exactly that many.
Example:
int lastReadCnt = 0;
byte[] buffer = new byte[4096];
while ((lastReadCnt = bufferedInputStream.read(buffer)) != -1) {
    bufferedOutputStream.write(buffer, 0, lastReadCnt);
}
References:
Java 6: InputStream: read(byte[],int,int)
Java 6: OutputStream: write(byte[],int,int)
Why does the file size change?
You forgot to `flush()` (and `close()`):
bufferedOutputStream.flush();
Also, you should pass the number of bytes read to the write method:
bufferedOutputStream.write(buffer, 0, bytesRead);
What is the best way to accomplish this functionality (copying a file)?
FileUtils.copyFile()
IOUtils.copy()
Both are from Apache Commons IO.
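Usage is nearly a one-liner; a sketch with illustrative paths:
// Apache Commons IO: copies the file, preserving the exact length.
FileUtils.copyFile(new File("source.bin"), new File("dest.bin"));
// Or stream to stream; IOUtils.copy() loops internally and writes only the bytes it read:
try (InputStream in = new FileInputStream("source.bin");
        OutputStream out = new FileOutputStream("dest.bin")) {
    IOUtils.copy(in, out);
}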

How to initialize a ByteBuffer if you don't know how many bytes to allocate beforehand?

Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate...?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is that the starting file can be any size, but I need to write the data to the TIFF in little-endian order. So I'm reading the stuff I'm eventually going to write to the TIFF file into the ByteBuffer first, so I can put everything in little endian, and then I'll write it to the output file. I guess, since I know how long the IFDs and headers are, and I can probably figure out how many bytes are in each image plane, I can just use multiple ByteBuffers during this whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    while ((numRead = fis.read(buf)) > 0) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
} catch (final Exception e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
    int someInt = getIntFromSomewhere();
    dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.size() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
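For the file case, a minimal sketch (file name is illustrative; assumes the file fits in an int-sized buffer):
try (FileChannel ch = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
    ByteBuffer buf = ByteBuffer.allocate((int) ch.size());   // sized from the file itself
    while (buf.hasRemaining() && ch.read(buf) != -1) {
        // keep reading until the buffer is full or EOF
    }
    buf.flip();
    // buf now holds the whole file
}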
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets that you will be buffering; use the average length.
Example
Java's StringBuffer, unless otherwise instructed, allocates an initial buffer that holds 16 characters. Once those 16 characters are filled, a new, longer array is allocated and the original 16 characters are copied into it. If the StringBuffer had an initial size of 1024 characters, the reallocation would not happen as early or as often.
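For instance, presizing skips those early reallocations:
StringBuffer sb = new StringBuffer(1024);   // capacity for 1024 chars up front
// appends up to 1024 chars now happen without any internal array copying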
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer, not the whole of the data: a temporary resting spot for each chunk as you read it, process it, and possibly write it somewhere else. So allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?
