I have a binary file (about 100 MB) that I need to read in quickly. In C++ I could just load the file into a char pointer and march through it by incrementing the pointer. This of course would be very fast.
Is there a comparably fast way to do this in Java?
If you use a memory-mapped file or a regular buffer, you will be able to read the data as fast as your hardware allows.
import java.io.*;
import java.nio.*;
import java.nio.channels.FileChannel;

public class ReadWriteBenchmark {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("deleteme", ".bin");
        tmp.deleteOnExit();
        int size = 1024 * 1024 * 1024;

        // Write 1 GB of zeros through a direct ByteBuffer.
        long start0 = System.nanoTime();
        FileChannel fc0 = new FileOutputStream(tmp).getChannel();
        ByteBuffer bb = ByteBuffer.allocateDirect(32 * 1024).order(ByteOrder.nativeOrder());
        for (int i = 0; i < size; i += bb.capacity()) {
            fc0.write(bb);
            bb.clear();
        }
        fc0.close(); // was never closed in the original snippet
        long time0 = System.nanoTime() - start0;
        System.out.printf("Took %.3f ms to write %,d MB using ByteBuffer%n", time0 / 1e6, size / 1024 / 1024);

        // Read it back through a memory-mapped file.
        long start = System.nanoTime();
        FileChannel fc = new FileInputStream(tmp).getChannel();
        MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, size);
        LongBuffer longBuffer = buffer.order(ByteOrder.nativeOrder()).asLongBuffer();
        long total = 0; // used to prevent a micro-optimisation.
        while (longBuffer.remaining() > 0)
            total += longBuffer.get();
        fc.close();
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f ms to read %,d MB MemoryMappedFile%n", time / 1e6, size / 1024 / 1024);

        // Read it again through the FileChannel into the same direct buffer.
        long start2 = System.nanoTime();
        FileChannel fc2 = new FileInputStream(tmp).getChannel();
        bb.clear();
        while (fc2.read(bb) > 0) {
            bb.flip(); // flip before draining; without this the loop reads past the data
            while (bb.remaining() > 0)
                total += bb.get();
            bb.clear();
        }
        fc2.close();
        long time2 = System.nanoTime() - start2;
        System.out.printf("Took %.3f ms to read %,d MB File via NIO%n", time2 / 1e6, size / 1024 / 1024);
    }
}
prints
Took 305.243 ms to write 1,024 MB using ByteBuffer
Took 286.404 ms to read 1,024 MB MemoryMappedFile
Took 155.598 ms to read 1,024 MB File via NIO
This is for a file 10x larger than the one you want. It's this fast because the data is cached in memory (and I have an SSD drive). If you have fast hardware, the data can be read pretty fast.
Sure, you could use a memory mapped file.
Here are two good links with sample code:
Thinking in Java: Memory-mapped files
Java Tips: How to create a memory-mapped file
If you don't want to go this route, just use an ordinary InputStream, such as a DataInputStream wrapped around a BufferedInputStream.
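A minimal sketch of that wrapping (the file name and the assumption that the file holds whole ints are mine):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class DataStreamExample {
    public static void main(String[] args) throws IOException {
        // BufferedInputStream supplies the buffering; DataInputStream adds typed reads.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("data.bin")))) {
            long sum = 0;
            try {
                while (true) {
                    sum += in.readInt(); // 4 bytes per call, big-endian
                }
            } catch (EOFException end) {
                // reached the end of the stream
            }
            System.out.println("sum = " + sum);
        }
    }
}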
Most files will not need memory mapping and can simply be read with the standard Java I/O. A reasonable way to read such files is with a BufferedInputStream.
InputStream in = new BufferedInputStream(new FileInputStream("somefile.ext"));
Buffering is already optimized in Java for most computers. With a larger file, say the 100 MB you mention, you would look at optimizing further; a sketch of one option follows.
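As one such optimisation, you can pull the whole file into a byte array once and march through it with an index, much like the char pointer in the question; a sketch (the file name is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MarchThrough {
    public static void main(String[] args) throws IOException {
        // One bulk read; fine for files that comfortably fit in memory.
        byte[] data = Files.readAllBytes(Paths.get("somefile.ext"));
        long sum = 0;
        for (int i = 0; i < data.length; i++) { // the "pointer" is just an index
            sum += data[i];
        }
        System.out.println("bytes=" + data.length + " sum=" + sum);
    }
}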
Reading the file from the disk is going to be the slowest part by miles, so the choice of API is likely to make no difference whatsoever. For this individual operation, that is; the JVM still takes a decade to start up, so add that time in.
Take a look at this blog post here on how to read a binary file into a byte array in Java:
http://www.spartanjava.com/2008/read-a-file-into-a-byte-array/
Copied from link:
File file = new File("/somepath/myfile.ext");
FileInputStream is = new FileInputStream(file);
// Get the size of the file
long length = file.length();
if (length > Integer.MAX_VALUE) {
throw new IOException("The file is too big");
}
// Create the byte array to hold the data
byte[] bytes = new byte[(int)length];
// Read in the bytes
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
// Ensure all the bytes have been read in
if (offset < bytes.length) {
throw new IOException("The file was not completely read: "+file.getName());
}
// Close the input stream, all file contents are in the bytes variable
is.close();
Using the DataInputStream from the Java SDK can be helpful here. DataInputStream provides methods such as readByte() or readChar(), if that's what's needed.
A simple example can be:
try (DataInputStream dis = new DataInputStream(new FileInputStream("file.dat"))) {
    while (true) {
        byte b = dis.readByte();
        // Do something with the byte
    }
} catch (EOFException eofe) {
    // Stream ended
} catch (IOException ioe) {
    // Input exception
}
Hope it helps. You can, of course, read the entire stream to a byte array and iterate through it as well...
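If you do read everything into a byte array, wrapping it in a ByteBuffer gives you typed, pointer-style access without copying; a small sketch (the array stands in for your file contents):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WrapExample {
    public static void main(String[] args) {
        byte[] bytes = new byte[1024]; // stand-in for the bytes read from the file
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.nativeOrder());
        long total = 0;
        while (buf.remaining() >= Long.BYTES) {
            total += buf.getLong(); // marches 8 bytes per call
        }
        System.out.println(total);
    }
}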
Related
I have requirement where I need to write multiple input streams to a temp file in java. I have the below code snippet for the logic. Is there a better way to do this in an efficient manner?
final String tempZipFileName = "log" + "_" + System.currentTimeMillis();
File tempFile = File.createTempFile(tempZipFileName, "zip");
final FileOutputStream oswriter = new FileOutputStream(tempFile);
for (final InputStream inputStream : readerSuppliers) {
byte[] buffer = new byte[102400];
int bytesRead = 0;
while ((bytesRead = inputStream.read(buffer)) > 0) {
oswriter.write(buffer, 0, bytesRead);
}
buffer = null;
oswriter.write(System.getProperty("line.separator").getBytes());
inputStream.close();
}
I have multiple files ranging in size from 45 to 400 MB; for typical 45 MB and 360 MB files, this method takes around 3 minutes on average. Can this be further improved?
You could try a BufferedInputStream
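A sketch of what that could look like, reusing the question's buffer size and separator (the class and method names are mine):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

public class ConcatToTempFile {
    static File concat(List<InputStream> readerSuppliers) throws IOException {
        File tempFile = File.createTempFile("log_" + System.currentTimeMillis(), ".zip");
        byte[] newline = System.getProperty("line.separator").getBytes();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(tempFile))) {
            byte[] buffer = new byte[102400];
            for (InputStream stream : readerSuppliers) {
                try (InputStream in = new BufferedInputStream(stream)) {
                    int n;
                    while ((n = in.read(buffer)) > 0) {
                        out.write(buffer, 0, n);
                    }
                }
                out.write(newline); // separator between the concatenated streams
            }
        }
        return tempFile;
    }
}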
As @StephenC replied, a BufferedInputStream is not relevant in this case, because your buffer is already big enough.
I reproduced the behaviour on my computer (with an SSD drive), using a 100 MB file.
It took 110 ms to create the new file with this example.
With a BufferedInputStream and a plain OutputStream: 120 ms.
With a plain InputStream and a BufferedOutputStream: 120 ms.
With a BufferedInputStream and a BufferedOutputStream: 110 ms.
I don't get execution times anywhere near as long as yours. Maybe the problem comes from your readerSuppliers?
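For reference, a minimal sketch of such a timing harness (the file names are placeholders):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyTimer {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        try (InputStream in = new BufferedInputStream(new FileInputStream("in.bin"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"))) {
            byte[] buffer = new byte[102400]; // same size as the question's buffer
            int n;
            while ((n = in.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }
        System.out.printf("copy took %.1f ms%n", (System.nanoTime() - start) / 1e6);
    }
}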
I have a problem very similar to the link below:
PDF to byte array and vice versa
The main difference being I am trying to interpret a Socket connection via a ServerSocket containing Binary, rather than a file.
This works as expected.
However, the problem I am having is that this process is taking quite a long time to read into memory, about 1 minute 30 seconds for 500 bytes (although the size of each stream will vary massively)
Here's my code:
BufferedInputStream input = new BufferedInputStream(theSocket.getInputStream());
byte[] buffer = new byte[8192];
int bytesRead;
ByteArrayOutputStream output = new ByteArrayOutputStream();
while ((bytesRead = input.read(buffer)) != -1)
{
output.write(buffer, 0, bytesRead);
}
byte[] outputBytes = output.toByteArray();
//Continue ... and eventually close inputstream
If I log its progress within the while loop, it seems to log all the bytes quite quickly (i.e. it reaches the end of the stream), but then pauses for a while before breaking out of the loop and continuing.
Hope that makes sense.
Well, you're reading until the socket is closed, basically; that's when read will return -1.
So my guess is that the other end of the connection is holding it open for 90 seconds before closing it. Fix that, and you'll fix your problem.
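If you control both ends, one common fix is to length-prefix each message instead of relying on the socket closing; a hedged sketch (the framing convention is an assumption, not part of your current protocol):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;

public class FramedReader {
    // Assumes the sender writes a 4-byte big-endian length followed by that many bytes.
    static byte[] readMessage(Socket socket) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(socket.getInputStream()));
        int length = in.readInt();  // length prefix
        byte[] payload = new byte[length];
        in.readFully(payload);      // blocks until exactly length bytes have arrived
        return payload;             // no need to wait for the socket to close
    }
}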
ByteArrayOutputStream(int size);
By default the capacity is 32 bytes, so it increases like this: 32 -> 64 -> 128 -> 256 -> ...
So initialize it with a bigger capacity.
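For example, sized for messages up to about a megabyte (the figure is a guess; size it to your expected payload):
ByteArrayOutputStream output = new ByteArrayOutputStream(1024 * 1024);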
You can time how long it takes to copy data between a BufferedInputStream and a ByteArrayOutputStream.
int size = 256 << 20; // 256 MB
ByteArrayInputStream bais = new ByteArrayInputStream(new byte[size]);
long start = System.nanoTime();
BufferedInputStream input = new BufferedInputStream(bais);
byte[] buffer = new byte[8192];
int bytesRead;
ByteArrayOutputStream output = new ByteArrayOutputStream();
while ((bytesRead = input.read(buffer)) != -1) {
output.write(buffer, 0, bytesRead);
}
byte[] outputBytes = output.toByteArray();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to copy %,d MB %n", time / 1e9, size >> 20);
prints
Took 0.365 seconds to copy 256 MB
It will be much faster for smaller messages, i.e. well below 256 MB.
I'm stuck trying to stop a download initiated with HtmlUnit after a certain size is reached. The InputStream
InputStream input = button.click().getWebResponse().getContentAsStream();
downloads the complete file correctly. However, it seems like using
OutputStream output = new FileOutputStream(fileName);
byte[] buffer = new byte[1024]; // declaration was missing from the snippet; 1024 is one of the sizes I tried
int bytesRead;
int total = 0;
while ((bytesRead = input.read(buffer)) != -1 && total < MAX_SIZE) {
output.write(buffer, 0, bytesRead);
total += bytesRead;
System.out.print(total + "\n");
}
output.flush();
output.close();
input.close();
somehow downloads the file to a different location (unknown to me) and, once finished, copies the max size into the file "fileName". Nothing is printed by System.out during this process. Interestingly, while running the debugger in NetBeans and stepping through slowly, the total is printed and I get the MAX_SIZE file.
Varying the buffer size in a range between 1024 to 102400 didn't make any difference.
I also tried Commons'
BoundedInputStream b = new BoundedInputStream(button.click().getWebResponse().getContentAsStream(), MAX_SIZE);
without success.
There's this 2.5-year-old post, but I couldn't figure out how to implement the proposed solution.
Is there something I'm missing in order to stop the download at MAX_SIZE?
(Exception handling and other details omitted for brevity)
There is no need to use HtmlUnit for this. Actually, using it for such a simple task is overkill and will make things slow. The best approach I can think of is the following:
final String url = "http://yoururl.com";
final String file = "/path/to/your/outputfile.zip";
final int MAX_BYTES = 1024 * 1024 * 5; // 5 MB
URLConnection connection = new URL(url).openConnection();
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
int pendingRead = MAX_BYTES;
int n;
OutputStream output = new FileOutputStream(new File(file));
while (pendingRead > 0 && (n = input.read(buffer)) >= 0) { // check the limit before reading more
    output.write(buffer, 0, Math.min(pendingRead, n));
    pendingRead -= n;
}
input.close();
output.close();
In this case I've set a maximum download size of 5 MB and a buffer of 4 KB. The buffer is written to disk in every iteration of the while loop, which seems to be what you're looking for.
Of course, make sure you handle all the needed exceptions (eg: FileNotFoundException).
I am downloading a file and also trying to determine the download speed in KB/s. I came up with an equation, but it is giving strange results.
try (BufferedInputStream in = new BufferedInputStream(url.openStream());
FileOutputStream out = new FileOutputStream(file)) {
byte[] buffer = new byte[4096];
int read = 0;
while (true) {
long start = System.nanoTime();
if ((read = in.read(buffer)) != -1) {
out.write(buffer, 0, read);
} else {
break;
}
int speed = (int) ((read * 1000000000.0) / ((System.nanoTime() - start) * 1024.0));
}
}
It's giving me anywhere between 100 and 300,000. How can I make this give the correct download speed? Thanks
You are not comparing the current amount and the previous amount of the file downloaded. For example:
int currentAmount = 0; // update this in each loop of the download
int previousAmount = 0;
int firingTime = 1000; // in milliseconds; here, fire every second
public synchronized void run() {
    int bytesPerSecond = (currentAmount - previousAmount) / (firingTime / 1000);
    // update the GUI using bytesPerSecond
    previousAmount = currentAmount;
}
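One way to drive that sampler once per second is a ScheduledExecutorService; a sketch (class and field names are mine, not from your code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SpeedSampler {
    private final AtomicLong currentAmount = new AtomicLong(); // updated by the download loop
    private long previousAmount = 0;

    void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            long current = currentAmount.get();
            long bytesPerSecond = current - previousAmount; // fired once per second
            previousAmount = current;
            System.out.println(bytesPerSecond / 1024 + " KB/s");
        }, 1, 1, TimeUnit.SECONDS);
    }

    void addBytes(int n) { currentAmount.addAndGet(n); } // call this from the read loop
}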
First, you are measuring read() time plus write() time over very short intervals, and the result will vary with when the writes are flushed from the disk cache. Put the calculation right after the read().
Second, your buffer size (4096) probably does not match the TCP receive buffer size (yours is most likely smaller), so some reads will be very fast because they are served from the local TCP buffer. Use Socket.getReceiveBufferSize(), set the size of your buffer accordingly (say, 2x the TCP receive buffer size), and fill it in a nested loop until full before calculating.
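Putting both points together, a sketch that times only the reads and aggregates over the whole transfer (the URL and file name are placeholders):

import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class DownloadSpeed {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://example.com/file.bin"); // placeholder
        byte[] buffer = new byte[64 * 1024];
        long bytes = 0, readNanos = 0;
        try (InputStream in = new BufferedInputStream(url.openStream());
             OutputStream out = new FileOutputStream("file.bin")) {
            int n;
            while (true) {
                long t0 = System.nanoTime();
                n = in.read(buffer);            // time the read only...
                readNanos += System.nanoTime() - t0;
                if (n == -1) break;
                bytes += n;
                out.write(buffer, 0, n);        // ...not the write
            }
        }
        double kbps = (bytes / 1024.0) / (readNanos / 1e9);
        System.out.printf("%,.0f KB/s over %,d bytes%n", kbps, bytes);
    }
}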
I have a GZIPInputStream that I constructed from another ByteArrayInputStream. I want to know the original (uncompressed) length of the gzip data. Although I can read to the end of the GZIPInputStream and count the bytes, that costs time and wastes CPU. I would like to know the size before reading it.
Is there a similar method like ZipEntry.getSize() for GZIPInputStream:
public long getSize ()
Since: API Level 1
Gets the uncompressed size of this ZipEntry.
It is possible to determine the uncompressed size by reading the last four bytes of the gzipped file.
I found this solution here:
http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Also from this link there is some example code (corrected to use long instead of int, to cope with sizes between 2GB and 4GB which would make an int wrap around):
RandomAccessFile raf = new RandomAccessFile(file, "r");
raf.seek(raf.length() - 4);
int b4 = raf.read(); // read() returns the unsigned byte as an int; the original's byte b4 = raf.read() does not compile
int b3 = raf.read();
int b2 = raf.read();
int b1 = raf.read();
long val = ((long) b1 << 24) | ((long) b2 << 16) | ((long) b3 << 8) | (long) b4;
raf.close();
val is the length in bytes. Beware: you cannot determine the correct uncompressed size when the uncompressed data was larger than 4 GB, because the gzip trailer only stores the size modulo 2^32!
Based on @Alexander's answer:
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
raf.seek(raf.length() - 4);
byte[] bytes = new byte[4];
raf.readFully(bytes); // read(bytes) is not guaranteed to fill the array
long fileSize = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
if (fileSize < 0)
    fileSize += (1L << 32);
raf.close();
Is there a similar method like ZipEntry.getSize() for GZIPInputStream
No. It's not in the Javadoc => it doesn't exist.
What do you need the length for?
There is no reliable way to get the length other than decompressing the whole thing. See Uncompressed file size using zlib's gzip file access function.
If you can guess at the compression ratio (a reasonable expectation if the data is similar to other data you've already processed), then you can work out the size of arbitrarily large files (with some error). Again, this assumes a single gzip stream per file. The following assumes the first size greater than 90% of the estimated size (based on the estimated ratio) is the true size:
double estCompRatio = 6.1;
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
long compLength = raf.length();
raf.seek(compLength - 4);
byte[] bytes = new byte[4];
raf.readFully(bytes);
// the trailer holds the uncompressed size modulo 2^32, little-endian
long uncLength = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xFFFFFFFFL;
while (uncLength < (long) (compLength * estCompRatio * 0.9)) {
    uncLength += (1L << 32);
}
raf.close();
[Setting estCompRatio to 0 is equivalent to @Alexander's answer.]
A more compact version of the calculation based on the 4 tail bytes (avoids a ByteBuffer; calls Integer.reverseBytes to reverse the byte order of the read bytes):
private static long getUncompressedSize(Path inputPath) throws IOException
{
long size = -1;
try (RandomAccessFile fp = new RandomAccessFile(inputPath.toFile(), "r")) {
fp.seek(fp.length() - Integer.BYTES);
int n = fp.readInt();
size = Integer.toUnsignedLong(Integer.reverseBytes(n));
}
return size;
}
Get the FileChannel from the underlying FileInputStream instead. It tells you both the file size and the current position in the compressed file. Example:
@Override
public void produce(final DataConsumer consumer, final boolean skipData) throws IOException {
try (FileInputStream fis = new FileInputStream(tarFile)) {
FileChannel channel = fis.getChannel();
final Eta<Long> eta = new Eta<>(channel.size());
try (InputStream is = tarFile.getName().toLowerCase().endsWith("gz")
? new GZIPInputStream(fis) : fis) {
try (TarArchiveInputStream tais = (TarArchiveInputStream) new ArchiveStreamFactory()
.createArchiveInputStream("tar", new BufferedInputStream(is))) {
TarArchiveEntry tae;
boolean done = false;
while (!done && (tae = tais.getNextTarEntry()) != null) {
if (tae.getName().startsWith("docs/") && tae.getName().endsWith(".html")) {
String data = null;
if (!skipData) {
data = new String(tais.readNBytes((int) tae.getSize()), StandardCharsets.UTF_8);
}
done = !consumer.consume(data);
}
String progress = eta.toStringPeriodical(channel.position());
if (progress != null) {
System.out.println(progress);
}
}
System.out.println("tar bytes read: " + tais.getBytesRead());
} catch (ArchiveException ex) {
throw new IOException(ex);
}
}
}
}
No, unfortunately: if you want to get the uncompressed size, you have to read the entire stream and increment a counter, as you mention in your question (a minimal sketch follows). Why do you need to know the size? Could an estimate of the size work for your purposes?
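For completeness, a sketch of that stream-and-count approach (the class and method names are mine):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipSize {
    // Streams through the gzip data and counts the decompressed bytes.
    static long uncompressedSize(InputStream compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = gz.read(buf)) > 0) {
                total += n;
            }
            return total;
        }
    }
}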