I noticed that when I use readFully() on a file instead of read(byte[]), processing time is reduced greatly. However, it occurred to me that readFully may be a double-edged sword. If I accidentally try to read in a huge, multi-gigabyte file, it could choke?
Here is a function I am using to generate an SHA-256 checksum:
public static byte[] createChecksum(File log, String type) throws Exception {
    DataInputStream fis = new DataInputStream(new FileInputStream(log));
    Long len = log.length();
    byte[] buffer = new byte[len.intValue()];
    fis.readFully(buffer); // TODO: readFully may come at the risk of
                           // choking on a huge file.
    fis.close();
    MessageDigest complete = MessageDigest.getInstance(type);
    complete.update(buffer);
    return complete.digest();
}
If I were to instead use:
DataInputStream fis = new DataInputStream(new BufferedInputStream(new FileInputStream(log)));
Would that alleviate this risk? Or... is the best option (in situations where you can't guarantee data size) to always control the amount of bytes read in and use a loop till all bytes are read?
(Come to think of it, since the MessageDigest API takes in the full byte array at once, I'm not sure how to obtain a checksum without stuffing all the data in at once, but I suppose that is another question for another thread.)
You should just allocate a decently-sized buffer (65536 bytes, perhaps), and do a loop where you read 64 KB at a time, calling complete.update() inside the loop to append to the digester. Be careful with the last block so you only process the number of bytes read (probably less than 64 KB).
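For example, a rough sketch of that loop, reusing the createChecksum signature from the question (the 64 KB buffer size is just a reasonable choice, not a requirement):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public static byte[] createChecksum(File log, String type) throws Exception {
    MessageDigest complete = MessageDigest.getInstance(type);
    try (InputStream in = new FileInputStream(log)) {
        byte[] buffer = new byte[65536];            // 64 KB scratch buffer
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            // Only digest the bytes actually read; the last block is usually short.
            complete.update(buffer, 0, bytesRead);
        }
    }
    return complete.digest();
}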
Reading the file will take as long as it takes, whether you use readFully() or not.
Whether you can actually allocate gigabyte-sized byte arrays is another question. There is no need to use readFully() at all when downloading files. It's for use in wire protocols where, say, the next 12 bytes are an identifier followed by another 60 bytes of address information, and you don't want to have to keep writing loops.
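To illustrate that use case, a minimal sketch of reading fixed-size protocol fields (the 12-byte and 60-byte field sizes are just the ones from the sentence above):

import java.io.DataInputStream;
import java.io.IOException;

// readFully() blocks until the array is completely filled, or throws
// EOFException if the stream ends first, so no read loop is needed.
static void readRecord(DataInputStream in) throws IOException {
    byte[] identifier = new byte[12];
    byte[] address = new byte[60];
    in.readFully(identifier);
    in.readFully(address);
    // ... interpret the two fields ...
}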
readFully() isn't going to choke if the file is multiple gigabytes, but allocating that byte buffer will. You'll get an out-of-memory exception before you ever get to the call to readFully().
You need to use the method of updating the hash with chunks of the file repeatedly, rather than updating it all at once with the entire file.
Related
I need to convert a stream of char into a stream of bytes, i.e. I need an adapter from a java.io.Writer interface to a java.io.OutputStream, supporting any valid Charset which I will have as a configuration parameter.
However, the java.io.OutputStreamWriter class has a hidden secret: the sun.nio.cs.StreamEncoder object it delegates to underneath creates an 8192 byte (8KB) buffer, even if you don't ask it to.
The problem is, at the OutputStream end I have inserted a wrapper that needs to count the number of bytes being written, so that it immediately stops execution of the source system once a specific number of bytes has been output. And if OutputStreamWriter is creating an 8K buffer, I simply get notified of the number of bytes generated too late, because they will only reach my counter when the buffer flushes (so there may already be more than 8,000 generated bytes waiting for me in the OutputStreamWriter's buffer).
So the question is, is there anywhere in the Java runtime a Writer -> OutputStream bridge that can run unbuffered?
I would really, really hate to have to write this myself :(...
NOTE: hitting flush() on the OutputStreamWriter for each write is not a valid alternative. That brings a large performance penalty (there's a synchronized block involved at the StreamEncoder).
NOTE 2: I understand it might be necessary to keep a small char overflow at the bridge in order to handle surrogates. It's not that I need to stop the execution of the source system at the very moment it generates the n-th byte (that would not be possible, given that bytes can come to me in the form of a larger byte[] in a write call). But I need to stop it as soon as possible, and waiting for an 8K, 2K or even 200-byte buffer to flush would simply be too late.
As you have already detected, the StreamEncoder used by OutputStreamWriter has a buffer size of 8 KB, and there is no interface to change that size.
But the following snippet gives you a way to obtain a Writer for an OutputStream which internally also uses a StreamEncoder, but now with a user-defined buffer size:
String charSetName = ...
CharsetEncoder encoder = Charset.forName(charSetName).newEncoder();
OutputStream out = ...
int bufferSize = ...
WritableByteChannel channel = Channels.newChannel(out);
Writer writer = Channels.newWriter(channel, encoder, bufferSize);
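For instance, a small sketch of using it end to end (the 32-byte buffer size and the ByteArrayOutputStream stand-in for your counting wrapper are just illustrative choices):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.Writer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class SmallBufferWriterExample {
    public static void main(String[] args) throws Exception {
        // Stand-in for the byte-counting OutputStream wrapper from the question.
        OutputStream out = new ByteArrayOutputStream();

        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
        WritableByteChannel channel = Channels.newChannel(out);

        // The third argument is the minimum buffer capacity, so encoded bytes reach
        // the underlying stream much sooner than with OutputStreamWriter's 8 KB buffer.
        Writer writer = Channels.newWriter(channel, encoder, 32);

        writer.write("some characters to encode");
        writer.flush();   // push any remaining encoded bytes to the stream
    }
}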
As I understand it, ByteArrayInputStream is used to read byte[] data.
Why should I use it rather than a plain byte[] (for example, when reading it from a DB)?
What is the difference between them?
If the input is always a byte[], then you're right, there's often no need for the stream. And if you don't need it, don't use it. One additional advantage of a ByteArrayInputStream is that it serves as a very strong indication that you intend the bytes to be read-only (since the stream doesn't provide an interface for changing them), though it's important to note that a programmer can often still access the bytes directly, so you shouldn't use that in a situation where security is a concern.
But if it's sometimes a byte[], sometimes a file, sometimes a network connection, etc, then you need some sort of abstraction for "a stream of bytes, and I don't care where they come from." That's what an InputStream is. When the source happens to be a byte array, ByteArrayInputStream is a good InputStream to use.
This is helpful in many situations, but to give two concrete examples:
You're writing a library that takes bytes and processes them somehow (maybe it's an image processing library, for instance). Users of your library may supply bytes from a file, or from a byte[] in memory, or from some other source. So, you provide an interface that accepts an InputStream — which means that if what they have is a byte[], they need to wrap it in a ByteArrayInputStream.
You're writing code that reads a network connection. But to unit test that code, you don't want to have to open up a connection; you want to just supply some bytes in the code. So the code takes an InputStream, and your test provides a ByteArrayInputStream.
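As a small sketch of that second case (the canned request bytes here are made up for illustration):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// In a test, supply the "network" bytes from memory instead of a real socket.
byte[] canned = "GET / HTTP/1.1\r\n".getBytes(StandardCharsets.UTF_8);
InputStream fakeConnection = new ByteArrayInputStream(canned);
// ... pass fakeConnection to the code that normally reads the real connection ...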
A ByteArrayInputStream contains an internal buffer that contains bytes that may be read from the stream. An internal counter keeps track of the next byte to be supplied by the read method.
ByteArrayInputStream is like a wrapper that protects the underlying array from external modification.
It has higher-level read, mark, and skip functions.
A stream also has the advantage that you don't have to have all bytes in memory at the same time, which is convenient if the size of the data is large and can easily be handled in small chunks.
Reference doc
Whereas if you choose a plain byte[], you have to reinvent the wheel: do the reading and skipping yourself and track the current index explicitly:
byte data[] = { 65, 66, 67, 68, 69 }; // data

// Reading directly from the array: you manage the index yourself.
for (int index = 0; index < data.length; index++) {
    System.out.print((char) data[index] + " ");
}

// Reading through a ByteArrayInputStream: the stream tracks the position for you.
int c;
ByteArrayInputStream bInput = new ByteArrayInputStream(data);
while ((c = bInput.read()) != -1) {
    System.out.println(Character.toUpperCase((char) c));
}
ByteArrayInputStream is a good wrapper for byte[]; the core is understanding streams. A stream is an ordered sequence of bytes of indeterminate length. Input streams move bytes of data into a Java program from some (generally external) source. In Java I/O you can decorate one stream with another stream to get more functionality, though the performance may suffer. The power of the stream metaphor is that the differences between these sources and destinations are abstracted away: all input and output operations are simply treated as streams using the same classes and the same methods. You don't learn a new API for every different kind of device; the same API that reads files can read network sockets, serial ports, Bluetooth transmissions, and more.
I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
    int bytesAvail = stream.available();
    byte[] bytesRead = new byte[bytesAvail];
    stream.read(bytesRead);
    stream.close();
    String fileContents = new String(bytesRead);
    //more code here working with fileContents
}
My problem is that for large files (>2 GB), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns the number of bytes available to be read, and that may be any number less than or equal to the size of the file.
Unfortunately there's no way to do what you want in just one shot without having some other source of information on the length of the file data (i.e., by calling java.io.File.length()). Instead, you have to possibly accumulate from multiple reads. One way is by using ByteArrayOutputStream. Read into a fixed, finite-size array, then write the data you read into a ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write() and look at the return value of read() so you know exactly how many bytes were read into the buffer on each call.
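A hedged sketch of that accumulation loop, adapted to the doImport stream above (the 8 KB chunk size and the default-charset String conversion are just assumptions):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

static String readAll(BufferedInputStream stream) throws IOException {
    ByteArrayOutputStream collected = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];                  // fixed-size scratch buffer
    int bytesRead;
    // read() reports how many bytes it actually delivered; -1 means end of stream.
    while ((bytesRead = stream.read(chunk, 0, chunk.length)) != -1) {
        collected.write(chunk, 0, bytesRead);
    }
    stream.close();
    // new String(byte[]) uses the platform default charset; pass an explicit
    // charset if you know the file's encoding.
    return new String(collected.toByteArray());
}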
I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed; it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.
For good or bad I have been using code like the following without any problems:
ZipFile aZipFile = new ZipFile(fileName);
InputStream zipInput = aZipFile.getInputStream(name);
int theSize = zipInput.available();
byte[] content = new byte[theSize];
zipInput.read(content, 0, theSize);
I have used this logic (obtaining the available size and reading directly into a byte buffer)
for file I/O without any issues, and I have used it with zip files as well.
But recently I ran into a case where zipInput.read(content, 0, theSize); actually reads 3 bytes fewer than the theSize available.
And since the code is not in a loop checking the length returned by zipInput.read(content, 0, theSize);, I read the file with the last 3 bytes missing,
and later the program cannot function properly (the file is a binary file).
Strangely enough, with different zip files of larger size, e.g. 1075 bytes (in my case the problematic zip entry is 867 bytes), the code works fine!
I understand that the logic of the code is probably not the "best", but why am I suddenly getting this problem now?
And how come the code works fine if I run the program right away with a larger zip entry?
Any input is highly welcome
Thanks
From the InputStream read API docs:
An attempt is made to read as many as len bytes, but a smaller number
may be read.
... and:
Returns: the total number of bytes read into the buffer, or -1 if
there is no more data because the end of the stream has been reached.
In other words, unless the read method returns -1 there is still more data available to read, but you cannot guarantee that read will read exactly the specified number of bytes. The specified number of bytes is an upper bound on how much data it will read.
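So if you really do need exactly theSize bytes, you have to keep calling read() until the buffer is full; a rough sketch (this is essentially what DataInputStream.readFully does for you):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

static void readExactly(InputStream in, byte[] buffer) throws IOException {
    int offset = 0;
    while (offset < buffer.length) {
        // Ask for the remaining bytes; read() may return fewer.
        int count = in.read(buffer, offset, buffer.length - offset);
        if (count == -1) {
            throw new EOFException("Stream ended after " + offset + " bytes");
        }
        offset += count;
    }
}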
Using available() does not guarantee that it counts the total bytes available up to the end of the stream.
Refer to Java InputStream's available() method. It says that
Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream. The next invocation might be the same thread or another thread. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.
Note that while some implementations of InputStream will return the total number of bytes in the stream, many will not. It is never correct to use the return value of this method to allocate a buffer intended to hold all data in this stream.
An example solution for your problem can be as follows:
ZipFile aZipFile = new ZipFile(fileName);
InputStream zipInput = aZipFile.getInputStream( caImport );
int available = zipInput.available();
while (available != 0) {
    byte[] contentBytes = new byte[available];
    int bytesRead = zipInput.read(contentBytes); // may still read fewer than 'available' bytes
    // here, do whatever you want with the first bytesRead bytes
    available = zipInput.available();
} // while available
...
This works for sure on all sizes of input files.
The best way to do this would be as follows:
public static byte[] readZipFileToByteArray(ZipFile zipFile, ZipEntry entry)
throws IOException {
InputStream in = null;
try {
in = zipFile.getInputStream(entry);
return IOUtils.toByteArray(in);
} finally {
IOUtils.closeQuietly(in);
}
}
where the IOUtils.toByteArray(in) method (from Apache Commons IO) keeps reading until EOF and then returns the byte array.
Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is the starting file format can be any size, but I need to write the data in the TIFF in little-endian order. So I'm reading the stuff I'm eventually going to write to the TIFF file into the ByteBuffer first so I can put everything in little-endian order, then I'm going to write it to the outfile. I guess since I know how long IFDs and headers are, and I can probably figure out how many bytes are in each image plane, I can just use multiple ByteBuffers during this whole process.
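For what it's worth, the little-endian part by itself is just a one-line setting on the buffer; a minimal sketch with made-up field values:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

ByteBuffer buf = ByteBuffer.allocate(8);
buf.order(ByteOrder.LITTLE_ENDIAN);  // all subsequent puts write little-endian
buf.putShort((short) 42);            // e.g. an IFD entry count
buf.putInt(0x0100);                  // e.g. a tag's value
byte[] littleEndianBytes = buf.array();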
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays; with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    while ((numRead = fis.read(buf)) > 0) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
}
catch (final Exception e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
    int someInt = getIntFromSomewhere();
    dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.length() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets you expect to buffer; use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer sized to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and the original 16 characters are copied into it. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
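A tiny illustration of the difference (1024 is just the example capacity from above):

// Default capacity: room for 16 chars; grows (reallocates and copies) as needed.
StringBuffer small = new StringBuffer();

// Pre-sized: room for 1024 chars up front, so early appends never reallocate.
StringBuffer presized = new StringBuffer(1024);
presized.append("lots of text that would otherwise trigger several reallocations");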
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer - not the whole of the data. It's a temporary resting spot for data: you read a chunk, process it, and possibly write it somewhere else. So allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?