File read progress - java

I am using Java to read a TSV file that is 4 GB in size, and I wanted to know if there is a way for Java to tell me how far it is through the task while the program is running. I'm thinking a file stream might be able to tell me how many bytes it has read, and I could do some simple math with that.

A plain stream or reader doesn't count the number of bytes / characters read.
I think you might be looking for ProgressMonitorInputStream.
If you don't want or need the Swing integration, another alternative is to write a custom subclass of FilterReader or FilterInputStream that counts the characters/bytes read and provides a getter for the count. Then put the custom class into your input stack at the appropriate point.
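A minimal sketch of such a counting subclass might look like the following. The class name CountingInputStream and the getter getCount() are made up for illustration, and mark/reset is not accounted for.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: counts the bytes that pass through it. */
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++; // one byte delivered
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates to this method,
    // so overriding it covers both array-based variants.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped; // skipped bytes also advance the position
        return skipped;
    }

    /** Number of bytes read or skipped so far. */
    public long getCount() {
        return count;
    }
}
```

Wrapped directly around the FileInputStream (below any BufferedReader or decoder), the count can be compared with File.length() to estimate progress. Note that a buffering layer on top reads ahead, so the count may run slightly ahead of what your code has actually processed.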

As you read from the stream, keep a tally of bytes read. For example, if you are reading byte arrays directly from the stream:
long bytesReadTotal = 0L;
int bytesRead = stream.read(bytes);
while (bytesRead != -1) {
    bytesReadTotal += bytesRead;
    // process the first bytesRead bytes of the array ...
    bytesRead = stream.read(bytes);
}

If you read this file over HTTP, the "Content-Length" response header (when the server sends it) tells you the total number of bytes to expect, so you can work out the progress as you read.
If you read the file over raw TCP/UDP, you will presumably be writing both the client and the server for the transfer, so have the server send the file length to the client first and then the file contents.
If you are just reading a local file, this is not a problem: File.length() gives you the total size up front.
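For the local-file case, the tally and the total can be combined roughly as follows. This is only a sketch: the class and method names (ReadProgress, percent, readWithProgress) are invented for the example, and progress is simply printed to standard output.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadProgress {
    /** Percentage of totalBytes covered by bytesRead, capped at 100. */
    static int percent(long bytesRead, long totalBytes) {
        if (totalBytes <= 0) {
            return 100; // empty file: nothing left to read
        }
        return (int) Math.min(100L, bytesRead * 100L / totalBytes);
    }

    /** Reads the whole file, printing progress whenever the percentage changes. */
    static long readWithProgress(File file) throws IOException {
        long total = file.length();
        long bytesReadTotal = 0L;
        int lastReported = -1;
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytesReadTotal += n;
                // process buffer[0..n) here ...
                int p = percent(bytesReadTotal, total);
                if (p != lastReported) {
                    System.out.println(p + "% read");
                    lastReported = p;
                }
            }
        }
        return bytesReadTotal;
    }
}
```

Using long arithmetic keeps the calculation safe for a 4 GB file, where an int byte count would overflow.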

Related

Resource file format processing in Java

I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile, but it does not seem to let me read arrays of data; e.g. I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want, but there's got to be a proper out-of-the-box class for this kind of processing, isn't there?
This is my reader for the header in C#:
protected override void ReadHeader()
{
    Header = new string(this.BinaryReader.ReadChars(3));
    byte dummy = this.BinaryReader.ReadByte();
    NFiles = this.BinaryReader.ReadByte();
}
I think you got lucky with your C# code, as it relies on the character encoding to be set somewhere else, and if it didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java would be to strictly read bytes and do the conversion to characters yourself. If you need seek abilities, then RandomAccessFile would indeed be your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don't need actual random access and only want to skip some of the files, you could certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
String header = "";
int nFiles = 0;
RandomAccessFile raFile = new RandomAccessFile( "filename", "r" );
byte[] buffer = new byte[3];
int numRead = raFile.read( buffer );
header = new String( buffer, StandardCharsets.US_ASCII ); // Charset overload: no checked UnsupportedEncodingException
int numSkipped = raFile.skipBytes(1);
nFiles = raFile.read(); // The byte is read as an integer between 0 and 255
Sanity checks (checking that 3 bytes were actually read, that 1 byte was skipped, and that nFiles is not -1) and exception handling have been omitted for brevity.
It's more or less the same if you use InputStream.
I would go with MappedByteBuffer. This will allow you to seek arbitrarily, but will also deal efficiently and transparently with large files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.
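As a rough sketch of what building on a mapped buffer could look like for the format described in the question: the helper names below are hypothetical, the strings are assumed to be ASCII, and the byte order of the three-byte offset is an assumption (least significant byte first), since the question does not state it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ArchiveHeaderReader {
    /** Maps the whole file read-only; a single mapping is limited to 2 GB. */
    static ByteBuffer map(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Reads a fixed-length ASCII string at the buffer's current position. */
    static String readAscii(ByteBuffer buf, int length) {
        byte[] raw = new byte[length];
        buf.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** Reads one unsigned byte (0..255). */
    static int readUnsignedByte(ByteBuffer buf) {
        return buf.get() & 0xFF;
    }

    /** ASSUMPTION: three-byte offset stored least significant byte first. */
    static int readOffset24(ByteBuffer buf) {
        int b0 = readUnsignedByte(buf);
        int b1 = readUnsignedByte(buf);
        int b2 = readUnsignedByte(buf);
        return b0 | (b1 << 8) | (b2 << 16);
    }
}
```

The header would then be read as readAscii(buf, 3), one skipped byte via buf.get(), and readUnsignedByte(buf) for the file count; each entry follows the same pattern with a twelve-character name and readOffset24(buf). Seeking is a matter of calling buf.position(offset).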

ReadFully() Comes at the risk of choking?

I noticed that when I use readFully() on a file instead of read(byte[]), processing time is reduced greatly. However, it occurred to me that readFully() may be a double-edged sword. If I accidentally try to read in a huge, multi-gigabyte file, could it choke?
Here is a function I am using to generate an SHA-256 checksum:
public static byte[] createChecksum(File log, String type) throws Exception {
    DataInputStream fis = new DataInputStream(new FileInputStream(log));
    Long len = log.length();
    byte[] buffer = new byte[len.intValue()];
    fis.readFully(buffer); // TODO: readFully may come at the risk of
                           // choking on a huge file.
    fis.close();
    MessageDigest complete = MessageDigest.getInstance(type);
    complete.update(buffer);
    return complete.digest();
}
If I were to instead use:
DataInputStream fis = new DataInputStream(new BufferedInputStream(new FileInputStream(log)));
Would that alleviate this risk? Or is the best option (in situations where you can't guarantee the data size) to always control the number of bytes read in and use a loop until all bytes are read?
(Come to think of it, since the MessageDigest API takes in the full byte array at once, I'm not sure how to obtain a checksum without stuffing all the data in at once, but I suppose that is another question for another thread.)
You should just allocate a decently sized buffer (65536 bytes, perhaps) and loop, reading up to 64 KB at a time and calling complete.update() inside the loop to feed each chunk to the digest. Be careful to process only the number of bytes actually read on each pass; the last block in particular will probably be shorter than 64 KB.
Reading the file will take as long as it takes, whether you use readFully() or not.
Whether you can actually allocate gigabyte-sized byte arrays is another question. There is no need to use readFully() at all when downloading files. It's for use in wire protocols where say the next 12 bytes are an identifier followed by another 60 bytes of address information and you don't want to have to keep writing loops.
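readFully() isn't going to choke if the file is multiple gigabytes, but allocating that byte buffer will. You'll most likely get an OutOfMemoryError before you ever get to the call to readFully(), and once the file length no longer fits in an int, len.intValue() yields a wrong (possibly negative) array size.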
You need to use the method of updating the hash with chunks of the file repeatedly, rather than updating it all at once with the entire file.
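The chunked approach described in these answers can be sketched like this. The class name ChunkedChecksum is made up, and the 64 KB buffer size is just the figure suggested above, not a tuned value.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChunkedChecksum {
    /** Digests a file of any size using a fixed 64 KB buffer. */
    static byte[] createChecksum(File file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Feed only the bytes actually read on this pass.
                digest.update(buffer, 0, n);
            }
        }
        return digest.digest();
    }
}
```

Calling createChecksum(log, "SHA-256") should give the same digest as the original single-buffer version, while memory use stays constant regardless of file size.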

Java file IO truncated while reading large files using BufferedInputStream

I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
    int bytesAvail = stream.available();
    byte[] bytesRead = new byte[bytesAvail];
    stream.read(bytesRead);
    stream.close();
    String fileContents = new String(bytesRead);
    //more code here working with fileContents
}
My problem is that for large files (> 2 GB), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns the number of bytes available to be read, and that may be any number less than or equal to the size of the file.
Unfortunately there's no way to do what you want in just one shot without having some other source of information on the length of the file data (e.g., by calling java.io.File.length()). Instead, you may have to accumulate the data from multiple reads. One way is to use ByteArrayOutputStream: read into a fixed-size array, then write the data you read into a ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write() and look at the return value of read() so you know exactly how many bytes were read into the buffer on each call.
I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed; it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
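The accumulation described above could be sketched as follows. The names StreamDrainer and readAll are hypothetical, and the 8192-byte chunk size is arbitrary.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamDrainer {
    /** Reads the stream to its end and returns everything that was read. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() may return fewer bytes than requested; n says how many arrived.
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

This removes the dependence on available(), but the result is still a single in-memory byte array, so it cannot hold more than about 2 GB and needs enough heap for the whole file. For files that large, processing the data chunk by chunk or line by line is the more realistic option.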
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.

Java InputStream's read(byte[]) method

First some background.
It's not needed to answer the actual question, but maybe it'll help put things in perspective.
I have written an mp3 library in Java (h) which reads out the information stored in the ID3 tag in an .mp3 file. Information about the song, like the name of the song, the CD the song was released on, the track number, etc., is stored in this ID3 tag right at the beginning of an .mp3 file.
I have tested the library on 12,579 mp3 files which are located on my local hard drive, and it works flawlessly. Not a single IO error.
When I perform the same thing where the mp3 files are located on a web server, I get an IO error. Well, not actually an error. Actually it's a difference in the behavior of the InputStream's read(byte[]) method.
The example below will illustrate the problem, which occurs when I'm trying to read an image file (.jpg, .gif, .png, etc) from the mp3 file.
// Option 1: read bytes from an .mp3 file on your local hard drive.
// Reading from an input stream created this way works flawlessly.
InputStream inputStream = new FileInputStream("song.mp3");
// Option 2: read bytes from an .mp3 file given by a URL.
// Reading from an input stream created this way fails every time.
URL url = new URL("http://localhost/song.mp3");
HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
httpConnection.connect();
InputStream inputStream = httpConnection.getInputStream();
int size = 25000; // size of the image file
byte[] buffer = new byte[size];
int numBytesRead = inputStream.read(buffer);
if (numBytesRead != buffer.length)
    throw new IOException("Error reading the bytes into the buffer. Expected " + buffer.length + " bytes but got " + numBytesRead + " bytes");
So, my observation is:
Calling inputStream.read(buffer) always reads the entire number of bytes when the input stream is a FileInputStream, but it only reads a partial amount when I am using an input stream obtained from an HTTP connection.
And hence my question is:
In general, can I not assume that the InputStream's read(byte[]) method will block until the entire number of bytes has been read (or EOF is reached)?
That is, have I assumed behavior that is not true of the read(byte[]) method, and I've just gotten lucky working with FileInputStream?
Is the correct, general way to use InputStream.read(byte[]) to put the call in a loop and keep reading bytes until the desired number of bytes has been read or EOF has been reached? Something like the code below:
int size = 25000;
byte[] buffer = new byte[size];
int numBytesRead = 0;
int totalBytesRead = 0;
while (totalBytesRead != size && numBytesRead != -1)
{
    numBytesRead = inputStream.read(buffer);
    totalBytesRead += numBytesRead;
}
Your conclusions are sound. Take a look at the documentation for InputStream.read(byte[]):
Reads some number of bytes from the input stream and stores them into
the buffer array b. The number of bytes actually read is returned as
an integer. This method blocks until input data is available, end of
file is detected, or an exception is thrown.
There is no guarantee that read(byte[]) will fill the array you have provided, only that it will either read at least 1 byte (provided your array's length is > 0) or return -1 to signal the end of the stream. This means that if you want to read a given number of bytes from an InputStream correctly, you must use a loop.
The loop you currently have has one bug in it. On the first iteration of the loop, you will read a certain number of bytes into your buffer, but on the second iteration you will overwrite some, or all, of those bytes. Take a look at InputStream.read(byte[], int, int).
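A sketch of the loop using the three-argument read, so that each call continues where the previous one stopped, might look like this. The class name FullRead is made up; throwing EOFException on a short stream is one possible choice, not the only one.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class FullRead {
    /** Fills the buffer completely, or throws EOFException if the stream ends first. */
    static void readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            // Write after the bytes already received; ask only for what is missing.
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) {
                throw new EOFException(
                        "Expected " + buffer.length + " bytes but got " + total);
            }
            total += n;
        }
    }
}
```

Checking for -1 before adding to the total also avoids counting the end-of-stream marker as a byte. The JDK's DataInputStream.readFully(byte[]) behaves in essentially the same way if you would rather not write the loop yourself.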
And hence my question is: In general, can I not assume that the InputStream's read(byte[]) method will block until the entire number of bytes has been read (or EOF is reached)?
No. That's why the documentation says "The number of bytes actually read" and "there is an attempt to read at least one byte."
I need to put the call in a loop and keep reading bytes until the desired number of bytes have been read
Rather than reinvent the wheel, you can get an already-tested wheel from Apache Commons IO (formerly Jakarta Commons IO).
