Java compressor not reading file completely - java

We have an issue unzipping bz2 files in Java, whereby the input stream thinks it's finished after reading ~3% of the file.
We would welcome any suggestions for how to decompress and read large bz2 files which have to be processed line by line.
Here are the details of what we have done so far:
For example, one bz2 file is 2.09 GB compressed and 24.9 GB uncompressed.
The code below only reads 343,800 lines of the actual ~10 million lines the file contains.
Modifying the code to decompress the bz2 into a text file (FileInputStream straight into the CompressorInputStream) results in a file of ~190 MB - irrespective of the size of the bz2 file.
I have tried setting a buffer value of 2048 bytes, but this has no effect on the outcome.
We have executed the code on Windows 64 bit and Linux/CentOS both with the same outcome.
Could the buffered reader come to an empty, "null" line and cause the code to exit the while-loop?
import org.apache.commons.compress.compressors.*;
import java.io.*;
...
CompressorInputStream is = new CompressorStreamFactory()
    .createCompressorInputStream(
        new BufferedInputStream(
            new FileInputStream(filePath)));
lineNumber = 0;
line = "";
br = new BufferedReader(
    new InputStreamReader(is));
while ((line = br.readLine()) != null) {
    this.processLine(line, ++lineNumber);
}
Even this code, which forces an exception when the end of the stream is reached, has exactly the same result:
byte[] buffer = new byte[1024];
int len = 1;
while (len == 1) {
    out.write(buffer, 0, is.read(buffer));
    out.flush();
}

There is nothing obviously wrong with your code; it should work. This means the problem must be elsewhere.
Try to enable logging (i.e. print the lines as you process them). Make sure there are no gaps in the input (maybe write the lines to a new file and do a diff). Use bzip2 --test to make sure the input file isn't buggy. Check whether it always fails for the same line (maybe the input contains odd characters or binary data?)
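On the worry that an empty line could end the loop early: BufferedReader.readLine() returns an empty string for a blank line and returns null only at the end of the stream. A minimal sketch to check this, using only the JDK (the class and method names are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class ReadLineDemo {
    // Counts lines with the same loop condition as the code in the question.
    static int countLines(String text) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            while (br.readLine() != null) {
                count++; // a blank line arrives as "", not as null
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // The blank line in the middle is counted like any other line.
        System.out.println(countLines("first\n\nthird\n"));
    }
}
```

So a blank line in the data should not stop the loop; an early null means the underlying stream itself reported end of input.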

The issue lies with the bz2 files: they were created using a version of Hadoop which includes bad block headers inside the files.
Current Java libraries stumble over this, while other tools ignore it or handle it somehow.
Will look for a solution/workaround.

Related

CSVWriter behaves differently on a Unix machine (Tomcat server) for a huge file (5000 KB) and creates an empty file; the same code works fine on Windows. Why?

I am writing a CSV file with CSVWriter (Java), but when the code runs on a Unix box with a large number of records (around 9000), it creates an empty file.
When I run the same code locally (Eclipse) on Windows, it works fine for the same large file. Why?
I noticed one thing: if there are around 3000 records, it works fine on the Unix box too.
The issue occurs only with large files.
I also tried writer.writeNext() instead of writeAll(), but the same issue occurs on the Unix box. :(
Note: the file does not contain any special characters; it is in English.
Code:
CSVReader reader = new CSVReader(new FileReader(inputFile), ',','"');
List<String[]> csvBody = reader.readAll();
int listSize = csvBody.size();
if(listSize > 0){
    String renameFileNamePath = outputFolder + "//"+ existingFileName.replaceFirst("file1", "file2");
    File newFile = new File(renameFileNamePath);
    CSVWriter writer = new CSVWriter(new FileWriter(newFile), ',');
    for(int row=1 ; row < listSize; row++){
        String timeKeyOrTransactionDate = null;
        timeKeyOrTransactionDate = year+"-"+month+"-"+day+" 00:00:00";
        csvBody.get(row)[0] = timeKeyOrTransactionDate ;
    }
    //Write to CSV file which is open
    writer.writeAll(csvBody);
    writer.flush();
    writer.close();
}
reader.close();
The readAll and writeAll methods should only be used with small datasets; otherwise avoid them like the plague. Use the readNext and writeNext methods instead so you don't have to read the entire file into memory.
Note that readNext returns null once there is no more data (end of stream or end of file). I will have to update the javadocs to mention that.
Disclaimer: I am the maintainer of the opencsv project, so please take the "avoid like the plague" seriously. Those methods were only put there because most files are small and fit in memory, but when in doubt about how big your dataset will be, avoid putting it all in memory.
A data error. The Linux machine probably uses UTF-8 as its default encoding. Decoding can then raise an error on the first malformed UTF-8 byte sequence, which the single-byte Windows encoding simply accepts.
You are using the old utility class FileReader (and its equally flawed counterpart FileWriter), which uses the default platform encoding and so makes the software platform dependent.
You need to do:
Charset charset = Charset.forName("Windows-1252"); // Windows Latin-1
For reading
BufferedReader br = Files.newBufferedReader(inputFile.toPath(), charset);
For writing
Path newFile = Paths.get(renameFileNamePath);
BufferedWriter bw = Files.newBufferedWriter(newFile, charset);
CSVWriter writer = new CSVWriter(bw, ',');
The above assumes a single byte encoding, but probably will work for most other single byte encodings too.
A pity that the file is not in UTF-8, allowing any script.
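To see the difference between the two decodings, here is a small sketch using only the JDK. It assumes a file containing a single-byte "é" (byte 0xE9) and uses ISO-8859-1 as a stand-in for Windows-1252, since the two agree on that byte; the class name, method name, and sample data are invented for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CharsetDemo {
    // Reads the first line of a file, decoding it with the given charset.
    static String firstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("charset-demo", ".csv");
        try {
            // "café,1" in a single-byte encoding: the é is the lone byte 0xE9.
            Files.write(tmp, "caf\u00e9,1\n".getBytes(StandardCharsets.ISO_8859_1));
            // The single-byte charset accepts every byte value.
            System.out.println(firstLine(tmp, StandardCharsets.ISO_8859_1));
            try {
                // A lone 0xE9 is not a valid UTF-8 sequence, so strict decoding fails.
                firstLine(tmp, StandardCharsets.UTF_8);
            } catch (MalformedInputException e) {
                System.out.println("UTF-8 decoding failed: " + e);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Readers created by Files.newBufferedReader report malformed input as an exception, which makes an encoding mismatch visible instead of silently producing damaged text.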
The issue has been resolved. The output directory was also shared with a loader application, which checks for files every minute; it picked up the CSV file before it had been written and loaded it into the DB at zero KB.
So I used a BufferedWriter instead of a FileWriter, wrote the data to a tmp file first, then renamed it to file2, and it worked.
Thanks to all of you for your help and valuable suggestions.
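For anyone hitting the same race with a directory watcher, the write-to-temp-then-rename approach can be sketched roughly as follows with plain JDK calls. This is not the poster's actual code: the class and method names, the ".tmp" suffix, and the UTF-8 charset are assumptions, and it presumes the loader ignores files ending in ".tmp".

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

final class TempThenRename {
    // Writes all lines to a sibling ".tmp" file, then renames it to the target name.
    static void writeViaTemp(Path target, List<String> lines) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        } // the writer is flushed and closed here, before the rename
        // The final name only appears once the content is complete.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Where the file system supports it, StandardCopyOption.ATOMIC_MOVE can be used for the rename instead, so the watcher never observes a half-moved file.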

Reading a large compressed file using Apache Commons Compress

I'm trying to read a bz2 file using Apache Commons Compress.
The following code works for a small file.
However, for a large file (over 500 MB), it ends after reading a few thousand lines without any error.
try {
    InputStream fin = new FileInputStream("/data/file.bz2");
    BufferedInputStream bis = new BufferedInputStream(fin);
    CompressorInputStream input = new CompressorStreamFactory()
        .createCompressorInputStream(bis);
    BufferedReader br = new BufferedReader(new InputStreamReader(input,
        "UTF-8"));
    String line = "";
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
} catch (Exception e) {
    e.printStackTrace();
}
Is there another good way to read a large compressed file?
I was having the same problem with a large file, until I noticed that CompressorStreamFactory has a couple of overloaded constructors that take a boolean decompressUntilEOF parameter.
Simply changing to the following may be all that's missing...
CompressorInputStream input = new CompressorStreamFactory(true)
.createCompressorInputStream(bis);
Clearly, whoever wrote this factory seems to think it's better to create new compressor input streams at certain points, with the same underlying buffered input stream so that the new one picks up where the last one left off. They seem to think that's a better default, or preferred way of doing it over allowing one stream to decompress data all the way to the end of the file. I've no doubt they are cleverer than me, and I haven't worked out what trap I'm setting for future me by setting this parameter to true. Maybe someone will tell me in the comments! :-)

Java reading from file and sending using DataOutputStream

I'm trying to write a mini FTP application that reads binary data from a file and sends it to a client. My program often misbehaves: it sends the file, but not completely (e.g. I send a text file and the received content is blank). I think it may be because I use a FileReader to read lines, although I do not quite understand why this would be a problem.
Here is the relevant code:
File file = new File(rootDirectory, name);
int filenum = (int)file.length();
long filelen = file.length();
System.out.println("File is: " + filenum + " bytes long");
socketOut.writeLong(filelen);
fileIn = new BufferedReader(new FileReader(file));
System.out.println("Sending: " + name);
while((line = fileIn.readLine()) != null){
    socketOut.writeBytes(line);
    socketOut.flush();
}
The problem is that Readers/Writers handle text (as opposed to InputStreams/OutputStreams, which handle bytes). FileReader internally uses the default operating system encoding. That conversion will never do for binary files. Also note that readLine discards the line ending (\r\n, \r or \n). As of Java 7 you can do
Files.copy(file.toPath(), socketOut);
instead of the while loop.
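To make the difference concrete, here is a small sketch that sends the same file both ways into a ByteArrayOutputStream, which stands in for the socket stream. The class and method names are invented, and ISO-8859-1 is used in the text variant only so that the result does not depend on the platform default encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BinaryCopyDemo {
    // Mimics the loop in the question: reads text lines and re-emits them.
    static byte[] copyWithReadLine(Path file) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine has already stripped the line terminator, so it is lost here.
                out.write(line.getBytes(StandardCharsets.ISO_8859_1));
            }
        }
        return out.toByteArray();
    }

    // Copies the raw bytes without any text decoding.
    static byte[] copyWithFilesCopy(Path file) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Files.copy(file, out);
        return out.toByteArray();
    }
}
```

With the reader-based variant the receiver gets fewer bytes than the length announced up front, because every line terminator has been dropped; the byte-based variant delivers exactly what is on disk.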
Joop's solution is perfect for Java7 (or later). If you are stuck on an older version (or want to extend your tool arsenal anyway), have a look at the following free libraries:
Apache Commons IO (actually all Apache Commons are interesting to look at): there you can do IOUtils.copy(...)
Google Guava: there it is a little more complicated but flexible. Use ByteSource.copyTo(ByteSink)
I like the caching in the Google libraries, pretty neat
If you don't have Java 7 and don't want to add external libraries, the canonical copy loop in Java for streams is as follows:
while ((count = in.read(buffer)) > 0)
{
    out.write(buffer, 0, count);
}
where count is an int, and buffer is a byte[] of any non-zero size. It doesn't have to be anywhere near the size of the file. I usually use 8192.
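Wrapped into a reusable method, the loop above might look like this. It is only a sketch: the class and method names and the returned byte total are additions for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamCopy {
    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // any non-zero size works
        long total = 0;
        int count;
        while ((count = in.read(buffer)) > 0) {
            // Write only the bytes actually read, not the whole buffer.
            out.write(buffer, 0, count);
            total += count;
        }
        return total;
    }
}
```

The returned total can be compared with the expected file length to detect a short transfer.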

CSV file validation with Java

I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
    currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check the file is CSV for sure before reading it?
Checking extension of the file is kind of lame since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a java mail system does determine the MIME type of an email, it does involve reading all bytes in it, and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it does involve reading all the content:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
    byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading all the content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse mime-types from files and inputstreams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
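A rough sketch of such a domain-specific check is below. It is deliberately naive: it assumes unquoted fields (a comma inside quotes would be miscounted), and the class name, method name, and column limits are hypothetical. It rejects input with control characters or with a field count outside the given range.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class CsvShapeCheck {
    // Returns true if every line has between minCols and maxCols comma-separated
    // fields and contains no control characters other than tab.
    static boolean looksLikeCsv(Reader source, int minCols, int maxCols) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            boolean sawLine = false;
            String line;
            while ((line = br.readLine()) != null) {
                sawLine = true;
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c < 0x20 && c != '\t') {
                        return false; // binary-looking data, e.g. an image
                    }
                }
                // Naive split; the -1 limit keeps trailing empty fields.
                int cols = line.split(",", -1).length;
                if (cols < minCols || cols > maxCols) {
                    return false;
                }
            }
            return sawLine; // an empty upload is not treated as CSV
        }
    }
}
```

Per-field parsing (numbers, dates, encodings) would follow the same pattern, rejecting the upload at the first line that breaks the rules.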

Corrupt file when using Java to download file

This problem seems to happen inconsistently. We are using a java applet to download a file from our site, which we store temporarily on the client's machine.
Here is the code that we are using to save the file:
URL targetUrl = new URL(urlForFile);
InputStream content = (InputStream)targetUrl.getContent();
BufferedInputStream buffered = new BufferedInputStream(content);
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int letter;
while((letter = buffered.read()) != -1)
    fos.write(letter);
fos.close();
Later, I try to access that file by using:
ObjectInputStream keyInStream = new ObjectInputStream(new FileInputStream(savedFile));
Most of the time it works without a problem, but every once in a while we get the error:
java.io.StreamCorruptedException: invalid stream header: 0D0A0D0A
which makes me believe that it isn't saving the file correctly.
I'm guessing that the operations you've done with getContent and BufferedInputStream have treated the file like an ASCII file, converting newlines or carriage returns into carriage return + newline (0x0D0A), which has confused ObjectInputStream (which expects serialized data objects).
If you are using an FTP URL, the transfer may be occurring in ASCII mode.
Try appending ";type=I" to the end of your URL.
Why are you using ObjectInputStream to read it?
As per the javadoc:
An ObjectInputStream deserializes primitive data and objects previously written using an ObjectOutputStream.
Probably the error comes from the fact you didn't write it with ObjectOutputStream.
Try reading it with FileInputStream only.
Here's a sample for binary ( although not the most efficient way )
Here's another used for text files.
There are 3 big problems in your sample code:
You're not just treating the input as bytes
You're needlessly pulling the entire object into memory at once
You're doing multiple method calls for every single byte read and written -- use the array based read/write!
Here's a redo:
URL targetUrl = new URL(urlForFile);
// URL has no getInputStream(); openStream() returns the raw byte stream
InputStream is = targetUrl.openStream();
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int count;
byte[] buff = new byte[16 * 1024];
while((count = is.read(buff)) != -1) {
    fos.write(buff, 0, count);
}
fos.close();
is.close();
You could also step back from the code and check to see if the file on your client is the same as the file on the server. If you get both files on an XP machine, you should be able to use the FC utility to do a compare (check FC's help if you need to run this as a binary compare as there is a switch for that). If you're on Unix, I don't know the file compare program, but I'm sure there's something.
If the files are identical, then you're looking at a problem with the code that reads the file.
If the files are not identical, focus on the code that writes your file.
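If no compare utility is at hand, the same check can be sketched in Java with JDK classes only. The class and method names are invented for illustration; the method reports the offset of the first differing byte.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileCompare {
    // Returns the offset of the first differing byte, or -1 if the files are identical.
    // If one file is a prefix of the other, the length of the shorter file is returned.
    static long firstMismatch(Path a, Path b) throws IOException {
        try (InputStream ia = new BufferedInputStream(Files.newInputStream(a));
             InputStream ib = new BufferedInputStream(Files.newInputStream(b))) {
            long pos = 0;
            while (true) {
                int x = ia.read();
                int y = ib.read();
                if (x != y) {
                    return pos; // differing byte, or one file ended early
                }
                if (x == -1) {
                    return -1; // both ended at the same point
                }
                pos++;
            }
        }
    }
}
```

On Java 12 and later, Files.mismatch(Path, Path) does the same job.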
Good luck!
