I have a very large file (~6 GB) of fixed-width text with lines separated by \r\n, so I'm using a BufferedReader to read it line by line. The process can be interrupted or stopped, and if it is, it uses a checkpoint "lastProcessedLineNbr" to fast-forward to the correct place and resume reading. This is how the reader is initialized:
private void initializeBufferedReader(Integer lastProcessedLineNbr) throws IOException {
    reader = new BufferedReader(new InputStreamReader(getInputStream(), "UTF-8"));
    if (lastProcessedLineNbr == null) { lastProcessedLineNbr = 0; }
    for (int i = 0; i < lastProcessedLineNbr; i++) {
        reader.readLine();
    }
    currentLineNumber = lastProcessedLineNbr;
}
This seems to work fine, and I read and process the data in this method:
public Object readItem() throws Exception {
    if ((currentLine = reader.readLine()) == null) {
        return null;
    }
    currentLineNumber++;
    return parse(currentLine);
}
And again, everything works fine until I reach the last line in the document. readLine() in the latter method throws an error:
17:06:49,980 ERROR [org.jberet] (Batch Thread - 1) JBERET000007: Failed to run job ProdFileRead, parse, org.jberet.job.model.Chunk#3965dcc8: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.rational.batch.reader.TextLineReader.readItem(TextLineReader.java:55)
Curiously, it seems to read past the end of the file and allocate so much space that it runs out of memory. I looked at the contents of the file using Cygwin and "tail file.txt", and the console showed the expected 10 lines. But when I ran "tail file.txt > output.txt", output.txt ended up being about 1.8 GB, far larger than the 10 lines I expected, so Cygwin seems to be doing the same thing. As far as I can tell there is no special EOF character; it's just the last byte of data and it ends abruptly.
Does anyone have any idea how I can get this working? I'm thinking I could resort to counting the number of bytes read until I reach the full size of the file, but I was hoping there was a better way.
"But when I did tail file.txt > output.txt, output.txt ended up being like 1.8GB, much larger than the 10 lines I expected"
What this indicates to me is that the file is padded with 1.8GB of binary zeroes, which Cygwin's tail command ignored when writing to the terminal, but which Java is not ignoring. This would explain your OutOfMemoryError as well, as the BufferedReader continued reading data looking for the next \r\n, never finding it before overflowing memory.
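One way to make the reader robust against that padding is to bound how far a single line may grow, since the record width is fixed and known. This is only a sketch under that assumption; maxLen stands for the fixed record width and is not part of the original reader. Call it with the existing BufferedReader in place of readLine():

// Sketch: read one line, but refuse to grow past the known fixed record width,
// and treat NUL padding as end-of-file instead of part of a line.
private String readBoundedLine(Reader in, int maxLen) throws IOException {
    StringBuilder sb = new StringBuilder(maxLen);
    int c;
    while ((c = in.read()) != -1) {
        if (c == '\r') {
            continue;                 // the '\n' of the \r\n pair ends the line below
        }
        if (c == '\n') {
            return sb.toString();
        }
        if (c == 0) {
            return sb.length() == 0 ? null : sb.toString(); // hit the NUL padding: stop
        }
        if (sb.length() >= maxLen) {
            throw new IOException("Line exceeds expected record width of " + maxLen);
        }
        sb.append((char) c);
    }
    return sb.length() == 0 ? null : sb.toString();
}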
Related
My program executes system commands and returns the output line by line. However, there are a couple of commands that produce lots of lines, and in those cases the RAM usage rises to ~700Mbs, whereas the usual RAM usage for any other command is 50-60Mbs.
This is the method that handles reading the output using a BufferedReader. It is called by another method that handles creating the process for the command, and it passes the output line by line to the showOutputLine() method, which prints it to the console or to a TextArea.
protected void formatStream(InputStream inputStream, boolean isError) {
    bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
    String tempLine = null;
    // Read output
    try {
        while ((tempLine = bufferedReader.readLine()) != null) {
            showOutputLine(tempLine, isError);
        }
    } catch (IOException e) {
        // just stop
    }
}
One example of a command that causes the issue:
adb logcat
EDIT: it appears BufferedReader is innocent; however, the problem still persists and is caused by the JTextArea.
BufferedReader always uses about 16 KB (an 8 K char buffer at 2 bytes per char) in a fixed-size array. If you are using more than this, it is a side effect of generating so many Strings (especially if you have really long lines of text), not of the BufferedReader itself.
The TextArea can retain much more memory, depending on how much text it holds.
In any case, the memory usage that really matters is the size of the heap after a Full GC; the rest is overhead of various kinds.
BTW Mb = mega-bit, MB = mega-byte.
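Since the edit above points at the JTextArea, one common mitigation is to cap how much text the component retains. A rough sketch, assuming the lines arrive via something like the showOutputLine() method mentioned in the question; the maxChars limit is arbitrary:

import javax.swing.JTextArea;
import javax.swing.SwingUtilities;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;

public final class BoundedConsole {

    // Appends a line on the EDT and trims the oldest text once the cap is exceeded.
    public static void appendBounded(JTextArea area, String line, int maxChars) {
        SwingUtilities.invokeLater(() -> {
            area.append(line + "\n");
            Document doc = area.getDocument();
            int excess = doc.getLength() - maxChars;
            if (excess > 0) {
                try {
                    doc.remove(0, excess); // drop the oldest output
                } catch (BadLocationException ignored) {
                    // offsets are within the document, so this should not happen
                }
            }
        });
    }
}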
I have a Java program that is supposed to output data, read that data back in, and then output it again with a few extra columns of results (so two outputs in total). To test my program I tried reading and printing out the exact same CSV to see if it works. However, my first output returns 786718 rows of data, which is complete and correct, but when it gets read again and output a second time, the data is cut off at row 786595, and even that row is missing some column data. The file sizes are also 74868KB vs 74072KB. Is this because my Java program is running out of memory, or is it a problem with Excel/the .csv file?
PrintWriter writer = null;
try {
    writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8");
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} finally {
    if (writer != null) {
        writer.flush();
        writer.close();
    }
}
The most likely reason is that you are not flushing or closing the PrintWriter.
From the Java source
public PrintWriter(OutputStream out) {
    this(out, false);
}

public PrintWriter(OutputStream out, boolean autoFlush) {
    this(new BufferedWriter(new OutputStreamWriter(out)), autoFlush);
}
You can see that PrintWriter is buffered by default.
The default buffer size is 8 KiB, so if you leave data in the buffer and don't write it out, you can lose up to the last 8 KiB of your data.
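A simple way to rule this out is to open the PrintWriter in a try-with-resources block, since close() implies a final flush. A sketch along the lines of the code in the question (saveFileName, readOutputCSV and FindOutput are the question's own names):

try (PrintWriter writer = new PrintWriter(saveFileName + " updated.csv", "UTF-8")) {
    for (Map.Entry<String, ArrayList> entry : readOutputCSV(saveFileName).entrySet()) {
        FindOutput.find(entry.getKey(), entry.getValue(), checkInMRTWriter);
    }
} // close() implies a final flush of the 8 KiB buffer

If checkInMRTWriter is the writer that actually receives the rows, it needs the same flush/close treatment.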
Some things that might have an influence here:
input/output encoding
line separators (you might be reading a file with '\r\n' and writing '\n' back)
CSV escape - values might be escaped or not depending on how you are handling the special cases (values with newlines, comma, or quote). You might be reading valid CSV with a parser but printing out unescaped (and broken) CSV.
whitespaces. Some libraries clear the whitespace when parsing automatically.
The best way to verify this is to use a CSV parsing library, such as univocity-parsers, and use it to read/write your data with a fixed format configuration. Disclosure: I am the author of this library. It's open-source and free (Apache 2.0 license).
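To illustrate the encoding and line-separator points above with plain Java, here is a minimal copy loop that pins both down explicitly. It is only a sketch: the file names are placeholders, it assumes the input really uses \r\n, and it performs no CSV escaping at all, so it is a diagnostic aid rather than a replacement for a proper parser:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvCopy {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("output.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.write("\r\n"); // readLine() strips the separator, so write it back explicitly
            }
        } // try-with-resources flushes and closes both files
    }
}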
We have an issue unzipping bz2 files in Java, whereby the input stream thinks it's finished after reading ~3% of the file.
We would welcome any suggestions for how to decompress and read large bz2 files which have to be processed line by line.
Here are the details of what we have done so far:
For example, one bz2 file is 2.09 GB in size and 24.9 GB uncompressed.
The code below only reads 343,800 lines of the actual ~10 million lines the file contains.
Modifying the code to decompress the bz2 into a text file (FileInputStream straight into the CompressorInputStream) results in a file of ~190 MB, irrespective of the size of the bz2 file.
I have tried setting a buffer value of 2048 bytes, but this has no effect on the outcome.
We have executed the code on 64-bit Windows and on Linux/CentOS, both with the same outcome.
Could the buffered reader come to an empty, "null" line and cause the code to exit the while-loop?
import org.apache.commons.compress.compressors.*;
import java.io.*;
...
CompressorInputStream is = new CompressorStreamFactory()
        .createCompressorInputStream(
                new BufferedInputStream(
                        new FileInputStream(filePath)));

lineNumber = 0;
line = "";
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
    this.processLine(line, ++lineNumber);
}
Even this code, which forces an exception when the end of the stream is reached, has exactly the same result:
byte[] buffer = new byte[1024];
int len = 1;
while (len == 1) {
    out.write(buffer, 0, is.read(buffer));
    out.flush();
}
There is nothing obviously wrong with your code; it should work. This means the problem must be elsewhere.
Try to enable logging (i.e. print the lines as you process them). Make sure there are no gaps in the input (maybe write the lines to a new file and do a diff). Use bzip2 --test to make sure the input file isn't corrupt. Check whether it always fails at the same line (maybe the input contains odd characters or binary data?).
The issue lies with the bz2 files: they were created using a version of Hadoop which includes bad block headers inside the files.
Current Java solutions stumble over this, while other tools ignore it or handle it somehow.
We will look for a solution/workaround.
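For what it's worth, one possible workaround: Hadoop often produces multi-stream (concatenated) bz2 files, and by default Commons Compress stops at the end of the first stream. Whether that is the same issue as the bad block headers described above is an assumption, but the BZip2CompressorInputStream constructor does accept a flag to keep decompressing across concatenated streams, roughly like this:

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Bz2LineReader {

    // Reads every line of a (possibly multi-stream) bz2 file and returns the line count.
    public static long countLines(String filePath) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(
                        new BufferedInputStream(new FileInputStream(filePath)),
                        true), // true = keep decompressing concatenated bz2 streams
                StandardCharsets.UTF_8))) {
            long lineNumber = 0;
            while (br.readLine() != null) {
                lineNumber++; // process the line here instead of just counting
            }
            return lineNumber;
        }
    }
}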
I'm programming a little GUI for a file converter in Java. The file converter writes its current progress to stdout, which looks like this:
Flow_1.wav: 28% complete, ratio=0,447
I wanted to illustrate this in a progress bar, so I'm reading the process' stdout like this:
ProcessBuilder builder = new ProcessBuilder("...");
builder.redirectErrorStream(true);
Process proc = builder.start();
InputStream stream = proc.getInputStream();

byte[] b = new byte[32];
int length;
while (true) {
    length = stream.read(b);
    if (length < 0) break;
    // processing data
}
Now the problem is that regardless of which byte array size I choose, the stream is read in chunks of 4 KB. So my code executes up to length = stream.read(b); and then blocks for quite a while. Once the process has generated 4 KB of output, my program gets that chunk and works through it in 32-byte slices, and then waits again for the next 4 KB.
I tried to force Java to use smaller buffers, like this:
BufferedInputStream stream = new BufferedInputStream(proc.getInputStream(), 32);
Or this:
BufferedReader reader = new BufferedReader(new InputStreamReader(proc.getInputStream()), 32);
But neither changed anything.
Then I found this: Process source (around line 87)
It seems that the Process class is implemented in such a way that it pipes the process's stdout to a file. So what proc.getInputStream() actually does is return a stream to a file, and this file seems to be written with a 4 KB buffer.
Does anyone know some kind of workaround for this situation? I just want to get the process' output instantly.
EDIT: As suggested by Ian Roberts, I also tried to pipe the converter's output into the stderr stream, since this stream doesn't seem to be wrapped in a BufferedInputStream. Still 4k chunks.
Another interesting thing is: I actually don't get exactly 4096 bytes, but about 5 more. I'm afraid the FileInputStream itself is buffered natively.
Looking at the code you linked to, the process's standard output stream gets wrapped in a BufferedInputStream, but its standard error remains unbuffered. So one possibility might be to execute not the converter directly but a shell script (or the Windows equivalent, if you're on Windows) that sends the converter's stdout to stderr:
ProcessBuilder builder = new ProcessBuilder("/bin/sh", "-c",
"exec /path/to/converter args 1>&2");
Don't call redirectErrorStream(true), and then read from proc.getErrorStream() instead of proc.getInputStream().
It may be the case that your converter is already using stderr for its progress reporting, in which case you don't need the script bit; just turn off redirectErrorStream(). If the converter program writes to both stdout and stderr, then you'll need to spawn a second thread to consume stdout as well (the script approach gets around this by sending everything to stderr).
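A minimal sketch of that two-thread arrangement, assuming the converter reports progress on stderr; the command, file name, and progress handling below are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ProgressReader {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder("/path/to/converter", "input.wav");
        Process proc = builder.start();

        // Drain stdout on a background thread so the child never blocks on a full pipe.
        Thread drainer = new Thread(() -> {
            try (InputStream out = proc.getInputStream()) {
                byte[] buf = new byte[4096];
                while (out.read(buf) != -1) {
                    // discard, or log if the converter prints anything useful here
                }
            } catch (IOException ignored) {
            }
        });
        drainer.start();

        // Read progress lines from stderr on the current thread.
        try (BufferedReader err = new BufferedReader(new InputStreamReader(proc.getErrorStream()))) {
            String line;
            while ((line = err.readLine()) != null) {
                System.out.println("progress: " + line); // placeholder for updating the progress bar
            }
        }
        drainer.join();
        proc.waitFor();
    }
}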
I'm having a very difficult time debugging a problem with an application I've been building. I cannot seem to reproduce the problem with a representative test program, which makes it difficult to demonstrate. Unfortunately I cannot share my actual source because of security, but the following test represents fairly well what I am doing: the files and data use Unix-style EOL, I write to a zip file with a PrintWriter, and I use StringBuilders:
public class Tester {
    public static void main(String[] args) {
        // variables
        File target = new File("TESTSAVE.zip");
        PrintWriter printout1;
        ZipOutputStream zipStream;
        ZipEntry ent1;
        StringBuilder testtext1 = new StringBuilder();
        StringBuilder replacetext = new StringBuilder();

        // ensure file replace
        if (target.exists()) {
            target.delete();
        }

        try {
            // open the streams
            zipStream = new ZipOutputStream(new FileOutputStream(target, true));
            printout1 = new PrintWriter(zipStream);
            ent1 = new ZipEntry("testfile.txt");
            zipStream.putNextEntry(ent1);

            // construct the data
            for (int i = 0; i < 30; i++) {
                testtext1.append("Testing 1 2 3 Many! \n");
            }
            replacetext.append("Testing 4 5 6 LOTS! \n");
            replacetext.append("Testing 4 5 6 LOTS! \n");

            // the replace operation
            testtext1.replace(21, 42, replacetext.toString());

            // write it
            printout1 = new PrintWriter(zipStream);
            printout1.println(testtext1);

            // save it
            printout1.flush();
            zipStream.closeEntry();
            printout1.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The heart of the problem is that the file I produce on my side is 16.3k characters. My friend, whether he uses the app on his PC or looks at exactly the same file as me, sees a file of 19.999k characters, the extra characters being a CRLF followed by a massive number of null characters. No matter what application, encoding, or view I use, I cannot see these NUL characters at all; I only see a single LF on the last line, yet I do see a file of 20k. In all cases there is a difference between what is seen with the exact same files on the two machines, even though both are Windows machines and both are using the same editing software to view them.
I've not yet been able to reproduce this behaviour with any number of dummy programs. I have, however, been able to trace the final line's stray CRLF to my use of println on the PrintWriter. When I replaced the println(s) with print(s + '\n') the problem appeared to go away (the file size was 16.3k). However, when I returned the program to println(s), the problem did not return. I'm currently having the files verified by a friend in France to see if the problem really did go away (since I cannot see the NULs but he can), but this behaviour has me thoroughly confused.
I've also noticed that the StringBuilder's replace function states "This sequence will be lengthened to accommodate the specified String if necessary". Given that StringBuilder's setLength function pads with NUL characters and that the ensureCapacity function sets the capacity to the greater of the input or (currentCapacity * 2) + 2, I suspected a relation somewhere. However, I have only once while testing this idea been able to get a result that matched what I've seen, and I have not been able to reproduce it since.
Does anyone have any idea what could be causing this error or at least have a suggestion on what direction to take the testing?
Edit since the comments section is broken for me:
Just to clarify, the output is required to be in Unix format regardless of the OS, hence the use of '\n' directly rather than through a formatter. The original StringBuilder that is inserted into is not in fact generated by me but is the contents of a file read in by the program. I'm confident the reading process works, as the information in it is used heavily throughout the application. I've done a little probing, too, and found that directly prior to saving, the buffer IS the correct capacity and that the output when toString() is invoked is the correct length (i.e. it contains no null characters and is 16,363 characters long, not 19,999). This would put the cause of the error somewhere between generating the string and saving the zip file.
Finally found the cause. I managed to reproduce the problem a few times and traced the cause not to the output side of the code but to the input side. My file reading function was essentially this:
char[] buf;
int charcount = 0;
StringBuilder line = new StringBuilder(2048);
InputStreamReader reader = new InputStreamReader(stream); // provides a line-wise read
BufferedReader file = new BufferedReader(reader);
do { // capture loop
    try {
        buf = new char[2048];
        charcount = file.read(buf, 0, 2048);
    } catch (IOException e) {
        return null; // unknown IO error
    }
    line.append(buf);
} while (charcount != -1);
// close and output
// close and output
The problem was appending a buffer that wasn't full, so the later slots were still at their initial value of '\u0000' (NUL). The reason I couldn't reproduce it was that some data filled the buffers exactly and some didn't.
Why I couldn't see the problem in my text editors I still have no idea, but I should be able to resolve this now. Any suggestions on the best way to do so are welcome, as this is part of one of my long-term utility libraries and I want to keep it as generic and optimised as possible.
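For reference, a minimal sketch of the corrected loop: append only the characters that read() actually returned, so the untouched NUL slots at the end of the buffer never reach the StringBuilder (the UTF-8 charset is an assumption):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class StreamText {

    // Reads the whole stream into a String without ever appending unread buffer slots.
    public static String readAll(InputStream stream) throws IOException {
        StringBuilder text = new StringBuilder(2048);
        char[] buf = new char[2048];
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            int charcount;
            while ((charcount = in.read(buf, 0, buf.length)) != -1) {
                text.append(buf, 0, charcount); // only the chars actually read this pass
            }
        }
        return text.toString();
    }
}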