Java heap profiling seems to freeze during large file ingestion

I am trying to profile a Java 7 application running on a Red Hat machine. When I run it as follows:
java -agentlib:hprof=cpu=samples,depth=10,monitor=y,thread=y ...
a particular block of code that creates a table-type object by reading a very large gzipped text file line by line completes in about 6 minutes. When I run it like this:
java -agentlib:hprof=heap=sites,depth=10,monitor=y,thread=y ...
the same block takes several orders of magnitude more time to complete (I am estimating something like 24 hours).
Here's the method (part of a class) that reads in the file:
private static void ingestValues()
{
    int mSize = 30000;
    pairScoresTable = new float[mSize][];
    for (int i = 0; i < pairScoresTable.length; i++) {
        pairScoresTable[i] = new float[mSize];
        Arrays.fill(pairScoresTable[i], fillVal);
    }
    try
    {
        BufferedReader bufferedReader =
            new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(
                        new FileInputStream(rawPath)), "US-ASCII"));
        String line = null;
        while ((line = bufferedReader.readLine()) != null) { // file has 388661141 lines
            Float value = 0.0F;
            Integer i = 0;
            Integer j = 0;
            // extract value, i and j by parsing line...
            pairScoresTable[i][j] = value;
            pairScoresTable[j][i] = value;
        }
        bufferedReader.close();
    }
    catch (Exception e)
    {
        return;
    }
    return;
}
So it basically reads a file in which each line describes the position and value of an entry in the 2D matrix pairScoresTable.
Why is there such a large difference in execution time? Is there a way to do heap profiling of this code faster without having to refactor it?
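One hedged, untested mitigation: heap=sites records an allocation site plus a stack trace of the requested depth for every allocation, and this loop allocates boxed Float/Integer objects on each of its ~389 million lines, so a shallower stack depth may cut the per-allocation cost:
java -agentlib:hprof=heap=sites,depth=2,monitor=y,thread=y ...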

Related

Java: Improving speed of a reader program

Hey, so I am working on this program that reads CSV files, and I need to make a method that can return one entire column of values.
Currently I do it like this:
List<String> data = new LinkedList<>();
for (int i = 0; i < getRowCount(); i++) {
    data.add(getRow(i).get(column));
}
Where getRow() is this:
List<String> data = new LinkedList<>();
String column;
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
    for (int i = 0; i < row; i++) {
        bufferedReader.readLine();
    }
    column = bufferedReader.readLine();
    for (String col : column.split(columnSeparator.toString())) {
        data.add(col);
    }
} catch (IOException e) {
    e.printStackTrace();
}
and it works. But the flaw is, if there are too many columns in a file, it takes way too long. It takes 27 seconds on 7500 lines and 9 columns, and over 10 minutes on 35000 lines and 16 columns. Do you know how I could make it faster?
Try to read the file once:
List<String> getColumn(int column) {
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        List<String> data = new LinkedList<>();
        String line = bufferedReader.readLine();
        while (line != null) {
            String[] cols = line.split(columnSeparator.toString());
            data.add(cols[column]);
            line = bufferedReader.readLine();
        }
        return data;
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
What you are doing is the following:
Prepare to read the file (Creating ReaderObject, ...), read first line
Prepare to read the file, read first line, read second line
Prepare to read the file, read first line, read second line, read third line
... And so on.
Apparently this is not very efficient (you're doing stuff in O(n²), with n = number of lines).
You could improve your code vastly if you do something like this:
Prepare to read the file
Read the first line
Read the second line
... And so on.
So first read all the lines at once:
List<String> lines = new LinkedList<>();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null)
        lines.add(line);
} catch (IOException e) {
    e.printStackTrace();
}
You can then iterate over the lines to split them into columns and extract the data you're interested in:
List<String> data = new LinkedList<>();
for (String line : lines)
    data.add(line.split(columnSeparator.toString())[column]);
Of course this still needs a little bit of error handling :)
I would suggest you try hoisting the call out of the loop condition:
int rowCount = getRowCount();
for (int i = 0; i < rowCount; i++)
{
    data.add(getRow(i).get(column));
}
getRowCount() is evaluated on every iteration when you call it in the for statement's condition. You would eventually get all the rows either way, but internally I believe each call re-reads the file, and you probably don't want to read a file that many times.

GWT: reading from a CSV file on the Server

I am trying to convert an old Applet to a GWT application, but I encountered a problem with the following function:
private String[] readBrandList() {
    try {
        File file = new File("Brands.csv");
        String ToAdd = "Default";
        BufferedReader read = new BufferedReader(new FileReader(file));
        ArrayList<String> BrandName = new ArrayList<String>();
        while (ToAdd != null) {
            ToAdd = (read.readLine());
            BrandName.add(ToAdd);
        }
        read.close();
        String[] BrandList = new String[BrandName.size()];
        for (int Counter = 0; Counter < BrandName.size(); Counter++) {
            BrandList[Counter] = BrandName.get(Counter);
        }
        return BrandList;
    } catch (Exception e) {
    }
    return null;
}
Now, apparently BufferedReader isn't supported by GWT, and I can find no way to replace it other than writing all the entries into the code, which would result in a maintenance nightmare.
Is there any function I'm not aware of or is it just impossible?
You need to read this file on the server side of your app and then pass the results to the client using your preferred server-client communication method. You can read and pass the entire file if it's small, or read and transfer it in chunks if the file is big.
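As a minimal sketch of that approach, assuming GWT-RPC as the communication method (the service, method, and path names here are hypothetical):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import com.google.gwt.user.server.rpc.RemoteServiceServlet;

// shared: the synchronous service interface
@RemoteServiceRelativePath("brands")
public interface BrandService extends RemoteService {
    String[] getBrands();
}

// server: an ordinary servlet, so java.io works here
public class BrandServiceImpl extends RemoteServiceServlet implements BrandService {
    @Override
    public String[] getBrands() {
        List<String> brands = new ArrayList<String>();
        try (BufferedReader read = new BufferedReader(new FileReader("Brands.csv"))) {
            String line;
            while ((line = read.readLine()) != null) {
                brands.add(line);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return brands.toArray(new String[brands.size()]);
    }
}
On the client you would call this through the matching asynchronous interface (a BrandServiceAsync with a void getBrands(AsyncCallback<String[]> callback) method), since GWT client code cannot block.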

How to run tshark in Java to get packets in real-time?

I have a problem with running tshark in Java. It seems that packets arrive in bulk instead of in real time (as happens when run from a terminal).
I tried a few different approaches:
ArrayList<String> command = new ArrayList<String>();
command.add("C:\\Program Files\\Wireshark\\tshark.exe");
ProcessBuilder pb = new ProcessBuilder(command);
Process process = pb.start();
BufferedReader br = null;
try {
    // tried different numbers for BufferedReader's last parameter
    br = new BufferedReader(new InputStreamReader(process.getInputStream()), 1);
    String line = null;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
} catch...
I also tried using InputStream's available() method, as seen in What does InputStream.available() do in Java?
I also tried NuProcess library with the following code:
NuProcessBuilder pb = new NuProcessBuilder(command);
ProcessHandler processHandler = new ProcessHandler();
pb.setProcessListener(processHandler);
NuProcess process = pb.start();
try {
process.waitFor(0, TimeUnit.SECONDS);
} catch (InterruptedException e) {
e.printStackTrace();
}
private class ProcessHandler extends NuAbstractProcessHandler {
    private NuProcess nuProcess;

    @Override
    public void onStart(NuProcess nuProcess) {
        this.nuProcess = nuProcess;
    }

    @Override
    public void onStdout(ByteBuffer buffer) {
        if (buffer == null)
            return;
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        System.out.println(new String(bytes));
    }
}
None of these methods works. Packets always arrive, as if buffered, only after about 50 have been sniffed.
Do you have any idea why this may be happening and how to solve it? It's pretty frustrating. I spent a lot of time looking at similar questions on SO, but none of them helped.
Do you see any errors in my code? Does it work in your case?
As the tshark man page says:
−l Flush the standard output after the information for each packet is
printed. (This is not, strictly speaking, line‐buffered if −V was
specified; however, it is the same as line‐buffered if −V wasn’t
specified, as only one line is printed for each packet, and, as −l
is normally used when piping a live capture to a program or script,
so that output for a packet shows up as soon as the packet is seen
and dissected, it should work just as well as true line‐buffering.
We do this as a workaround for a deficiency in the Microsoft Visual
C++ C library.)
This may be useful when piping the output of TShark to another
program, as it means that the program to which the output is piped
will see the dissected data for a packet as soon as TShark sees the
packet and generates that output, rather than seeing it only when
the standard output buffer containing that data fills up.
Try running tshark with the -l command-line argument.
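Applied to the code from the question, that is just one extra element in the command list; a sketch mirroring the question's snippet (capture/interface options elided as in the original):
ArrayList<String> command = new ArrayList<String>();
command.add("C:\\Program Files\\Wireshark\\tshark.exe");
command.add("-l"); // flush stdout after each packet is printed
ProcessBuilder pb = new ProcessBuilder(command);
Process process = pb.start();
try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line); // each packet should now show up as soon as tshark dissects it
    }
}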
I ran some tests to see how much buffering would be done by BufferedReader versus just using the input stream.
ProcessBuilder pb = new ProcessBuilder("ls", "-lR", "/");
System.out.println("pb.command() = " + pb.command());
Process p = pb.start();
byte ba[] = new byte[100];
InputStream is = p.getInputStream();
int bytecountsraw[] = new int[10000];
long timesraw[] = new long[10000];
long last_time = System.nanoTime();
for (int i = 0; i < timesraw.length; i++) {
int bytecount = is.read(ba);
long time = System.nanoTime();
timesraw[i] = time - last_time;
last_time = time;
bytecountsraw[i] = bytecount;
}
try (PrintWriter pw = new PrintWriter(new FileWriter("dataraw.csv"))) {
pw.println("bytecount,time");
for (int i = 0; i < timesraw.length; i++) {
pw.println(bytecountsraw[i] + "," + timesraw[i] * 1.0E-9);
}
} catch (Exception e) {
e.printStackTrace();
}
BufferedReader br = new BufferedReader(new InputStreamReader(is));
int bytecountsbuffered[] = new int[10000];
long timesbuffered[] = new long[10000];
last_time = System.nanoTime();
for (int i = 0; i < timesbuffered.length; i++) {
String str = br.readLine();
int bytecount = str.length();
long time = System.nanoTime();
timesbuffered[i] = time - last_time;
last_time = time;
bytecountsbuffered[i] = bytecount;
}
try (PrintWriter pw = new PrintWriter(new FileWriter("databuffered.csv"))) {
pw.println("bytecount,time");
for (int i = 0; i < timesbuffered.length; i++) {
pw.println(bytecountsbuffered[i] + "," + timesbuffered[i] * 1.0E-9);
}
} catch (Exception e) {
e.printStackTrace();
}
I tried to find a command that would just keep printing as fast as it could, so that any delays would be due to the buffering and/or ProcessBuilder rather than to the command itself. Here is a plot of the result.
You can plot the csv files with Excel, although I used a NetBeans plugin called DebugPlot. There wasn't a great deal of difference between the raw and the buffered reads. Both were bursty, with the majority of reads taking less than a microsecond, separated by peaks of 10 to 50 milliseconds. The scale of the plot is in nanoseconds, so the top of 5E7 is 50 milliseconds, or 0.05 seconds. If you test and get similar results, perhaps that is the best ProcessBuilder can do. If you get dramatically worse results with tshark than with other commands, perhaps there is an option to tshark, or the packets themselves are coming in bursts.

Reading a large text file faster

I'm trying to read a large text file as fast as possible.
Lines not beginning with '!' are passed over.
Lines with 8 CSV values have their last value removed.
There will never be a ',' in a value (didn't need to use opencsv).
Everything is added to a long string that is decoded later.
So this is my code:
BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\Documents\\ais_messages1.3.txt"));
String line, aisLines="", cvsSplitBy = ",";
try {
while ((line = br.readLine()) != null) {
if(line.charAt(0) == '!') {
String[] cols = line.split(cvsSplitBy);
if(cols.length>=8) {
line = "";
for(int i=0; i<cols.length-1; i++) {
if(i == cols.length-2) {
line = line + cols[i];
} else {
line = line + cols[i] + ",";
}
}
aisLines += line + "\n";
} else {
aisLines += line + "\n";
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
So right now it reads 36890 rows in 14 seconds. I also tried an InputStreamReader:
InputStreamReader isr = new InputStreamReader(new FileInputStream("C:\\Users\\Documents\\ais_messages1.3.txt"));
BufferedReader br = new BufferedReader(isr);
and it took the same amount of time. Is there a faster way to read a large text file (100,000 or 1,000,000 rows)?
Stop trying to build up aisLines as a big String. Use an ArrayList<String> that you append the lines onto. That takes 0.6% of the time of your method on my machine. (This code processes 1,000,000 simple lines in 0.75 seconds.) And it will reduce the effort needed to process the data later, as it'll already be split up by lines.
BufferedReader br = new BufferedReader(new FileReader("data.txt"));
List<String> aisLines = new ArrayList<String>();
String line, cvsSplitBy = ",";
try {
while ((line = br.readLine()) != null) {
if(line.charAt(0) == '!') {
String[] cols = line.split(cvsSplitBy);
if(cols.length>=8) {
line = "";
for(int i=0; i<cols.length-1; i++) {
if(i == cols.length-2) {
line = line + cols[i];
} else {
line = line + cols[i] + ",";
}
}
aisLines.add(line);
} else {
aisLines.add(line);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
If you really want a big String at the end (because you're interfacing with someone else's code, or whatever), it'll still be faster to convert the ArrayList back into a single string, than to do what you were doing.
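For example, a minimal sketch of that final conversion (String.join needs Java 8; on older versions a StringBuilder loop that appends each line plus '\n' does the same):
String aisText = String.join("\n", aisLines) + "\n";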
As the most time-consuming operation is I/O, the most efficient way is to split reading and parsing into separate threads:
private static void readFast(String filePath) throws IOException, InterruptedException {
    ExecutorService executor = Executors.newWorkStealingPool();
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    List<String> parsed = Collections.synchronizedList(new ArrayList<>());
    try {
        String line;
        while ((line = br.readLine()) != null) {
            final String l = line;
            executor.submit(() -> {
                if (l.charAt(0) == '!') {
                    parsed.add(parse(l));
                }
            });
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    executor.shutdown();
    executor.awaitTermination(1000, TimeUnit.MINUTES);
    String result = parsed.stream().collect(Collectors.joining("\n"));
}
On my PC it took 386 ms, versus 10787 ms with the slow version.
You can use a single thread to read your large CSV file and multiple threads to parse all the lines. The way I do it is using the Producer-Consumer pattern and a BlockingQueue.
Producer
Make one producer thread that is only responsible for reading the lines of your CSV file and storing them in the BlockingQueue. The producer side does not do anything else.
Consumers
Make multiple consumer threads and pass the same BlockingQueue object into them. Implement the time-consuming work in your consumer thread class.
The following code gives you an idea of how to solve the problem; it is not a complete solution.
I implemented this using Python, and it works much faster than using a single thread to do everything. The language is not Java, but the theory behind it is the same.
import csv
import gzip
import sys
import multiprocessing
import Queue

QUEUE_SIZE = 2000

def produce(file_queue, row_queue):
    while not file_queue.empty():
        src_file = file_queue.get()
        zip_reader = gzip.open(src_file, 'rb')
        try:
            csv_reader = csv.reader(zip_reader, delimiter=SDP_DELIMITER)
            for row in csv_reader:
                new_row = process_sdp_row(row)
                if new_row:
                    row_queue.put(new_row)
        finally:
            zip_reader.close()

def consume(row_queue):
    '''processes all rows; once the queue is empty, break the infinite loop'''
    while True:
        try:
            # takes a row from the queue and processes it
            pass
        except multiprocessing.TimeoutError as toe:
            print "timeout, all rows have been processed, quit."
            break
        except Queue.Empty:
            print "all rows have been processed, quit."
            break
        except Exception as e:
            print "critical error"
            print e
            break

def main(args):
    file_queue = multiprocessing.Queue()
    row_queue = multiprocessing.Queue(QUEUE_SIZE)
    file_queue.put(file1)
    file_queue.put(file2)
    file_queue.put(file3)
    # starts 4 producers
    for i in xrange(4):
        producer = multiprocessing.Process(target=produce, args=(file_queue, row_queue))
        producer.start()
    # starts 1 consumer
    consumer = multiprocessing.Process(target=consume, args=(row_queue,))
    consumer.start()
    # blocks main thread until consumer process finishes
    consumer.join()
    # prints statistics results after consumer is done
    sys.exit(0)

if __name__ == "__main__":
    main(sys.argv[1:])
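And since the question is about Java, here is a minimal Java sketch of the same pattern, assuming Java 8, an ArrayBlockingQueue, and a poison-pill sentinel to tell the consumers the producer is done (the file name and the parsing work are placeholders):
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerCsv {
    // a unique instance, compared by reference, so it can never collide with real data
    private static final String POISON_PILL = new String("EOF");

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2000);
        int nConsumers = Runtime.getRuntime().availableProcessors();

        // producer: only reads lines and puts them on the queue
        Thread producer = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader("input.csv"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);
                }
                for (int i = 0; i < nConsumers; i++) {
                    queue.put(POISON_PILL); // one pill per consumer
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // consumers: take lines off the queue and do the time-consuming work
        Thread[] consumers = new Thread[nConsumers];
        for (int c = 0; c < nConsumers; c++) {
            consumers[c] = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON_PILL) {
                        // parse the line here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[c].start();
        }

        producer.join();
        for (Thread consumer : consumers) {
            consumer.join();
        }
    }
}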

How to read from a huge file and write to a new file in Java

What I am doing is reading one file line by line, formatting every line, and then writing it to a new file. But the problem is that the file is huge, nearly 178 MB, and I always get the error message: IO console updater error, java heap space. Here is my code:
public class fileFormat {
    public static void main(String[] args) throws IOException {
        String strLine;
        FileInputStream fstream = new FileInputStream("train_final.txt");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fstream));
        BufferedWriter writer = new BufferedWriter(new FileWriter("newOUTPUT.txt"));
        while ((strLine = reader.readLine()) != null) {
            List<String> numberBox = new ArrayList<String>();
            StringTokenizer st = new StringTokenizer(strLine);
            while (st.hasMoreTokens()) {
                numberBox.add(st.nextToken());
            }
            for (int i = 1; i < numberBox.size(); i++) {
                String head = numberBox.get(0);
                String tail = numberBox.get(i);
                String line = head + " " + tail;
                System.out.println(line);
                writer.write(line);
                writer.newLine();
            }
            numberBox.clear();
        }
        reader.close();
        writer.close();
    }
}
How can I avoid this error message? Moreover, I have set the VM preference: -xms1024m
Remove the line
System.out.println(line);
This works around the failing console updater, which otherwise runs out of memory.
The program looks okay. I suspect the problem is that you run this inside of Eclipse, and System.out is collected by Eclipse in memory (to be displayed in that Console window).
System.out.println(line);
Try to run it outside of Eclipse, change Eclipse settings to pipe System.out somewhere, or remove the line.
This part of the code:
for (int i = 1; i < numberBox.size(); i++) {
    String head = numberBox.get(0);
    String tail = numberBox.get(i);
    String line = head + " " + tail;
    System.out.println(line);
    writer.write(line);
    writer.newLine();
}
Can be translated to:
String head = numberBox.get(0);
for (int i = 1; i < numberBox.size(); i++) {
    String tail = numberBox.get(i);
    System.out.print(head);
    System.out.print(" ");
    System.out.println(tail);
    writer.write(head);
    writer.write(" ");
    writer.write(tail);
    writer.newLine();
}
This may add a little code duplication but it avoids creating a lot of objects.
Also, if you merge this for loop with the loop constructing the numberBox, you won't need the numberBox structure at all; a sketch of that merged version follows.
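A minimal sketch of that merge, assuming the System.out debugging is dropped as well (it reuses reader, writer, and strLine from the original program):
while ((strLine = reader.readLine()) != null) {
    StringTokenizer st = new StringTokenizer(strLine);
    if (!st.hasMoreTokens()) {
        continue; // skip blank lines
    }
    String head = st.nextToken(); // first token of the line
    while (st.hasMoreTokens()) {
        // write "head tail" immediately instead of collecting tokens in a list
        writer.write(head);
        writer.write(" ");
        writer.write(st.nextToken());
        writer.newLine();
    }
}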
If you read the whole file at once, it will occupy heap memory, so the better option is to read the file in chunks. See my code below. It will start reading from the line offset given in the argument and will return the end offset. You need to pass the number of lines to be read.
Please remember: you can use any collection to store the lines that were read, and clear the collection before calling this method to read the next chunk.
FileInputStream fis = new FileInputStream(file);
InputStreamReader streamReader = new InputStreamReader(fis, "UTF-8");
LineNumberReader reader = new LineNumberReader(streamReader);

// call the method below repeatedly until the end of the file is reached
public int getParsedLines(LineNumberReader reader, int iLineNumber_Start, int iNumberOfLinesToBeRead) {
    int iLineNumber_End = 0;
    int iReadUptoLines = iLineNumber_Start + iNumberOfLinesToBeRead;
    try {
        reader.setLineNumber(iLineNumber_Start); // keep the reader's line counter in sync with the caller
        do {
            String str = reader.readLine();
            if (str == null) {
                break; // end of file
            }
            // your code: process str here
            iLineNumber_End = reader.getLineNumber();
        } while (iLineNumber_End != iReadUptoLines);
    } catch (Exception ex) {
        // exception handling
    }
    return iLineNumber_End;
}
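A hedged usage sketch of the calling loop the answer describes (the chunk size is a placeholder; the method returns a value no greater than the start offset once the file is exhausted):
int start = 0;
while (true) {
    // clear whatever collection the "your code" section fills before reading the next chunk
    int end = getParsedLines(reader, start, 10000);
    if (end <= start) {
        break; // end of file reached
    }
    // process the lines collected for this chunk, then continue
    start = end;
}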
