Reading a large text file faster - java

I'm trying to read a large text file as fast as possible.
Lines not beginning with '!' are passed over.
Lines with 8 CSV values have their last value removed.
There will never be a ',' inside a value (so I didn't need to use opencsv).
Everything is added to one long string that is decoded later.
So this is my code:
BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\Documents\\ais_messages1.3.txt"));
String line, aisLines = "", cvsSplitBy = ",";
try {
    while ((line = br.readLine()) != null) {
        if (line.charAt(0) == '!') {
            String[] cols = line.split(cvsSplitBy);
            if (cols.length >= 8) {
                line = "";
                for (int i = 0; i < cols.length - 1; i++) {
                    if (i == cols.length - 2) {
                        line = line + cols[i];
                    } else {
                        line = line + cols[i] + ",";
                    }
                }
                aisLines += line + "\n";
            } else {
                aisLines += line + "\n";
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
So right now it reads 36,890 rows in 14 seconds. I also tried an InputStreamReader:
InputStreamReader isr = new InputStreamReader(new FileInputStream("C:\\Users\\Documents\\ais_messages1.3.txt"));
BufferedReader br = new BufferedReader(isr);
and it took the same amount of time. Is there a faster way to read a large text file (100,000 or 1,000,000 rows)?

Stop trying to build up aisLines as one big String. Use an ArrayList<String> that you append the lines to. On my machine that takes 0.6% of the time of your method. (This code processes 1,000,000 simple lines in 0.75 seconds.) It will also reduce the effort needed to process the data later, as it'll already be split up by lines.
BufferedReader br = new BufferedReader(new FileReader("data.txt"));
List<String> aisLines = new ArrayList<String>();
String line, cvsSplitBy = ",";
try {
    while ((line = br.readLine()) != null) {
        if (line.charAt(0) == '!') {
            String[] cols = line.split(cvsSplitBy);
            if (cols.length >= 8) {
                line = "";
                for (int i = 0; i < cols.length - 1; i++) {
                    if (i == cols.length - 2) {
                        line = line + cols[i];
                    } else {
                        line = line + cols[i] + ",";
                    }
                }
                aisLines.add(line);
            } else {
                aisLines.add(line);
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
If you really want a big String at the end (because you're interfacing with someone else's code, or whatever), it'll still be faster to convert the ArrayList back into a single String than to do what you were doing.
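For example, a minimal sketch of that final conversion, assuming the aisLines list from the code above:
// Joining once at the end is linear in the total length, unlike
// repeated String concatenation in a loop, which is quadratic.
String allLines = String.join("\n", aisLines);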

Since the most expensive operation is I/O, the most efficient approach is to split reading and parsing across threads:
private static void readFast(String filePath) throws IOException, InterruptedException {
    ExecutorService executor = Executors.newWorkStealingPool();
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    List<String> parsed = Collections.synchronizedList(new ArrayList<>());
    try {
        String line;
        while ((line = br.readLine()) != null) {
            final String l = line;
            executor.submit(() -> {
                if (l.charAt(0) == '!') {
                    parsed.add(parse(l));
                }
            });
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    executor.shutdown();
    executor.awaitTermination(1000, TimeUnit.MINUTES);
    String result = parsed.stream().collect(Collectors.joining("\n"));
}
On my PC this took 386 ms vs. 10,787 ms for the slow version.

You can use a single thread to read your large CSV file and multiple threads to parse the lines. The way I do this is with the Producer-Consumer pattern and a BlockingQueue.
Producer
Make one producer thread that is only responsible for reading the lines of your CSV file and storing them in the BlockingQueue. The producer side does nothing else.
Consumers
Make multiple consumer threads, and pass the same BlockingQueue object to each of them. Implement the time-consuming work in your consumer thread class.
The following code gives you an idea of how to solve the problem; it is not the solution itself.
I implemented this in Python and it works much faster than having a single thread do everything. The language is not Java, but the theory behind it is the same.
import csv
import gzip
import sys
import multiprocessing
import Queue

QUEUE_SIZE = 2000

# SDP_DELIMITER and process_sdp_row are defined elsewhere.
def produce(file_queue, row_queue):
    while not file_queue.empty():
        src_file = file_queue.get()
        zip_reader = gzip.open(src_file, 'rb')
        try:
            csv_reader = csv.reader(zip_reader, delimiter=SDP_DELIMITER)
            for row in csv_reader:
                new_row = process_sdp_row(row)
                if new_row:
                    row_queue.put(new_row)
        finally:
            zip_reader.close()

def consume(row_queue):
    '''processes all rows; once the queue is empty, break the infinite loop'''
    while True:
        try:
            # take a row from the queue and process it
            pass
        except multiprocessing.TimeoutError as toe:
            print "timeout, all rows have been processed, quit."
            break
        except Queue.Empty:
            print "all rows have been processed, quit."
            break
        except Exception as e:
            print "critical error"
            print e
            break

def main(args):
    file_queue = multiprocessing.Queue()
    row_queue = multiprocessing.Queue(QUEUE_SIZE)
    file_queue.put(file1)
    file_queue.put(file2)
    file_queue.put(file3)
    # start 4 producers
    for i in xrange(4):
        producer = multiprocessing.Process(target=produce, args=(file_queue, row_queue))
        producer.start()
    # start 1 consumer
    consumer = multiprocessing.Process(target=consume, args=(row_queue,))
    consumer.start()
    # block the main thread until the consumer process is finished
    consumer.join()
    # print statistics results after the consumer is done
    sys.exit(0)

if __name__ == "__main__":
    main(sys.argv[1:])
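For completeness, here is a minimal Java sketch of the same Producer-Consumer pattern with a BlockingQueue; the parse(String) method is a placeholder for your per-line work:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerRead {

    // Sentinel object marking the end of input (compared by reference).
    private static final String POISON_PILL = new String("EOF");

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2000);
        int consumerCount = Runtime.getRuntime().availableProcessors();

        // Consumers: take lines from the queue and do the time-consuming parsing.
        Thread[] consumers = new Thread[consumerCount];
        for (int i = 0; i < consumerCount; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON_PILL) {
                        parse(line); // the expensive per-line work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }

        // Producer (the main thread): only reads lines and puts them on the queue.
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                queue.put(line); // blocks if the queue is full
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // One pill per consumer so that every consumer sees the end marker.
        for (int i = 0; i < consumerCount; i++) {
            queue.put(POISON_PILL);
        }
        for (Thread consumer : consumers) {
            consumer.join();
        }
    }

    private static void parse(String line) {
        // placeholder for the actual parsing logic
    }
}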

Related

Trying to read multiple lines of cmd input

I'm trying to write a method that:
Prints out a message (something like "Paste your input: ").
Waits for the user to press enter.
Reads all the lines that got pasted and joins them into one String.
(An empty line can be used to determine the end of the input.)
The first syso does the printing part, and the first line gets read correctly, but then it never exits the while loop. Why? There has to be an end?
public static String readInput(String msg) {
    System.out.print(msg);
    String res = "";
    try (BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in))) {
        String line;
        while ((line = buffer.readLine()) != null && !line.isBlank())
            res += "\n" + line;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return res;
}
I've already seen the following sites, but none of them helped:
How to read input with multiple lines in Java
https://www.techiedelight.com/read-multi-line-input-console-java/
Make the console wait for a user input to close
Edit:
The same bug applies for:
public static String readInput(String msg) {
    System.out.print(msg);
    String res = "";
    try (BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in))) {
        res = buffer.lines().reduce("", (r, l) -> r + "\n" + l);
        System.out.println(res);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return res;
}
Edit 2:
I've tried this code in my actual project and in a new test-project, but with different results. Here is a video of that test.
Why not use this statement?
while (!(line = buffer.readLine()).isEmpty())
In this case, sending an empty line will exit the loop.
Note, however, that pasting a large text that contains empty lines (for example, at the beginning of a new paragraph, or as spacing between paragraphs) will terminate the reading early.
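Also note that readLine() returns null at the end of the stream, so a null-safe sketch of that check would be:
String line;
while ((line = buffer.readLine()) != null && !line.isEmpty()) {
    res += "\n" + line;
}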

ProcessBuilder and the gobbler threads

When searching the web for tutorials on how to run a subprocess in Java and handle stdin, stdout, and stderr, I only find solutions with gobbler threads. Using gobbler threads means creating, scheduling, and cleaning up 3 threads for every subprocess call. If only a few processes are called, this additional overhead doesn't matter, but if thousands of subprocesses are called, e.g. when compiling a lot of files, this overhead adds up to a measurably longer processing time. In addition, using gobbler threads makes the implementation more complex.
So my question is: why is it common to use such an inefficient and complex solution?
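For reference, the gobbler pattern in question looks roughly like this (a minimal sketch; only the stdout gobbler is shown, real code needs the same for stderr and a writer for stdin):
static String runWithGobbler(String command) throws IOException, InterruptedException {
    Process prc = new ProcessBuilder(command.split(" ")).start();
    StringBuilder stdOut = new StringBuilder();
    // One dedicated thread drains stdout so the child never blocks on a full pipe.
    Thread outGobbler = new Thread(() -> {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(prc.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                stdOut.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
    outGobbler.start();
    prc.waitFor();     // wait for the child to exit
    outGobbler.join(); // wait for the gobbler to finish draining
    return stdOut.toString();
}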
Below is a much simpler and more efficient solution:
private void runProcess() throws IOException, InterruptedException {
    this.prc = new ProcessBuilder(this.commandLine.split(" ")).start();
    Iterator<String> it = this.getInputstreamList().iterator();
    StringBuilder stdOut = new StringBuilder();
    StringBuilder stdErr = new StringBuilder();
    try (BufferedWriter stdInBw = new BufferedWriter(new OutputStreamWriter(this.prc.getOutputStream()), 65536)) {
        while (it.hasNext()) {
            String line = it.next();
            stdInBw.write(line + "\n");
        }
        stdInBw.flush();
    }
    try (BufferedReader stdOutBr = new BufferedReader(new InputStreamReader(this.prc.getInputStream()), 65536)) {
        try (BufferedReader stdErrBr = new BufferedReader(new InputStreamReader(this.prc.getErrorStream()), 65536)) {
            while (true) {
                while (stdOutBr.ready()) {
                    String line = stdOutBr.readLine();
                    if (line == null) {
                        break;
                    }
                    stdOut.append(line + "\n");
                }
                while (stdErrBr.ready()) {
                    String line = stdErrBr.readLine();
                    if (line == null) {
                        break;
                    }
                    stdErr.append(line + "\n");
                }
                if (!this.prc.isAlive()) {
                    break;
                }
                this.prc.waitFor(50, TimeUnit.MILLISECONDS);
            }
        }
    }
    this.retVal = prc.exitValue();
    this.stdOut = stdOut.toString();
    this.stdErr = stdErr.toString();
}
My tests have shown that the above solution works like a charm. But maybe I've missed something which makes the above solution unusable in special cases. Any hints or doubts?

Java: Improving speed of a reader program

Hey, so I am working on this program that reads CSV files, and I need to make a method that can return one entire column of values.
Currently I do it like this:
List<String> data = new LinkedList<>();
for (int i = 0; i < getRowCount(); i++) {
    data.add(getRow(i).get(column));
}
Where getRow() is this:
List<String> data = new LinkedList<>();
String column;
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
    for (int i = 0; i < row; i++) {
        bufferedReader.readLine();
    }
    column = bufferedReader.readLine();
    for (String col : column.split(columnSeparator.toString())) {
        data.add(col);
    }
} catch (IOException e) {
    e.printStackTrace();
}
and it works. But the flaw is that if there are too many columns in a file, it takes way too long. It takes 27 seconds on 7,500 lines and 9 columns, and over 10 minutes on 35,000 lines and 16 columns. Do you know how I could make it faster?
Try to read the file once:
List<String> getColumn(int column) {
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        List<String> data = new LinkedList<>();
        String line = bufferedReader.readLine();
        while (line != null) {
            String[] cols = line.split(columnSeparator.toString());
            data.add(cols[column]);
            line = bufferedReader.readLine();
        }
        return data;
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
What you are doing is the following:
Prepare to read the file (creating the Reader object, ...), read the first line
Prepare to read the file, read the first line, read the second line
Prepare to read the file, read the first line, read the second line, read the third line
... and so on.
Apparently this is not very efficient (you're doing stuff in O(n²), with n = number of lines).
You could improve your code vastly if you do something like this:
Prepare to read the file
Read the first line
Read the second line
... and so on.
So first read all the lines at once:
List<String> lines = new LinkedList<>();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null)
        lines.add(line);
} catch (IOException e) {
    e.printStackTrace();
}
You can then iterate over the lines to split them into columns and extract the data you're interested in:
List<String> data = new LinkedList<>();
for (String line : lines)
    data.add(line.split(columnSeparator.toString())[column]);
Of course this still needs a little bit of error handling :)
I would suggest you try this:
int rowCount = getRowCount(); // evaluate the row count once, not on every iteration
for (int i = 0; i < rowCount; i++) {
    data.add(getRow(i).get(column));
}
getRowCount() is executed every single time the loop condition is checked in the for statement. You would eventually get all the rows, but calling it in the condition makes that method run once per iteration, and since it reads the file, you probably don't want to read the file that many times.

reading specific lines from file is extremely slow

I have created a method that reads specific lines from a file based on their line number. It works fine for most files, but when I try to read a file that contains a large number of really long lines, it takes ages, particularly as it gets further into the file. I've also done some debugging, and it appears to take a lot of memory as well, but I'm not sure if that can be improved. I know there are other questions that focus on how to read certain lines from a file, but this question is focused primarily on the performance aspect.
public static final synchronized List<String> readLines(final File file, final Integer start, final Integer end) throws IOException {
    BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
    List<String> lines = new ArrayList<>();
    try {
        String line = bufferedReader.readLine();
        Integer currentLine = 1;
        while (line != null) {
            if ((currentLine >= start) && (currentLine <= end)) {
                lines.add(line + "\n");
            }
            currentLine++;
            if (currentLine > end) {
                return lines;
            }
            line = bufferedReader.readLine();
        }
    } finally {
        bufferedReader.close();
    }
    return lines;
}
How can I optimize this method to be faster than light?
I realised that what I was doing before was inherently slow and used up too much memory.
By adding all lines to memory and then processing them in a List, it was not only taking twice as long but was also creating String variables for no reason.
I am now using a Java 8 Stream and processing at the point of reading, which is the fastest method I've used so far.
Path path = Paths.get(file.getAbsolutePath());
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
    for (String line : (Iterable<String>) stream::iterator) {
        // do stuff with each line as it is read
    }
}
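Applied to the original start/end method, a minimal sketch using skip and limit (assuming 1-based, inclusive line numbers) might look like this:
public static List<String> readLines(File file, int start, int end) throws IOException {
    try (Stream<String> stream = Files.lines(file.toPath(), StandardCharsets.UTF_8)) {
        return stream.skip(start - 1)        // drop the lines before 'start'
                     .limit(end - start + 1) // keep only the requested range
                     .collect(Collectors.toList());
    }
}
The lines before 'start' are still read, but they are discarded immediately instead of being buffered.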

BufferedReader - Search for string inside .txt file

I am trying, using a BufferedReader, to count the appearances of a string inside a .txt file. I am using:
File file = new File(path);
int appearances = 0; // declared outside the try block so it is visible afterwards
try {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    while ((line = br.readLine()) != null) {
        if (line.contains("Hello")) {
            appearances++;
        }
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Found " + appearances);
But the problem is that if my .txt file contains, for example, the string "Hello, world\nHello, Hello, world!" and "Hello" is to be found, then appearances becomes two instead of three, because it only counts one appearance of the string per line. How could I fix this? Thanks a lot.
The simplest solution is to do:
while ((line = br.readLine()) != null)
    appearances += line.split("Hello", -1).length - 1;
Note that if, instead of "Hello", you search for anything with regex-reserved characters, you should escape the string before splitting:
String escaped = Pattern.quote("Hello."); // avoid '.' special meaning in regex
while ((line = br.readLine()) != null)
    appearances += line.split(escaped, -1).length - 1;
This is an efficient and correct solution:
String line;
int count = 0;
while ((line = br.readLine()) != null) {
    int index = -1;
    while ((index = line.indexOf("Hello", index + 1)) != -1) {
        count++;
    }
}
return count;
It walks through the line and looks for the next index, starting from the previous index + 1.
The problem with Peter's solution is that it is wrong (see my comment). The problem with TheLostMind's solution is that it creates a lot of new strings by replacement, which is an unnecessary performance drawback.
A regex-driven version:
String line;
Pattern p = Pattern.compile(Pattern.quote("Hello")); // quote in case you need 'Hello.'
int count = 0;
while ((line = br.readLine()) != null) {
    for (Matcher m = p.matcher(line); m.find(); count++) { }
}
return count;
I am now curious as to performance between this and gexicide's version - will edit when I have results.
EDIT: benchmarked by running 100 times on a ~800k log file, looking for strings that were found once at the start, once around the middle, once at the end, and several times throughout. Results:
IndexFinder: 1579ms, 2407200hits. // gexicide's code
RegexFinder: 2907ms, 2407200hits. // this code
SplitFinder: 5198ms, 2407200hits. // Peter Lawrey's code, after quoting regexes
Conclusion: for non-regex strings, the repeated-indexOf approach is the fastest by a nice margin.
Essential benchmark code (log file from vanilla Ubuntu 12.04 installation):
public static void main(String... args) throws Exception {
    Finder[] fs = new Finder[] {
            new SplitFinder(), new IndexFinder(), new RegexFinder() };
    File log = new File("/var/log/dpkg.log.1"); // around 800k in size
    Find test = new Find();
    for (int i = 0; i < 100; i++) {
        for (Finder f : fs) {
            test.test(f, log, "2014");    // start
            test.test(f, log, "gnome");   // mid
            test.test(f, log, "ubuntu1"); // end
            test.test(f, log, ".1");      // multiple; not at start
        }
    }
    test.printResults();
}
For reference, the replacement-based approach discussed above:
while (line.contains("Hello")) {           // loop while the line still contains "Hello"
    appearances++;
    line = line.replaceFirst("Hello", ""); // replace the first occurrence with an empty String
}
