For a project I am working on, I am trying to count the vowels in a text file as fast as possible. To do so, I am trying a concurrent approach, and I was wondering whether it is possible to read a text file concurrently as a way to speed up the counting. I believe the bottleneck is the I/O; right now I am reading the file in via a buffered reader and processing it line by line, so I was wondering whether it is possible to read multiple sections of the file at once.
My original thought was to use
Split File - Java/Linux
but apparently MappedByteBuffers are not great performance-wise, and I would still need to read line by line from each MappedByteBuffer once I split.
Another option is to split after reading a certain number of lines, but that defeats the purpose.
Would appreciate any help.
The following will NOT split the file - but can help in concurrently processing it!
Using Streams in Java 8 you can do things like:
Stream<String> lines = Files.lines(Paths.get(filename));
lines.filter(StringUtils::isNotEmpty) // ignore empty lines
and if you want to run in parallel you can do:
lines.parallel().filter(StringUtils::isNotEmpty)
In the example above I was filtering empty lines - but of course you can modify it to your use (counting vowels) by implementing your own method and calling it.
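For example, a minimal sketch of the vowel-counting case (the file name is a placeholder, and whether parallel() actually helps depends on whether the work is CPU-bound rather than I/O-bound):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class VowelCount {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
            long vowels = lines.parallel()                   // let the stream framework split the work
                               .flatMapToInt(String::chars)  // one IntStream of characters per line
                               .filter(c -> "aeiouAEIOU".indexOf(c) >= 0)
                               .count();
            System.out.println("vowels: " + vowels);
        }
    }
}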
Related
In Java 8 I see a new method, lines(), has been added to the Files class, which can be used to read a file line by line in Java. Does it work for huge files? I mean, can we load the first 1000 lines, then the second set of 1000 lines? I have a huge file of 1 GB; will it work?
Could someone share a code snippet showing how to use it?
Does it work for huge files? [...] I have a huge file of 1 GB; will it work?
As far as I can see, it should work well for big files too (though I haven't tried):
try (Stream<String> lines = Files.lines(path)) {
    lines.filter(...).map(...).forEach(...);
}
I mean, can we load the first 1000 lines, then the second set of 1000 lines?
How many lines are read at one time is implementation-specific to Files.lines (which probably uses a BufferedReader, but I might be wrong).
From the API documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
This very strongly suggests that you can use this on any arbitrarily sized file, assuming your code doesn't hold all of the content in memory.
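As a rough illustration of that lazy behaviour (the file name is a placeholder): the first block walks a file of any size while holding only the current line, plus the reader's buffer, in memory; the second shows that a specific window of lines can be taken with skip/limit, at the cost of re-reading the file from the start each time:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LazyLines {
    public static void main(String[] args) throws IOException {
        // Count non-empty lines without ever materialising the whole file.
        try (Stream<String> lines = Files.lines(Paths.get("huge.log"))) {
            long nonEmpty = lines.filter(l -> !l.isEmpty()).count();
            System.out.println(nonEmpty);
        }

        // "Second set of 1000 lines": possible with skip/limit, but note that
        // this re-reads the file from the beginning each time it is called.
        try (Stream<String> lines = Files.lines(Paths.get("huge.log"))) {
            lines.skip(1000).limit(1000).forEach(System.out::println);
        }
    }
}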
I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file into several files, grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.
I am trying to do this in Scala and don't want to load the entire file into memory, so I load one line at a time using scala.io.Source.getLines(). That is working nicely. But I don't have a good way to write the lines out to separate files one at a time. I can think of two options:
Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.
Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters, keep writing to them until we have read all the lines in the original log file, and then close them. This doesn't seem like a very functional way to do it in Scala.
Being new to Scala, I am not sure whether there are other, better ways to accomplish something like this. Any thoughts or ideas are much appreciated.
You can do the second option in a pretty functional, idiomatic way in Scala: keep track of all of your PrintWriters and fold over the lines of the file:
import java.io._
import scala.io._
Source.fromFile(new File("/tmp/log")).getLines.foldLeft(Map.empty[String, PrintWriter]) {
  case (printers, line) =>
    val id = line.split(" ").head
    // reuse the writer for this id, or open a new one the first time the id is seen
    val printer = printers.getOrElse(id, new PrintWriter(new File(s"/tmp/log_$id")))
    printer.println(line)
    printers.updated(id, printer)
}.values.foreach(_.close)
Maybe in a production-level version, you'd want to wrap the I/O operations in a try (or Try), and keep track of failures that way, while still closing all the PrintWriters at the end.
I have a large (~100GB) text file structured like this:
A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar
Each line is a comma-separated pair of values. The file is sorted by the first value in the pair, and the lines are of variable length. Define a group as all lines with a common first value; i.e., in the example above, all lines starting with "A," would be one group and all lines starting with "B," would be another.
The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.
I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:
1) Scan the file using a BufferedReader, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable, process the previous group. Clear the accumulator, add the temporary and then continue reading the new group starting from the second line.
2) Scan the file using a BufferedReader, whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the group instead of the second. I have looked into mark() and reset() but these require knowing the byte-position of the start of the line.
I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
I think a PushbackReader would work:
if (lineBelongsToNewGroup) {
    // push the terminator back first, then the line, so the characters
    // come back out in the original order on the next read
    reader.unread('\n');
    reader.unread(lastLine.toCharArray());
}
I think option 1 is the simplest. I would suggest parsing the text yourself, rather than using BufferedReader, as it will take a long time to parse 100 GB.
The only option which is likely to be faster is to use a binary search, accessing the file via RandomAccessFile. You can memory-map 100 GB on a 64-bit JVM. This avoids the need to parse every line, which is pretty expensive. An advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have found each group boundary, you can copy the raw data in bulk without having to parse all the lines.
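For reference, a minimal sketch of option 1 (the file name and the processGroup method are placeholders standing in for the question's existing per-group routine):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class GroupReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("huge.txt"))) {
            List<String> group = new ArrayList<>();
            String currentKey = null;
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));
                if (currentKey != null && !key.equals(currentKey)) {
                    processGroup(currentKey, group);   // hand off the finished group
                    group.clear();
                }
                currentKey = key;
                group.add(line);
            }
            if (!group.isEmpty()) {
                processGroup(currentKey, group);       // flush the last group
            }
        }
    }

    static void processGroup(String key, List<String> lines) {
        // placeholder for the existing per-group routine
    }
}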
I am trying to download a file from a server in a user-specified number of parts (n). So a file of x bytes is divided into n parts, with each part downloading a piece of the whole file at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file actually works. I have read up on it and it seems the "Range" header needs to be used, but I do not know how to download the different parts and combine them without corrupting the data.
(Since it's a homework assignment I will only give you a hint)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive, but a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics - use random access (1, 2) to write data from each thread straight to the right location within the output file.
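As a rough sketch of the second approach (names like downloadPart are made up for illustration, and it assumes the server honours Range requests with a 206 Partial Content response):

import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

public class PartDownloader {
    // Fetch bytes [start, end] of the resource and write them at the matching
    // offset of the shared output file. One such call would run per thread.
    static void downloadPart(String fileUrl, String outputPath, long start, long end) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        conn.setRequestProperty("Range", "bytes=" + start + "-" + end); // inclusive byte range
        try (InputStream in = conn.getInputStream();
             RandomAccessFile out = new RandomAccessFile(outputPath, "rw")) {
            out.seek(start);                   // jump to this part's slot in the output file
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            conn.disconnect();
        }
    }
}

Because every thread seeks to its own offset, the parts never overlap and no merge step is needed afterwards.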
I've read some posts on Stack Overflow about this topic but I'm still confused. When reading a file in Java while it is still being written, how do you keep track of how many lines have actually been written, so that you don't get weird read results?
EDIT: sorry, I should have mentioned that the process writing the file is in C++ and the one reading it is in Java, so variables can't really be shared easily.
When reading a file in Java while it is still being written, how do you keep track of how many lines have actually been written so that you don't get weird read results?
The problem is that you can never be sure that the current last character of the file is the end of a line. If it is a line terminator, you are OK. If it is not, BufferedReader.readLine() will interpret the partial last line as a complete line ... and weird results will ensue.
What you need to do is to implement your own line buffering. When you get an EOF you wait until the file grows some more and then resume reading the line.
Alternatively, if you are using Java 7 or later, the file watcher APIs allow you to watch for file writes without polling the file's size.
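A rough sketch of that line-buffering idea, assuming a single-byte encoding and a hypothetical handleCompleteLine callback (the 500 ms poll interval is arbitrary):

import java.io.IOException;
import java.io.RandomAccessFile;

public class LineTailer {
    // Only hand out a line once its terminator has been seen; at EOF, wait for
    // the writer to append more instead of treating the partial line as complete.
    static void tail(String path) throws IOException, InterruptedException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            StringBuilder partial = new StringBuilder();
            while (true) {
                int c = file.read();
                if (c == -1) {
                    Thread.sleep(500);                 // file hasn't grown yet; poll again
                } else if (c == '\n') {
                    handleCompleteLine(partial.toString());
                    partial.setLength(0);
                } else if (c != '\r') {
                    partial.append((char) c);
                }
            }
        }
    }

    static void handleCompleteLine(String line) {
        System.out.println(line);                      // placeholder for real processing
    }
}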
By the way, there is an Apache commons class that is designed for doing this kind of thing:
http://commons.apache.org/io/api-2.0/org/apache/commons/io/input/Tailer.html
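Usage looks roughly like this (the file name and polling interval are placeholders; check the Tailer Javadoc of your commons-io version for the exact overloads):

import java.io.File;
import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class LogFollower {
    public static void main(String[] args) throws InterruptedException {
        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                System.out.println("complete line: " + line); // called once per full line
            }
        };
        // Creates and starts a tailer that polls the file every 1000 ms.
        Tailer tailer = Tailer.create(new File("app.log"), listener, 1000);
        Thread.sleep(60_000);  // the tailer runs on a daemon thread; keep the demo alive briefly
        tailer.stop();
    }
}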
If I understand correctly, the file is being written by some process (in C++, per your edit) and another Java process wants to read it while it is being written.
Look at the File Monitoring section on the tail command here. But I want to warn you that when I used the Cygwin tail on Windows recently to follow log files that were rolling over, it sometimes failed under heavy load. Other implementations may be more robust.
To have a count of the number of lines, just keep a counter on the side that's doing the writing.
So, every time you write a line, increment a counter, and make that counter readable via a method, something like public int getNumLinesWritten().
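A sketch of that writer-side counter, in Java for illustration (in this question the writer is actually C++, so the equivalent counter would live there, but the idea is the same):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: only the side doing the writing knows the true line count.
class CountingLineWriter implements AutoCloseable {
    private final BufferedWriter out;
    private final AtomicInteger linesWritten = new AtomicInteger();

    CountingLineWriter(String path) throws IOException {
        this.out = Files.newBufferedWriter(Paths.get(path));
    }

    public void writeLine(String line) throws IOException {
        out.write(line);
        out.newLine();
        out.flush();                       // make the complete line visible to readers
        linesWritten.incrementAndGet();    // count only after the full line is on disk
    }

    public int getNumLinesWritten() {
        return linesWritten.get();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}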
The obvious answer to me... Why not use a buffer? Use a string or whatever you need. (You could use a list/array of strings if you want, one for each line maybe?) Append to the string just as you would write to the file, then instead of reading from the file, read from that string. Would that work for you?