How to read a file line by line in Java 8?

In Java 8 I see a new method, lines(), has been added to the Files class, which can be used to read a file line by line. Does it work for huge files? I mean, can we load the first 1000 lines, then the second set of 1000 lines? I have a huge file of 1GB; will it work?
Could someone share a code snippet showing how to use it?

Does it work for huge files? [...] I have a huge file of 1GB; will it work?
As far as I can see, it should work well for big files too (but I haven't tried):
try (Stream<String> lines = Files.lines(path)) {
    lines.filter(...).map(...).forEach(...);
}
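To make that concrete, here is a minimal runnable sketch (the file name big.txt is a placeholder) that counts non-empty lines; the stream pulls lines lazily, so the whole 1GB file is never in memory at once:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LineCount {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("big.txt"); // placeholder: your 1GB file
        // try-with-resources closes the underlying file handle when done
        try (Stream<String> lines = Files.lines(path)) {
            long nonEmpty = lines.filter(s -> !s.isEmpty()).count();
            System.out.println(nonEmpty + " non-empty lines");
        }
    }
}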
I mean, can we load the first 1000 lines, then the second set of 1000 lines?
How many lines are read at a time is specific to the implementation of Files.lines (which probably uses a BufferedReader, but I might be wrong).

From the API documentation (emphasis mine):
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
This very strongly suggests that you can use this on any arbitrarily sized file, assuming your code doesn't hold all of the content in memory.
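And if you really do want the first 1000 lines, then the next 1000, and so on, one hedged way (not from the original answer; the path and batch handler are made up) is to drive the lazy stream through its iterator:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

public class BatchRead {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("big.txt"))) { // placeholder path
            Iterator<String> it = lines.iterator();
            while (it.hasNext()) {
                List<String> batch = new ArrayList<>(1000);
                while (it.hasNext() && batch.size() < 1000) {
                    batch.add(it.next()); // pulls the next line lazily from the file
                }
                handleBatch(batch); // hypothetical per-batch handler
            }
        }
    }

    private static void handleBatch(List<String> batch) {
        System.out.println("processing " + batch.size() + " lines");
    }
}

Only one batch of 1000 lines is held in memory at a time, which is exactly what the question asks for.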

Related

Splitting a file in Java for Multithreading

For a project I am working on, I am trying to count the vowels in a text file as fast as possible. In order to do so, I am trying a concurrent approach. I was wondering if it is possible to read a text file concurrently as a way to speed up the counting. I believe the bottleneck is the I/O, and since right now I am reading the file via a buffered reader and processing it line by line, I was wondering whether it is possible to read multiple sections of the file at once.
My original thought was to use the approach from Split File - Java/Linux, but apparently MappedByteBuffers are not great performance-wise, and I would still need to read line by line from each MappedByteBuffer once I split.
Another option is to split after reading a certain number of lines, but that defeats the purpose.
Would appreciate any help.
The following will NOT split the file, but it can help in processing it concurrently!
Using Streams in Java 8 you can do things like:
Stream<String> lines = Files.lines(Paths.get(filename));
lines.filter(StringUtils::isNotEmpty) // ignore empty lines
and if you want to run in parallel you can do:
lines.parallel().filter(StringUtils::isNotEmpty)
In the example above I was filtering out empty lines, but of course you can adapt it to your use case (counting vowels) by implementing your own method and calling it.
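For instance, a sketch of the vowel-counting variant (my adaptation, not the original answer's code; the path is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class VowelCount {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) { // placeholder path
            long vowels = lines.parallel()                  // let the framework fan out the work
                    .flatMapToInt(String::chars)            // one stream of characters per line
                    .filter(c -> "aeiouAEIOU".indexOf(c) >= 0)
                    .count();
            System.out.println(vowels + " vowels");
        }
    }
}

Whether parallel() actually helps depends on how well the stream behind Files.lines splits and on whether I/O really is the bottleneck, as the question suspects.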

Splitting a large log file into multiple files in Scala

I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file into several files, grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.
I am trying to do this in Scala, and I don't want to load the entire file into memory, so I load one line at a time using scala.io.Source.getLines(). That is working nicely. But I don't have a good way to write the lines out to separate files one at a time. I can think of two options:
Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.
Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters, keep writing to them until we have read all lines in the original log file, and then close them. This doesn't seem like a very functional way to do things in Scala.
Being new to Scala, I am not sure whether there are other, better ways to accomplish something like this. Any thoughts or ideas are much appreciated.
You can do the second option in pretty functional, idiomatic Scala. You can keep track of all of your PrintWriters, and fold over the lines of the file:
import java.io._
import scala.io._
Source.fromFile(new File("/tmp/log")).getLines.foldLeft(Map.empty[String, PrintWriter]) {
  case (printers, line) =>
    val id = line.split(" ").head
    val printer = printers.getOrElse(id, new PrintWriter(new File(s"/tmp/log_$id")))
    printer.println(line)
    printers.updated(id, printer)
}.values.foreach(_.close())
Maybe in a production-level version you'd want to wrap the I/O operations in a try (or Try) and keep track of failures that way, while still closing all the PrintWriters at the end.

Combining two files without creating another one

My program generates two files. The first one generated is usually huge, normally around 20GB. The one generated after that is a 'one line' file whose single line is the header for the first file. So my output should be one file that combines the two. Because of a storage constraint, I can't create another file to combine the two. What's the best way to get around that?
You cannot just "insert" data in the middle of a file. Using RandomAccessFile will overwrite the data that is already written at a given position of the file.
So the first solution is (if it is possible) to create the header first and then append your 20GB. If that is not possible, but you can estimate the length (in bytes) of your header, you can write garbage of the same length to the beginning of the file, then write your data, then go back to the beginning of the file and write (overwrite) the header.
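A minimal sketch of that reserve-then-overwrite approach, assuming the header's byte length is known up front (the file name and contents here are placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class HeaderPatch {
    public static void main(String[] args) throws IOException {
        byte[] header = "my header line\n".getBytes(StandardCharsets.UTF_8); // assumed known length
        try (RandomAccessFile raf = new RandomAccessFile("combined.txt", "rw")) {
            raf.write(new byte[header.length]); // 1. reserve exactly as many bytes as the header needs
            raf.write("the huge body...\n".getBytes(StandardCharsets.UTF_8)); // 2. write the data
            raf.seek(0);                        // 3. seek back to the beginning
            raf.write(header);                  // 4. overwrite the placeholder bytes with the header
        }
    }
}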
Use RandomAccessFile, which is available...

reading a file while it's being written

I've read some posts on stackoverflow about this topic but I'm still confused. When reading a file that is currently being written in Java, how do you keep track of how many lines have actually been written so that you don't get weird read results?
EDIT: sorry, I should have mentioned that the process writing the file is in C++ and the one reading it is in Java, so variables can't really be shared easily
When reading a file that is currently being written in Java, how do you keep track of how many lines have actually been written so that you don't get weird read results?
The problem is that you can never be sure that the current last character of the file is the end of a line. If it is a line terminator, you are OK. If not, BufferedReader.readLine() will interpret the trailing partial line as a complete line ... and weird results will ensue.
What you need to do is implement your own line buffering: when you get an EOF, wait until the file grows some more and then resume reading the line.
Alternatively, if you are using Java 7 or later, the file watcher APIs allow you to watch for file writes without polling the file's size.
By the way, there is an Apache Commons class that is designed for doing this kind of thing: http://commons.apache.org/io/api-2.0/org/apache/commons/io/input/Tailer.html
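A minimal sketch of that line-buffering idea (the file name and polling interval are assumptions, not from the answer); it only emits a line once its terminator has actually been seen:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SimpleTail {
    public static void main(String[] args) throws IOException, InterruptedException {
        StringBuilder partial = new StringBuilder(); // buffers an incomplete trailing line
        try (RandomAccessFile raf = new RandomAccessFile("app.log", "r")) { // placeholder file
            while (true) {
                int b = raf.read();
                if (b == -1) {
                    Thread.sleep(500);           // EOF: wait for the writer to append more
                } else if (b == '\n') {
                    System.out.println(partial); // a complete line has arrived
                    partial.setLength(0);
                } else {
                    partial.append((char) b);    // simplification: assumes single-byte characters
                }
            }
        }
    }
}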
If I understand correctly, the file is being written by a C++ process and another Java process wants to read it while it is being written.
Look at the File Monitoring section of the tail command documentation here. But I want to warn you that when I recently used the Cygwin tail on Windows to follow log files that were rolling over, it sometimes failed under heavy load. Other implementations may be more robust.
To have a count of the number of lines, just keep a counter on the side that's doing the writing. Every time you write a line, increment the counter, and make it readable via a method, something like public int getNumLinesWritten().
The obvious answer to me... why not use a buffer? Use a string or whatever you need (you could use a list/array of strings, one for each line, if you want). Append to the string just as you would write to the file, then read from that string instead of from the file. Would that work for you?

Reading a gz file and keeping track of position in file

So, here is the situation:
I have to read big .gz archives (GBs) and "index" them so that I can later retrieve specific pieces using random access.
In other words, I wish to read the archive line by line and be able to get the location in the file of any such line (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8, so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader that keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file location it reports corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the position of the current line.
I cannot use InputStreamReader directly for performance reasons: unbuffered reading would be way too slow, and, by the way, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. the file is zipped, and 2. RandomAccessFile uses "modified" UTF-8.
I guess the best approach would be a kind of buffered reader that keeps track of the file location and the buffer offset, but this sounds quite cumbersome. Then again, maybe I missed something; perhaps there is already something that does this, reading files line by line while keeping track of the location (even if zipped).
Thanks for any tips, Arnaud
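One way to "roll your own", as the question contemplates, might look like the sketch below (all class and file names here are invented, not from any answer). The trick is to count decompressed bytes yourself and split lines on the raw byte 0x0A, which never occurs inside a multi-byte UTF-8 sequence, so decoding each line afterwards stays correct:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reports the decompressed byte offset at which each line starts.
class OffsetLineReader {
    private final InputStream in;
    private long offset = 0; // decompressed bytes consumed so far

    OffsetLineReader(InputStream in) { this.in = in; }

    long offset() { return offset; }

    // Reads one '\n'-terminated line as UTF-8, or returns null at EOF.
    String readLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            offset++;
            if (b == '\n') break;
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) return null;
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}

public class OffsetDemo {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream("big.gz")))) { // placeholder file
            OffsetLineReader reader = new OffsetLineReader(in);
            long start = reader.offset();
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(start + ": " + line); // offset where this line starts
                start = reader.offset();
            }
        }
    }
}

Buffering sits below the counter, so the reported offset stays exact while reads stay fast. Note the offsets are positions in the decompressed stream; random access into the .gz itself still needs an index such as the one jzran (below) builds.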
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip(). These methods are declared both in InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
