How do we determine the number of lines in a text file? - java

Hi all I have a local file which looks like this:
AAA Anaa
AAC EL-ARISH
AAE Annaba
AAF APALACHICOLA MUNI AIRPORT
AAG ARAPOTI
AAL Aalborg Airport
AAM Mala Mala
AAN Al Ain
AAQ Anapa
AAR Aarhus Tirstrup Airport
AAT Altay
AAX Araxa
AAY Al Ghaydah
...
The Java Tutorials suggest estimating the number of lines in a file by taking java.io.File.length() and dividing the result by 50.
But isn't there a more "solid" way to get the number of lines in a text file (yet without having to pay for the overhead of reading the entire file)?

Can't you just read the file with a FileReader and count the number of lines read?
int lines = 0;
try (BufferedReader br = new BufferedReader(new FileReader("foo.in"))) {
    while (br.readLine() != null) {
        lines++;
    }
}

The benefit of the estimation algorithm you've got is that it is very fast: one stat(2) call and then some division. It takes the same amount of time and memory no matter how large or small the file is. But it's also vastly wrong on a huge number of inputs.
Probably the best way to get the exact number is to actually read through the entire file looking for '\n' characters. If you read the file in large binary blocks (think 16384 bytes or a larger power of two) and look for the specific byte you're interested in, it can go at something approaching the disk IO bandwidth.
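A minimal sketch of that block-scanning approach (the method name and the 16384-byte buffer size are just illustrative):
import java.io.FileInputStream;
import java.io.IOException;

// Counts '\n' bytes by scanning the file in large binary blocks.
// Note: a file whose last line has no trailing newline yields one less than its line count.
static long countNewlines(String path) throws IOException {
    byte[] buffer = new byte[16384];
    long count = 0;
    try (FileInputStream in = new FileInputStream(path)) {
        int n;
        while ((n = in.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buffer[i] == '\n') {
                    count++;
                }
            }
        }
    }
    return count;
}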

You need to use a BufferedReader and a counter that is incremented by 1 for each readLine().
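As an aside, on Java 8+ the same full read can be written compactly with NIO; this is an alternative to the loop above, not a way to avoid reading the file:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

static long countLines(String path) throws IOException {
    // Files.lines streams the file lazily; try-with-resources closes the underlying file.
    try (Stream<String> lines = Files.lines(Paths.get(path))) {
        return lines.count();
    }
}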

Related

What's the fastest way in Java to count lines starting with a String in a huge file

I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.
What would be the fastest way to achieve such a huge file traversal and String detection? Is there a more efficient approach than the following implementation using a Scanner and String.startsWith()?
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}
Note:
So far it looks like the Scanner is the bottleneck (i.e. if I add more complex processing than token detection and apply it on all lines, the overall execution time is more or less the same.)
I'm using SSDs so there is no room for improvement on the hardware side
Thanks in advance for your help.
A few pointers (the assumption is that the lines are relatively short and the data is really ASCII or similar):
read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line to prepend to the next read
search for bytes, do not waste time converting to chars
indicate "beginning of line" by starting your search pattern with '\n'; treat the first line specially
use a high-speed search that reduces search time at the expense of pre-processing (google for "fast substring search")
if actual line numbers (rather than the lines) are needed, count the lines in a separate stage
We can reduce the problem to searching for \n<token> in a bytestream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing; a rough sketch is below.
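A minimal sketch of that byte-level scan, assuming an ASCII-compatible encoding and lines shorter than the read buffer (the 1 MiB buffer size and the helper names are illustrative, not from the question):
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Counts lines starting with `token` by scanning raw bytes, carrying the
// incomplete line at the end of each chunk over to the next read.
static int countLinesStartingWith(String path, String token) throws IOException {
    byte[] tok = token.getBytes(StandardCharsets.US_ASCII);
    byte[] buf = new byte[1 << 20];            // 1 MiB chunks; tune empirically
    int count = 0, carry = 0;
    try (FileInputStream in = new FileInputStream(path)) {
        int n;
        while ((n = in.read(buf, carry, buf.length - carry)) != -1) {
            int end = carry + n, lineStart = 0;
            for (int i = 0; i < end; i++) {
                if (buf[i] == '\n') {
                    if (startsWith(buf, lineStart, i - lineStart, tok)) count++;
                    lineStart = i + 1;
                }
            }
            carry = end - lineStart;           // prepend the incomplete line to the next read
            System.arraycopy(buf, lineStart, buf, 0, carry);
        }
        if (carry > 0 && startsWith(buf, 0, carry, tok)) count++;   // last line without '\n'
    }
    return count;
}

static boolean startsWith(byte[] buf, int off, int len, byte[] tok) {
    if (len < tok.length) return false;
    for (int i = 0; i < tok.length; i++) {
        if (buf[off + i] != tok[i]) return false;
    }
    return true;
}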

How to increase io performances of this piece of code

How can I make this piece of code extremely quick?
It reads a raw image using RandomAccessFile (in) and writes it to a file using DataOutputStream (out):
final int WORD_SIZE = 4;
byte[] singleValue = new byte[WORD_SIZE];
long position = 0;
for (int i = 1; i <= 100000; i++) {
    out.writeBytes(i + " ");
    for (int j = 1; j <= 17; j++) {
        in.seek(position);
        in.read(singleValue);
        String str = Integer.toString(ByteBuffer.wrap(singleValue).order(ByteOrder.LITTLE_ENDIAN).getInt());
        out.writeBytes(str + " ");
        position += WORD_SIZE;
    }
    out.writeBytes("\n");
}
The inner for loop starts a new line in the output file every 17 values.
Thanks
I assume that the reason you are asking is because this code is running really slowly. If that is the case, then one reason is that each seek and read call is doing a system call. A RandomAccessFile has no buffering. (I'm guessing that singleValue is a byte[] of length 1.)
So the way to make this go faster is to step back and think about what it is actually doing. If I understand it correctly, it is reading each 4th byte in the file, converting them to decimal numbers and outputting them as text, 17 to a line. You could easily do that using a BufferedInputStream like this:
int b = bis.read(); // read a byte
bis.skip(3); // skip 3 bytes.
(with a bit of error checking ....). If you use a BufferedInputStream like this, most of the read and skip calls will operate on data that has already been buffered, and the number of syscalls will reduce to 1 for every N bytes, where N is the buffer size.
UPDATE - my guess was wrong. You are actually reading alternate words, so ...
bis.read(singleValue);
bis.skip(4);
Every 100000 offsets I have to jump 200000 and then do it again till the end of the file.
Use bis.skip(800000) to do that. It should do a big skip by moving the file position without actually reading any data. One syscall at most. (For a FileInputStream, at least.)
You can also speed up the output side by a roughly equivalent amount by wrapping the DataOutputStream around a BufferedOutputStream.
But System.out is already buffered.
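For what it's worth, a rough sketch of a fully buffered version of the loop above; it keeps the layout implied by the original code (contiguous 4-byte little-endian words written as text, 17 per line), the path names are placeholders, and a BufferedWriter stands in for the DataOutputStream on the text side:
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

// Reads consecutive 4-byte little-endian words and writes them as text, 17 per line,
// with buffering on both the input and the output side.
static void convert(String inPath, String outPath) throws IOException {
    try (DataInputStream in = new DataInputStream(
             new BufferedInputStream(new FileInputStream(inPath), 1 << 16));
         BufferedWriter out = new BufferedWriter(new FileWriter(outPath))) {
        for (int i = 1; i <= 100000; i++) {
            out.write(i + " ");
            for (int j = 1; j <= 17; j++) {
                // readInt() is big-endian, so swap bytes for a little-endian file
                int value = Integer.reverseBytes(in.readInt());
                out.write(value + " ");
                // an in.skip(...) here would handle non-contiguous words / the big jumps
            }
            out.write("\n");
        }
    }
}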

BufferedReader in Scanner's constructor

I am studying the BufferedReader, Scanner and InputStreamReader classes and their differences, and I understand the purpose of each one. I want an explanation to clarify one thing: what is the purpose of passing a BufferedReader to the Scanner's constructor? What is the specific reason for doing that?
Below is the example I am referring to.
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("file....")));
//more code here.........
A BufferedReader will create a buffer. This should result in faster reading from the file. Why? Because the buffer gets filled with the contents of the file, so you put a bigger chunk of the file in RAM (if you are dealing with small files, the buffer can contain the whole file). Now if the Scanner wants to read two bytes, it can read them from the buffer instead of having to ask the hard drive for two bytes.
Generally speaking, it is much faster to read 10 times 4096 bytes instead of 4096 times 10 bytes.

How to deal with reading and processing huge text files without getting OutofMemoryError

I wrote some straightforward code to read text files (>1 GB) and do some processing on Strings.
However, I have to deal with Java heap space problems, since I try to append Strings (using a StringBuilder) that get too big in memory usage at some point. I know that I can increase my heap space with, e.g., '-Xmx1024', but I would like to work with only a little memory usage here. How could I change my code below to manage my operations?
I am still a Java novice and maybe I made some mistakes in my code which may seem obvious to you.
Here's the code snippet:
private void setInputData() {
    Pattern pat = Pattern.compile("regex");
    BufferedReader br = null;
    Matcher mat = null;
    try {
        File myFile = new File("myFile");
        FileReader fr = new FileReader(myFile);
        br = new BufferedReader(fr);
        String line = null;
        String appendThisString = null;
        String processThisString = null;
        StringBuilder stringBuilder = new StringBuilder();
        while ((line = br.readLine()) != null) {
            mat = pat.matcher(line);
            if (mat.find()) {
                appendThisString = mat.group(1);
            }
            if (line.contains("|")) {
                processThisString = line.replace(" ", "").replace("|", "\t");
                stringBuilder.append(processThisString).append("\t").append(appendThisString);
                stringBuilder.append("\n");
            }
        }
        // doSomethingWithTheString(stringBuilder.toString());
    } catch (Exception ex) {
        ex.printStackTrace();
    } finally {
        try {
            if (br != null) br.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
Here's the error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at Test.setInputData(Test.java:47)
at Test.go(Test.java:18)
at Test.main(Test.java:13)
You could do a dry run, without appending, but counting the total string length.
If doSomethingWithTheString is sequential there would be other solutions.
You could tokenize the string, reducing the size. For instance, Huffman compression looks for already present sequences while reading a char, possibly extends the table, and then yields a table index. (The open source OmegaT translation tool uses such a strategy at one spot for tokens.) So it depends on the processing you want to do. Since you are reading a kind of CSV, a dictionary seems feasible.
In general I would use a database.
P.S. You can save half the memory by writing everything to a file and then rereading the file into one string. Or use a java.nio ByteBuffer on the file, i.e. a memory-mapped file.
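A minimal sketch of the memory-mapped option (the helper name is made up; note that a single map() call is limited to 2 GB, so very large files need several mappings):
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Maps the file into memory outside the Java heap; the OS pages it in on demand.
static MappedByteBuffer mapFile(String path) throws IOException {
    try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
}
You can then scan the mapped bytes directly (e.g. for '|' and '\n') without copying the whole file onto the heap.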
You can't use a StringBuilder in this case; it holds all the data in memory.
I think you should consider saving the result to a file line by line,
i.e. use a FileWriter instead of a StringBuilder (see the sketch below).
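A sketch of that change applied to the loop from the question; the output name "out.tsv" is made up, everything else mirrors the original code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private void setInputData() throws IOException {
    Pattern pat = Pattern.compile("regex");
    try (BufferedReader br = new BufferedReader(new FileReader("myFile"));
         BufferedWriter out = new BufferedWriter(new FileWriter("out.tsv"))) {
        String line;
        String appendThisString = null;
        while ((line = br.readLine()) != null) {
            Matcher mat = pat.matcher(line);
            if (mat.find()) {
                appendThisString = mat.group(1);
            }
            if (line.contains("|")) {
                // write each transformed line straight to disk instead of keeping it in memory
                out.write(line.replace(" ", "").replace("|", "\t") + "\t" + appendThisString);
                out.newLine();
            }
        }
    }
    // doSomethingWithTheString(...) would then read "out.tsv" as a stream instead
}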
The method doSomethingWithTheString() would probably need to change so that it accepts an InputStream as well. While reading the original file content and transforming it line by line, you should write the transformed content to a temporary file line by line. Then an input stream to that temporary file could be sent to the doSomethingWithTheString() method. Probably the method needs to be renamed to doSomethingWithInputStream().
From your example it is not clear what you are going to do with your enormous string once you have modified it. However since your modifications do not appear to span multiple lines I'd just write the modified data to a new file.
In order to do that, create and open a new FileWriter object before your while loop, move your stringBuilder declaration to the beginning of the loop, and write stringBuilder to your new file at the end of the loop.
If, on the other hand, you do need to combine data coming from different lines, consider using a database. Which kind depends on the nature of your data. If it has a record-like organization you might adopt a relational database, such as Apache Derby or MySQL; otherwise you might check out so-called NoSQL databases, such as Cassandra or MongoDB.
The general strategy is to design your application so that it doesn't need to hold the entire file (or too large a proportion of it) in memory.
Depending on what your application does:
You could write the intermediate data to a file and read it back again a line at a time to process it.
You could pass each line read to the processing algorithm; e.g. by calling doSomethingWithTheString(...) on each line individually rather than all of them.
But if you need to have the entire file in memory, you are between a rock and a hard place.
The other thing to note is that using a StringBuilder like that may require up to 6 times as much memory as the file size. It goes like this.
When the StringBuilder needs to expand its internal buffer it does this by making a char array twice the size of the current buffer, and copying from the old to the new. At that point you have 3 times as much buffer space allocated as you had before the buffer expansion started. Now suppose that there was just one more character to append to the buffer.
If the file is in ASCII (or another 8 bit charset), the StringBuilder's buffer needs twice that amount of memory ... because it consists of char not byte values.
If you have a good estimate of the number of characters that will be in the final string (e.g. from the file size), you can avoid the x3 multiplier by giving a capacity hint when you create the StringBuilder. However, you mustn't underestimate, 'cos if you underestimate just slightly ...
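For instance, assuming the file is ASCII and myFile is the File object from the question, a capacity hint could look like this:
// Pre-size the builder from the file length to avoid repeated buffer doubling.
// Safe only if this is not an underestimate of the final character count.
StringBuilder stringBuilder =
        new StringBuilder((int) Math.min(myFile.length(), Integer.MAX_VALUE - 8));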
You could also use a byte-oriented buffer (e.g. a ByteArrayOutputStream) instead of a StringBuilder ... and then read it back with a ByteArrayInputStream / InputStreamReader / BufferedReader pipeline.
But ultimately, holding a large file in memory doesn't scale as the file size increases.
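A sketch of that byte-oriented variant; it still keeps everything in memory, but as UTF-8 bytes rather than chars, so roughly half as much for ASCII data (variable names come from the question's code):
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Accumulate the transformed lines as UTF-8 bytes ...
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
Writer collector = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
// ... inside the loop:
collector.write(processThisString + "\t" + appendThisString + "\n");
// ... after the loop:
collector.flush();
// ... and read them back through a Reader pipeline (toByteArray() makes one copy):
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(bytes.toByteArray()), StandardCharsets.UTF_8));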
Are you sure there is a line terminator in the file? If not, your while loop will just keep looping and lead to your error. If so, it might be worth trying to read a fixed number of bytes at a time so that the reader won't grow infinitely.
I suggest using Guava's FileBackedOutputStream. You gain the advantage of having an OutputStream that spills to disk instead of eating up main memory. Of course access will be slower due to the disk I/O, but if you are dealing with such a large stream, and you are unable to chunk it into a more manageable size, it is a good option.

How do I convert a file's line number to a byte offset (or get the byte offset of the beginning of each line with a BufferedReader)?

I'm using a FileReader wrapped in a LineNumberReader to index a large text file for speedy access later on. Trouble is I can't seem to find a way to read a specific line number directly. BufferedReader supports the skip() function, but I need to convert the line number to a byte offset (or index the byte offset in the first place).
I took a crack at it using RandomAccessFile, and while it worked, it was horribly slow during the initial indexing. BufferedReader's speed is fantastic, but... well, you see the problem.
Some key info:
The file can be any size (currently 35,000 lines)
It's stored on Android's internal filesystem (via getFilesDir() to be exact)
The formatting is not fixed width, unfortunately (hence the need to read by line)
Any ideas?
Describes an extended RandomAccessFile with buffering semantics
Trouble is I can't seem to find a way to read a specific line number directly
Unless you know the length of each line you can't read it directly
There is no shortcut; you will need to read the entire file up front and calculate the offsets manually.
I would just use a BufferedReader and then get the length of each string and add 1 (or 2?) for the EOL string.
Consider saving a file index along with the large text file. If this file is something you are generating, either on your server or on the device, it should be trivial to generate an index once and distribute and/or save it along with the file.
I'd recommend an int[] where each value is the absolute byte offset of the start of every n-th line. So you could have an array of size 35,000 with the start of each line, or an array of size 350 with the start of every 100th line.
Here's an example assuming you have an index file containing a raw sequence of int values:
public String getLineByNumber(RandomAccessFile index,
                              RandomAccessFile data,
                              int lineNum) throws IOException {
    index.seek(lineNum * 4);
    data.seek(index.readInt());
    return data.readLine();
}
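Building that index file in the first place could look something like the sketch below; it assumes a single-byte encoding (ASCII/ISO-8859-1) and '\n' line endings, so each line costs line.length() + 1 bytes:
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

// Writes the byte offset of the start of every line as a raw sequence of ints,
// matching the index format read by getLineByNumber() above.
static void buildLineIndex(String dataPath, String indexPath) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(dataPath));
         DataOutputStream index = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(indexPath)))) {
        long offset = 0;
        String line;
        while ((line = in.readLine()) != null) {
            index.writeInt((int) offset);
            offset += line.length() + 1;   // +1 for '\n'; use +2 for "\r\n" files
        }
    }
}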
I took a crack at it using RandomAccessFile, and while it worked, it was horribly slow during the initial indexing
You've started the hard part already. Now for the harder part.
BufferedReader's speed is fantastic, but...
Is there something in your use of RandomAccessFile that made it slower than it has to be? How many bytes did you read at a time? If you read one byte at a time it will be sloooooow. If you read an array of bytes at a time, you can speed things up and use the byte array as a buffer.
Just wrapping up the previous comments:
Either you use a RandomAccessFile to count bytes first and then parse what you read to find lines by hand, OR you use a LineNumberReader to read line by line and count the bytes of each line of chars (2 bytes per char in UTF-16?) by hand.
