Line reader fails depending on buffer size - java

I need to read an input stream line by line. A line is considered to be terminated only by CRLF, but not by a single CR or LF. This rules out BufferedReader's readLine() and had me implement my own solution:
final class LineReader
{
private final Reader reader;
private final char[] buffer;
private final Queue<String> lines = new LinkedList<>();
private StringBuilder line = new StringBuilder();
private boolean cr = false;
LineReader(final Reader reader, final int bufferSize)
{
this.reader = reader;
buffer = new char[bufferSize];
}
String readLine() throws IOException
{
while (lines.peek() == null)
{
final int read = reader.read(buffer);
if (read == - 1)
{
if (line == null)
{
return null;
}
// Reached EOF. Return the last line.
lines.add(line.toString());
line = null;
continue;
}
// Split the buffer by line.
int offset = 0;
for (int i = 0; i < read; i++)
{
final char ch = buffer[i];
if (cr)
{
// Last character was CR.
switch (ch)
{
case '\n':
// Found a CRLF.
if (i != 0)
{
line.append(buffer, offset, i - 1 - offset);
}
// Next line starts at the next character.
offset = i + 1;
lines.add(line.toString());
line = new StringBuilder();
cr = false;
break;
case '\r':
break;
default:
cr = false;
break;
}
}
else if (ch == '\r')
{
cr = true;
}
}
// Append remaining characters to the next line.
line.append(buffer, offset, read - offset);
}
return lines.poll();
}
}
Initially, the reader passed some naive tests. However, once I started altering the buffer size, I noticed that some tests failed.
@Test
void readLine() throws IOException
{
final String[] lines = new String[]{"foo bar", "baz", ""};
final String str = Stream.of(lines).collect(joining("\r\n"));
final Collection<Executable> assertions = new LinkedList<>();
for (int bufferSize = 1; bufferSize <= 10; bufferSize++)
{
final LineReader reader = new LineReader(new StringReader(str),
bufferSize);
assertions.add(() ->
{
for (int i = 0; i < lines.length; i++)
{
assertEquals(lines[i], reader.readLine());
}
assertNull(reader.readLine());
});
}
assertAll(assertions);
}
More specifically, the equality assertion only fails when the buffer size is set to 1, 2, 4 or 8. And even stranger, the error messages are all blank.
Multiple Failures (4 failures)
>
>
>
>
org.opentest4j.MultipleFailuresError: Multiple Failures (4 failures)
>
>
>
>
at org.junit.jupiter.api.AssertAll.assertAll(AssertAll.java:80)
at org.junit.jupiter.api.AssertAll.assertAll(AssertAll.java:54)
...
I can't wrap my head around this.

So it turned out that I've been wrongly appending a CR when it's the last character in the buffer, even though it is followed by a LF. The redundant CR also caused the error messages to be truncated weirdly in my console output. Below is the working method:
String readLine() throws IOException
{
while (lines.peek() == null)
{
if (line == null)
{
break;
}
final int read = reader.read(buffer);
if (read == - 1)
{
lines.add(line.toString());
line = null;
continue;
}
int offset = 0;
for (int i = 0; i < read; i++)
{
final char ch = buffer[i];
if (cr)
{
if (ch == '\n')
{
if (i != 0)
{
line.append(buffer, offset, i - 1 - offset);
}
offset = i + 1;
lines.add(line.toString());
line = new StringBuilder();
cr = false;
}
else
{
if (i == 0)
{
line.append('\r');
}
if (ch != '\r')
{
cr = false;
}
}
}
else if (ch == '\r')
{
cr = true;
}
}
line.append(buffer, offset, read - offset - (cr ? 1 : 0));
}
return lines.poll();
}
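The fix comes down to holding back a CR until the next character shows whether it is part of a CRLF pair, which is what makes the result independent of the buffer size. Here is a minimal sketch of that idea. The helper `CrlfSplitter.split` is hypothetical (it is not part of the code above) and collects all lines at once instead of queueing them:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

final class CrlfSplitter {
    // Hypothetical helper: splits on CRLF only; a lone CR or LF stays in the line.
    static List<String> split(Reader reader, int bufferSize) throws IOException {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        char[] buffer = new char[bufferSize];
        boolean pendingCr = false; // a CR was seen but has not been appended yet
        int read;
        while ((read = reader.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                char ch = buffer[i];
                if (pendingCr) {
                    pendingCr = false;
                    if (ch == '\n') {
                        // CRLF complete, even if CR and LF came from different reads.
                        lines.add(line.toString());
                        line.setLength(0);
                        continue;
                    }
                    line.append('\r'); // the held-back CR was ordinary data
                }
                if (ch == '\r') {
                    pendingCr = true;
                } else {
                    line.append(ch);
                }
            }
        }
        if (pendingCr) {
            line.append('\r'); // input ended with a lone CR
        }
        lines.add(line.toString()); // last line, terminated by EOF
        return lines;
    }
}
```

Because the CR is never appended speculatively, there is no offset arithmetic that could go wrong at a buffer boundary; the cost is appending one character at a time instead of in bulk.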

Related

In Java, how would I get this countLines method to count a line without a newline character?

The following code counts the lines in a text file, but it does not count a final line that lacks a trailing newline ('\n') character:
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n' /* || c[i] != null */ ) {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
When I tried adding in the code c[i] != null into the if-condition, it gave this error :
NewParentClass.java:72: error: incomparable types: byte and <null>
if (c[i] == '\n' || c[i] != null ) {
BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
int lines = 0;
while (reader.readLine() != null) lines++;
reader.close();
You are not using your empty flag correctly. Instead of initializing it to false ahead of the nested loop, you need to set it to true when the character is '\n' and to false when it's not:
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
empty = true;
} else {
empty = false;
}
}
}
if (!empty) {
count++;
}
return count;
Once you reach the end of the method, use empty to decide if line count should be incremented or not. This will cover situations when your file has more than one line.
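The same idea as a self-contained sketch that works on any InputStream. The class and method names (`TrailingLineCounter.count`) are illustrative and not taken from the question:

```java
import java.io.IOException;
import java.io.InputStream;

final class TrailingLineCounter {
    // Counts '\n' terminators, plus one for a final line that has no terminator.
    static int count(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        int count = 0;
        boolean lastWasNewline = true; // true for empty input, so it counts as 0 lines
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    count++;
                    lastWasNewline = true;
                } else {
                    lastWasNewline = false;
                }
            }
        }
        return lastWasNewline ? count : count + 1;
    }
}
```

With this rule, "one\ntwo" and "one\ntwo\n" both count as two lines, and an empty stream counts as zero.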

Java - Writing a Method to Count Lines In a Text File Without Throwing Exceptions

Below is a solution from Number of lines in a file in Java
to quickly count the number of lines in a text file.
However, I am trying to write a method that will perform the same task without throwing an 'IOException'.
Below the original solution is my attempt, which uses a nested try-catch block (is this usually done, frowned upon, or easily avoidable?). It returns 0 no matter how many lines are in the given file, which is obviously wrong.
Just to be clear, I am not looking for advice on how to better use the original method that does contain the exception and, therefore, the context within which I am using it is irrelevant to this question.
Can somebody please help me write a method that counts the number of lines in a text file and does not throw any exceptions? (In other words, deals with potential errors with a try-catch.)
Original line counter by martinus:
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
My Attempt:
public int countLines(String fileName ) {
InputStream input = null;
try{
try{
input = new BufferedInputStream(new FileInputStream(fileName));
byte[] count = new byte[1024];
int lines = 0;
int forChar;
boolean empty = true;
while((forChar = input.read(count)) != -1){
empty = false;
for(int x = 0; x < forChar; x++){
if(count[x] == '\n'){
lines++;
}
}
}
return (!empty && lines == 0) ? 1 : lines + 1;
}
finally{
if(input != null)
input.close();
}
}
catch(IOException f){
int lines = 0;
return lines;
}
}
It is more robust to use char instead of byte for '\n' and return -1 in case of any errors, for example if the filename does not exist:
public static int countLines(String filename) {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(new FileInputStream(filename)));
char[] c = new char[1024];
int count = 0;
int readChars = 0;
boolean emptyLine = true;
while ((readChars = br.read(c)) != -1) {
for (int i = 0; i < readChars; ++i) {
emptyLine = false;
if (c[i] == '\n') {
++count;
emptyLine = true;
}
}
}
return count + (!emptyLine ? 1 : 0);
} catch (IOException ex) {
return -1;
} finally {
if (br != null)
try {
br.close();
} catch (IOException e) {
// Ignore intentionally
}
}
}
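On Java 7 and later, try-with-resources removes the need for the nested try/finally, because the reader is closed automatically. A sketch under that assumption follows; the class name `SafeLineCounter` is hypothetical, and it decodes the file as UTF-8 explicitly (the answer above uses the platform default), so a file with invalid UTF-8 bytes is also reported as -1:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

final class SafeLineCounter {
    /** Returns the number of lines, or -1 if the file cannot be read. */
    static int countLines(String filename) {
        try (BufferedReader br =
                Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
            char[] c = new char[1024];
            int count = 0;
            boolean lastWasNewline = true;
            int readChars;
            while ((readChars = br.read(c)) != -1) {
                for (int i = 0; i < readChars; i++) {
                    lastWasNewline = (c[i] == '\n');
                    if (lastWasNewline) {
                        count++;
                    }
                }
            }
            // Count a final line that is not terminated by '\n'.
            return lastWasNewline ? count : count + 1;
        } catch (IOException | InvalidPathException ex) {
            // Missing file, unreadable file, bad encoding, or a malformed path string.
            return -1;
        }
    }
}
```

Returning a sentinel such as -1 keeps "could not read" distinguishable from "file has 0 lines", which returning 0 in the catch block does not.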
Sharing my attempt.
public static int countLines(String filename) {
int numLines = 0;
// try-with-resources closes the stream, so no finally block is needed.
try (InputStream is = new BufferedInputStream(new FileInputStream(filename))) {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
numLines = (count == 0 && !empty) ? 1 : count;
} catch (FileNotFoundException ex) {
// Must come before IOException, which is its superclass.
System.out.println("File not found.");
numLines = 0;
} catch (IOException ex) {
numLines = 0;
}
return numLines;
}

How to read file from end to start (in reverse order) in Java?

I want to read my file in the opposite direction, from the end to the start:
[1322110800] LOG ROTATION: DAILY
[1322110800] LOG VERSION: 2.0
[1322110800] CURRENT HOST STATE:arsalan.hussain;DOWN;HARD;1;CRITICAL - Host Unreachable (192.168.1.107)
[1322110800] CURRENT HOST STATE: localhost;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.06 ms
[1322110800] CURRENT HOST STATE: musewerx-72c7b0;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.27 ms
I use this code to read it:
String strpath="/var/nagios.log";
FileReader fr = new FileReader(strpath);
BufferedReader br = new BufferedReader(fr);
String ch;
int time=0;
String Conversion="";
do {
ch = br.readLine();
out.print(ch+"<br/>");
} while (ch != null);
fr.close();
I would prefer to read it in reverse order using a BufferedReader.
I had the same problem as described here. I want to look at lines in file in reverse order, from the end back to the start (The unix tac command will do it).
However my input files are fairly large so reading the whole file into memory, as in the other examples was not really a workable option for me.
Below is the class I came up with, it does use RandomAccessFile, but does not need any buffers, since it just retains pointers to the file itself, and works with the standard InputStream methods.
It works for my cases, for empty files, and for a few other things I've tried. I don't have Unicode characters or anything fancy, but as long as the lines are delimited by LF, and even if they have LF + CR, it should work.
Basic Usage is :
in = new BufferedReader (new InputStreamReader (new ReverseLineInputStream(file)));
while(true) {
String line = in.readLine();
if (line == null) {
break;
}
System.out.println("X:" + line);
}
Here is the main source:
package www.kosoft.util;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
public class ReverseLineInputStream extends InputStream {
RandomAccessFile in;
long currentLineStart = -1;
long currentLineEnd = -1;
long currentPos = -1;
long lastPosInFile = -1;
public ReverseLineInputStream(File file) throws FileNotFoundException {
in = new RandomAccessFile(file, "r");
currentLineStart = file.length();
currentLineEnd = file.length();
lastPosInFile = file.length() -1;
currentPos = currentLineEnd;
}
public void findPrevLine() throws IOException {
currentLineEnd = currentLineStart;
// There are no more lines, since we are at the beginning of the file and no lines.
if (currentLineEnd == 0) {
currentLineEnd = -1;
currentLineStart = -1;
currentPos = -1;
return;
}
long filePointer = currentLineStart -1;
while ( true) {
filePointer--;
// we are at start of file so this is the first line in the file.
if (filePointer < 0) {
break;
}
in.seek(filePointer);
int readByte = in.readByte();
// We ignore last LF in file. search back to find the previous LF.
if (readByte == 0xA && filePointer != lastPosInFile ) {
break;
}
}
// we want to start at pointer +1 so we are after the LF we found or at 0 the start of the file.
currentLineStart = filePointer + 1;
currentPos = currentLineStart;
}
public int read() throws IOException {
if (currentPos < currentLineEnd ) {
in.seek(currentPos++);
int readByte = in.readByte();
return readByte;
}
else if (currentPos < 0) {
return -1;
}
else {
findPrevLine();
return read();
}
}
}
Apache Commons IO has the ReversedLinesFileReader class for this now (well, since version 2.2).
So your code could be:
String strpath="/var/nagios.log";
ReversedLinesFileReader fr = new ReversedLinesFileReader(new File(strpath));
String ch;
int time=0;
String Conversion="";
do {
ch = fr.readLine();
out.print(ch+"<br/>");
} while (ch != null);
fr.close();
The ReverseLineInputStream posted above is exactly what I was looking for. The files I am reading are large and cannot be buffered.
There are a couple of bugs:
File is not closed
if the last line is not terminated the last 2 lines are returned on the first read.
Here is the corrected code:
package www.kosoft.util;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
public class ReverseLineInputStream extends InputStream {
RandomAccessFile in;
long currentLineStart = -1;
long currentLineEnd = -1;
long currentPos = -1;
long lastPosInFile = -1;
int lastChar = -1;
public ReverseLineInputStream(File file) throws FileNotFoundException {
in = new RandomAccessFile(file, "r");
currentLineStart = file.length();
currentLineEnd = file.length();
lastPosInFile = file.length() -1;
currentPos = currentLineEnd;
}
private void findPrevLine() throws IOException {
if (lastChar == -1) {
in.seek(lastPosInFile);
lastChar = in.readByte();
}
currentLineEnd = currentLineStart;
// There are no more lines, since we are at the beginning of the file and no lines.
if (currentLineEnd == 0) {
currentLineEnd = -1;
currentLineStart = -1;
currentPos = -1;
return;
}
long filePointer = currentLineStart -1;
while ( true) {
filePointer--;
// we are at start of file so this is the first line in the file.
if (filePointer < 0) {
break;
}
in.seek(filePointer);
int readByte = in.readByte();
// We ignore last LF in file. search back to find the previous LF.
if (readByte == 0xA && filePointer != lastPosInFile ) {
break;
}
}
// we want to start at pointer +1 so we are after the LF we found or at 0 the start of the file.
currentLineStart = filePointer + 1;
currentPos = currentLineStart;
}
public int read() throws IOException {
if (currentPos < currentLineEnd ) {
in.seek(currentPos++);
int readByte = in.readByte();
return readByte;
} else if (currentPos > lastPosInFile && currentLineStart < currentLineEnd) {
// last line in file (first returned)
findPrevLine();
if (lastChar != '\n' && lastChar != '\r') {
// last line is not terminated
return '\n';
} else {
return read();
}
} else if (currentPos < 0) {
return -1;
} else {
findPrevLine();
return read();
}
}
@Override
public void close() throws IOException {
if (in != null) {
in.close();
in = null;
}
}
}
The proposed ReverseLineInputStream is really slow when you try to read thousands of lines. On my PC (Intel Core i7, SSD drive) it took about 80 seconds for 60k lines. Here is an optimized version inspired by it, with buffered reading (as opposed to the one-byte-at-a-time reading in ReverseLineInputStream). A 60k-line log file is read in 400 milliseconds:
public class FastReverseLineInputStream extends InputStream {
private static final int MAX_LINE_BYTES = 1024 * 1024;
private static final int DEFAULT_BUFFER_SIZE = 1024 * 1024;
private RandomAccessFile in;
private long currentFilePos;
private int bufferSize;
private byte[] buffer;
private int currentBufferPos;
private int maxLineBytes;
private byte[] currentLine;
private int currentLineWritePos = 0;
private int currentLineReadPos = 0;
private boolean lineBuffered = false;
public FastReverseLineInputStream(File file) throws IOException {
this(file, DEFAULT_BUFFER_SIZE, MAX_LINE_BYTES);
}
public FastReverseLineInputStream(File file, int bufferSize, int maxLineBytes) throws IOException {
this.maxLineBytes = maxLineBytes;
in = new RandomAccessFile(file, "r");
currentFilePos = file.length() - 1;
in.seek(currentFilePos);
if (in.readByte() == 0xA) {
currentFilePos--;
}
currentLine = new byte[maxLineBytes];
currentLine[0] = 0xA;
this.bufferSize = bufferSize;
buffer = new byte[bufferSize];
fillBuffer();
fillLineBuffer();
}
@Override
public int read() throws IOException {
if (currentFilePos <= 0 && currentBufferPos < 0 && currentLineReadPos < 0) {
return -1;
}
if (!lineBuffered) {
fillLineBuffer();
}
if (lineBuffered) {
if (currentLineReadPos == 0) {
lineBuffered = false;
}
return currentLine[currentLineReadPos--];
}
return 0;
}
private void fillBuffer() throws IOException {
if (currentFilePos < 0) {
return;
}
if (currentFilePos < bufferSize) {
in.seek(0);
in.read(buffer);
currentBufferPos = (int) currentFilePos;
currentFilePos = -1;
} else {
in.seek(currentFilePos);
in.read(buffer);
currentBufferPos = bufferSize - 1;
currentFilePos = currentFilePos - bufferSize;
}
}
private void fillLineBuffer() throws IOException {
currentLineWritePos = 1;
while (true) {
// we've read all the buffer - need to fill it again
if (currentBufferPos < 0) {
fillBuffer();
// nothing was buffered - we reached the beginning of a file
if (currentBufferPos < 0) {
currentLineReadPos = currentLineWritePos - 1;
lineBuffered = true;
return;
}
}
byte b = buffer[currentBufferPos--];
// \n is found - line fully buffered
if (b == 0xA) {
currentLineReadPos = currentLineWritePos - 1;
lineBuffered = true;
break;
// just ignore \r for now
} else if (b == 0xD) {
continue;
} else {
if (currentLineWritePos == maxLineBytes) {
throw new IOException("file has a line exceeding " + maxLineBytes
+ " bytes; use constructor to pickup bigger line buffer");
}
// write the current line bytes in reverse order - reading from
// the end will produce the correct line
currentLine[currentLineWritePos++] = b;
}
}
}}
@Test
public void readAndPrintInReverseOrder() throws IOException {
String path = "src/misctests/test.txt";
BufferedReader br = null;
try {
br = new BufferedReader(new FileReader(path));
Stack<String> lines = new Stack<String>();
String line = br.readLine();
while(line != null) {
lines.push(line);
line = br.readLine();
}
while(! lines.empty()) {
System.out.println(lines.pop());
}
} finally {
if(br != null) {
try {
br.close();
} catch(IOException e) {
// can't help it
}
}
}
}
Note that this code reads the whole file into memory and then starts printing it. This is the only way you can do it with a buffered reader or any other reader that does not support seeking. Keep this in mind: in your case you want to read a log file, and log files can be very big!
If you want to read line by line and print on the fly, then you have no alternative to using a reader that supports seeking, such as java.io.RandomAccessFile, and this is anything but trivial.
As far as I understand, you try to read backwards line by line.
Suppose this is the file you try to read:
line1
line2
line3
And you want to write it to the output stream of the servlet as follows:
line3
line2
line1
Following code might be helpful in this case:
List<String> tmp = new ArrayList<String>();
// Collect every line first; readLine() returns null at end of file.
while ((ch = br.readLine()) != null) {
tmp.add(ch);
}
// Then print the collected lines from last to first.
for(int i=tmp.size()-1;i>=0;i--) {
out.print(tmp.get(i)+"<br/>");
}
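For files small enough to fit in memory, the collect-then-reverse approach can also be written with java.nio.file.Files. This is only a sketch: the class name `ReverseSmallFile` is hypothetical, UTF-8 is an assumed encoding, and it is not suitable for very large logs because every line is held in memory at once.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReverseSmallFile {
    // Reads all lines into memory and returns them last-to-first.
    static List<String> linesInReverse(Path file) throws IOException {
        // Copy into an ArrayList: readAllLines does not promise a modifiable list.
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        Collections.reverse(lines);
        return lines;
    }
}
```

Each returned line can then be written with `out.print(line + "<br/>")` as in the snippet above.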
I had a problem with your solution @dpetruha because of this:
Does RandomAccessFile.read() from local file guarantee that exact number of bytes will be read?
Here is my solution: (changed only fillBuffer)
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
public class ReverseLineInputStream extends InputStream {
private static final int MAX_LINE_BYTES = 1024 * 1024;
private static final int DEFAULT_BUFFER_SIZE = 1024 * 1024;
private RandomAccessFile in;
private long currentFilePos;
private int bufferSize;
private byte[] buffer;
private int currentBufferPos;
private int maxLineBytes;
private byte[] currentLine;
private int currentLineWritePos = 0;
private int currentLineReadPos = 0;
private boolean lineBuffered = false;
public ReverseLineInputStream(File file) throws IOException {
this(file, DEFAULT_BUFFER_SIZE, MAX_LINE_BYTES);
}
public ReverseLineInputStream(File file, int bufferSize, int maxLineBytes) throws IOException {
this.maxLineBytes = maxLineBytes;
in = new RandomAccessFile(file, "r");
currentFilePos = file.length() - 1;
in.seek(currentFilePos);
if (in.readByte() == 0xA) {
currentFilePos--;
}
currentLine = new byte[maxLineBytes];
currentLine[0] = 0xA;
this.bufferSize = bufferSize;
buffer = new byte[bufferSize];
fillBuffer();
fillLineBuffer();
}
@Override
public int read() throws IOException {
if (currentFilePos <= 0 && currentBufferPos < 0 && currentLineReadPos < 0) {
return -1;
}
if (!lineBuffered) {
fillLineBuffer();
}
if (lineBuffered) {
if (currentLineReadPos == 0) {
lineBuffered = false;
}
return currentLine[currentLineReadPos--];
}
return 0;
}
private void fillBuffer() throws IOException {
if (currentFilePos < 0) {
return;
}
if (currentFilePos < bufferSize) {
in.seek(0);
buffer = new byte[(int) currentFilePos + 1];
in.readFully(buffer);
currentBufferPos = (int) currentFilePos;
currentFilePos = -1;
} else {
in.seek(currentFilePos - bufferSize + 1); // read the block that ends at currentFilePos, inclusive
in.readFully(buffer);
currentBufferPos = bufferSize - 1;
currentFilePos = currentFilePos - bufferSize;
}
}
private void fillLineBuffer() throws IOException {
currentLineWritePos = 1;
while (true) {
// we've read all the buffer - need to fill it again
if (currentBufferPos < 0) {
fillBuffer();
// nothing was buffered - we reached the beginning of a file
if (currentBufferPos < 0) {
currentLineReadPos = currentLineWritePos - 1;
lineBuffered = true;
return;
}
}
byte b = buffer[currentBufferPos--];
// \n is found - line fully buffered
if (b == 0xA) {
currentLineReadPos = currentLineWritePos - 1;
lineBuffered = true;
break;
// just ignore \r for now
} else if (b == 0xD) {
continue;
} else {
if (currentLineWritePos == maxLineBytes) {
throw new IOException("file has a line exceeding " + maxLineBytes
+ " bytes; use constructor to pickup bigger line buffer");
}
// write the current line bytes in reverse order - reading from
// the end will produce the correct line
currentLine[currentLineWritePos++] = b;
}
}
}
}

How to implement a universal file loader in Java?

This is what I'm trying to do:
public String load(String path) {
//...
}
load("file:/tmp/foo.txt"); // loads by absolute file name
load("classpath:bar.txt"); // loads from classpath
I think it's possible to do with JDK, but can't find out how exactly.
I can think of two approaches:
Just write plain Java code to extract the "scheme" from those URI-like strings, and then dispatch to the different code to load the file in different ways.
Register a custom URL stream handler to deal with the "classpath" case and then use URL.openStream() to open the stream to read the object.
The package documentation for java.net has some information about how stream handlers are discovered.
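The first approach (dispatching on the scheme in plain Java) could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name `UniversalLoader`, the choice of UTF-8, the use of the thread context class loader, and the requirement that "file:" locations are absolute file URIs such as file:/tmp/foo.txt.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class UniversalLoader {
    private static final String CLASSPATH_PREFIX = "classpath:";
    private static final String FILE_PREFIX = "file:";

    static String load(String location) throws IOException {
        if (location.startsWith(CLASSPATH_PREFIX)) {
            String name = location.substring(CLASSPATH_PREFIX.length());
            ClassLoader cl = Thread.currentThread().getContextClassLoader();
            if (cl == null) {
                cl = ClassLoader.getSystemClassLoader();
            }
            InputStream in = cl.getResourceAsStream(name);
            if (in == null) {
                throw new FileNotFoundException("No classpath resource: " + name);
            }
            try (InputStream stream = in) {
                return new String(readAll(stream), StandardCharsets.UTF_8);
            }
        }
        if (location.startsWith(FILE_PREFIX)) {
            // Assumes an absolute file URI; relative forms like "file:foo.txt" are rejected.
            byte[] bytes = Files.readAllBytes(Paths.get(URI.create(location)));
            return new String(bytes, StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("Unsupported scheme: " + location);
    }

    // Drains the stream into a byte array.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

The second approach (a custom URL stream handler) keeps callers on `URL.openStream()`, but needs JVM-wide registration, so the explicit dispatch is usually the smaller change.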
Here are the two methods you'll need, from my library omino roundabout. I need them everywhere. The resource reader is relative to a class, at least to know which jar to read, but the path can start with / to force it back to the top. Enjoy!
(You'll have to make your own top-level wrapper to look for "file:" and "classpath:".)
see also http://code.google.com/p/omino-roundabout/
public static String readFile(String filePath)
{
File f = new File(filePath);
if (!f.exists())
return null;
String result = "";
try
{
FileReader in = new FileReader(f);
boolean doing = true;
char[] bunch = new char[10000];
int soFar = 0;
while (doing)
{
int got = in.read(bunch, 0, bunch.length);
if (got <= 0)
doing = false;
else
{
String k = new String(bunch, 0, got);
result += k;
soFar += got;
}
}
} catch (Exception e)
{
return null;
}
// Strip off the UTF-8 front, if present. We hate this. EF BB BF
// see http://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker for example.
// Mysteriously, when I read those 3 chars, they come in as 212,170,248. Fine, empirically, I'll strip that, too.
if(result != null && result.length() >= 3)
{
int c0 = result.charAt(0);
int c1 = result.charAt(1);
int c2 = result.charAt(2);
boolean leadingBom = (c0 == 0xEF && c1 == 0xBB && c2 == 0xBF);
leadingBom |= (c0 == 212 && c1 == 170 && c2 == 248);
if(leadingBom)
result = result.substring(3);
}
// And because I'm a dictator, fix up the line feeds.
result = result.replaceAll("\\r\\n", "\n");
result = result.replaceAll("\\r","\n");
return result;
}
static public String readResource(Class<?> aClass,String srcResourcePath)
{
if(aClass == null || srcResourcePath==null || srcResourcePath.length() == 0)
return null;
StringBuffer resultB = new StringBuffer();
URL resourceURL = null;
try
{
resourceURL = aClass.getResource(srcResourcePath);
}
catch(Exception e) { /* leave result null */ }
if(resourceURL == null)
return null; // sorry.
try
{
InputStream is = resourceURL.openStream();
final int BLOCKSIZE = 13007;
byte[] bytes = new byte[BLOCKSIZE];
int bytesRead = 0;
while(bytesRead >= 0)
{
bytesRead = is.read(bytes);
if(bytesRead > 0)
{
char[] chars = new char[bytesRead];
for(int i = 0; i < bytesRead; i++)
chars[i] = (char)bytes[i];
resultB.append(chars);
}
}
}
catch(IOException e)
{
return null; // sorry
}
String result = resultB.toString();
return result;
}
(edit -- removed a stray reference to OmString, to keep it standalone here.)

Number of lines in a file in Java

I use huge data files, and sometimes I only need to know the number of lines in these files. Usually I open them up and read them line by line until I reach the end of the file.
I was wondering if there is a smarter way to do that.
This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, linux' wc -l command takes 0.15 seconds.
public static int countLinesOld(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
EDIT, 9 1/2 years later: I have practically no java experience, but anyways I have tried to benchmark this code against the LineNumberReader solution below since it bothered me that nobody did it. It seems that especially for large files my solution is faster. Although it seems to take a few runs until the optimizer does a decent job. I've played a bit with the code, and have produced a new version that is consistently fastest:
public static int countLinesNew(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int readChars = is.read(c);
if (readChars == -1) {
// bail out if nothing to read
return 0;
}
// make it easy for the optimizer to tune this loop
int count = 0;
while (readChars == 1024) {
for (int i=0; i<1024;) {
if (c[i++] == '\n') {
++count;
}
}
readChars = is.read(c);
}
// count remaining characters
while (readChars != -1) {
for (int i=0; i<readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
readChars = is.read(c);
}
return count == 0 ? 1 : count;
} finally {
is.close();
}
}
Benchmark results for a 1.3GB text file, y axis in seconds. I've performed 100 runs with the same file, and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none, and while it's only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.
I have implemented another solution to the problem, I found it more efficient in counting rows:
try
(
FileReader input = new FileReader("input.txt");
LineNumberReader count = new LineNumberReader(input);
)
{
while (count.skip(Long.MAX_VALUE) > 0)
{
// Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
}
result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
The accepted answer has an off by one error for multi line files which don't end in newline. A one line file ending without a newline would return 1, but a two line file ending without a newline would return 1 too. Here's an implementation of the accepted solution which fixes this. The endsWithoutNewLine checks are wasteful for everything but the final read, but should be trivial time wise compared to the overall function.
public int count(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean endsWithoutNewLine = false;
while ((readChars = is.read(c)) != -1) {
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n')
++count;
}
endsWithoutNewLine = (c[readChars - 1] != '\n');
}
if(endsWithoutNewLine) {
++count;
}
return count;
} finally {
is.close();
}
}
With java-8, you can use streams:
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
long numOfLines = lines.count();
...
}
The answer with the method count() above gave me line miscounts if a file didn't have a newline at the end of the file - it failed to count the last line in the file.
This method works better for me:
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
I tested the above methods for counting lines and here are my observations for Different methods as tested on my system
File Size : 1.6 Gb
Methods:
Using Scanner : 35s approx
Using BufferedReader : 5s approx
Using Java 8 : 5s approx
Using LineNumberReader : 5s approx
Moreover Java8 Approach seems quite handy :
Files.lines(Paths.get(filePath), Charset.defaultCharset()).count()
[Return type : long]
I know this is an old question, but the accepted solution didn't quite match what I needed it to do. So, I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):
public static long getLinesCount(String fileName, String encodingName) throws IOException {
long linesCount = 0;
File file = new File(fileName);
FileInputStream fileIn = new FileInputStream(file);
try {
Charset encoding = Charset.forName(encodingName);
Reader fileReader = new InputStreamReader(fileIn, encoding);
int bufferSize = 4096;
Reader reader = new BufferedReader(fileReader, bufferSize);
char[] buffer = new char[bufferSize];
int prevChar = -1;
int readCount = reader.read(buffer);
while (readCount != -1) {
for (int i = 0; i < readCount; i++) {
int nextChar = buffer[i];
switch (nextChar) {
case '\r': {
// The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
linesCount++;
break;
}
case '\n': {
if (prevChar == '\r') {
// The current line is terminated by a carriage return immediately followed by a line feed.
// The line has already been counted.
} else {
// The current line is terminated by a line feed.
linesCount++;
}
break;
}
}
prevChar = nextChar;
}
readCount = reader.read(buffer);
}
if (prevChar != -1) {
switch (prevChar) {
case '\r':
case '\n': {
// The last line is terminated by a line terminator.
// The last line has already been counted.
break;
}
default: {
// The last line is terminated by end-of-file.
linesCount++;
}
}
}
} finally {
fileIn.close();
}
return linesCount;
}
This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (Stream<String> lines = Files.lines(file.toPath())) {
return lines.count();
}
}
Tested on JDK8_u31. But indeed performance is slow compared to this method:
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
byte[] c = new byte[1024];
boolean empty = true,
lastEmpty = false;
long count = 0;
int read;
while ((read = is.read(c)) != -1) {
for (int i = 0; i < read; i++) {
if (c[i] == '\n') {
count++;
lastEmpty = true;
} else if (lastEmpty) {
lastEmpty = false;
}
}
empty = false;
}
if (!empty) {
if (count == 0) {
count = 1;
} else if (!lastEmpty) {
count++;
}
}
return count;
}
}
Tested and very fast.
A straightforward way using Scanner:
static void lineCounter (String path) throws IOException {
int lineCount = 0, commentsCount = 0;
Scanner input = new Scanner(new File(path));
while (input.hasNextLine()) {
String data = input.nextLine();
if (data.startsWith("//")) commentsCount++;
lineCount++;
}
System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
}
I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line doesn't end with a newline.
And @er.vikas's solution, based on LineNumberReader but adding one to the line count, returns non-intuitive results on files where the last line does end with a newline.
I therefore wrote an algorithm that behaves as follows:
@Test
public void empty() throws IOException {
assertEquals(0, count(""));
}
@Test
public void singleNewline() throws IOException {
assertEquals(1, count("\n"));
}
@Test
public void dataWithoutNewline() throws IOException {
assertEquals(1, count("one"));
}
@Test
public void oneCompleteLine() throws IOException {
assertEquals(1, count("one\n"));
}
@Test
public void twoCompleteLines() throws IOException {
assertEquals(2, count("one\ntwo\n"));
}
@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
assertEquals(2, count("one\ntwo"));
}
@Test
public void aFewLines() throws IOException {
assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}
And it looks like this:
static long countLines(InputStream is) throws IOException {
try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
char[] buf = new char[8192];
int n, previousN = -1;
// read() returns at least one char when data is available, so there is no need to fill the buffer
while((n = lnr.read(buf)) != -1) {
previousN = n;
}
int ln = lnr.getLineNumber();
if (previousN == -1) {
//No data read at all, i.e file was empty
return 0;
} else {
char lastChar = buf[previousN - 1];
if (lastChar == '\n' || lastChar == '\r') {
// Ends with a line terminator: the last line is already counted, so don't add one
return ln;
}
}
//normal case, return line number + 1
return ln + 1;
}
}
If you want intuitive results, you may use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but don't add one to the result, and retry the skip:
try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
while(lnr.skip(Long.MAX_VALUE) > 0){};
return lnr.getLineNumber();
}
How about using the Process class from within Java code? And then reading the output of the command.
Process p = Runtime.getRuntime().exec(new String[] {"wc", "-l", yourfilename});
BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line;
int lineCount = 0;
while ((line = b.readLine()) != null) {
System.out.println(line);
// wc prints "<count> <filename>", so parse only the first token.
lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}
p.waitFor();
Need to try it though. Will post the results.
It seems that there are a few different approaches you can take with LineNumberReader.
I did this:
int lines = 0;
FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);
// Read to the end; getLineNumber() then holds the number of lines read.
while (count.readLine() != null) {
lines = count.getLineNumber();
}
count.close();
System.out.println(lines);
Even more simply, you can use the BufferedReader lines() method (inherited by LineNumberReader) to get a stream of the lines, and then use the Stream count() method to count them. The count is already the number of lines in the text file, so there is no need to add one.
For example:
FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);
long lines = count.lines().count();
count.close();
System.out.println(lines);
This funny solution actually works really well!
public static int countLines(File input) throws IOException {
try (InputStream is = new FileInputStream(input)) {
int count = 1;
for (int aChar = 0; aChar != -1;aChar = is.read())
count += aChar == '\n' ? 1 : 0;
return count;
}
}
On Unix-based systems, use the wc command on the command-line.
The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data giving the average length of one line, then take the file size and divide it by that average, but that won't be accurate.
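As a rough illustration of that estimate, here is a minimal sketch. The class name LineCountEstimator and the method estimateLines are hypothetical, and it assumes you have already read a sample of the file into a byte array and that the sample's line lengths are representative of the whole file:

```java
final class LineCountEstimator {
    /**
     * Estimates the line count of a file from a sample of its bytes.
     * Hypothetical helper: it assumes the sample is representative.
     */
    static long estimateLines(byte[] sample, int sampleLength, long fileSize) {
        if (fileSize <= 0 || sampleLength <= 0) {
            return 0;
        }
        // Count the line feeds in the sample.
        long newlines = 0;
        for (int i = 0; i < sampleLength; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        if (newlines == 0) {
            // No terminator seen: all we can say is "at least one line".
            return 1;
        }
        // Average line length in bytes, terminator included.
        double avgLineLength = (double) sampleLength / newlines;
        return Math.round(fileSize / avgLineLength);
    }
}
```

The result is only an approximation; an exact count still requires reading the whole file.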
If you don't have any index structures, you can't avoid reading the complete file. But you can optimize it by not reading it line by line and instead using a regex to match all line terminators.
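A minimal sketch of the regex idea, assuming the text is already in memory as a CharSequence (for a file that would mean loading it first, which only suits files that fit in memory). The class name RegexLineCounter is hypothetical, and the rule that a trailing unterminated fragment counts as a line is my own choice:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLineCounter {
    // CRLF is listed first so it is matched as one terminator, not two.
    private static final Pattern TERMINATOR = Pattern.compile("\r\n|\r|\n");

    /** Counts lines; a trailing fragment without a terminator counts too. */
    static long countLines(CharSequence text) {
        Matcher matcher = TERMINATOR.matcher(text);
        long terminators = 0;
        int lastEnd = 0;
        while (matcher.find()) {
            terminators++;
            lastEnd = matcher.end();
        }
        // Characters after the last terminator form one more line.
        return lastEnd < text.length() ? terminators + 1 : terminators;
    }
}
```

Whether this is faster than a plain byte loop would need to be measured; the regex mainly buys flexibility in what counts as a terminator.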
Optimized code for multi-line files that have no newline ('\n') character at EOF:
/**
*
* @param filename
* @return
* @throws IOException
*/
public static int countLines(String filename) throws IOException {
int count = 0;
boolean empty = true;
FileInputStream fis = null;
InputStream is = null;
try {
fis = new FileInputStream(filename);
is = new BufferedInputStream(fis);
byte[] c = new byte[1024];
int readChars = 0;
boolean isLine = false;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if ( c[i] == '\n' ) {
isLine = false;
++count;
}else if(!isLine && c[i] != '\n' && c[i] != '\r'){ //Case to handle line count where no New Line character present at EOF
isLine = true;
}
}
}
if(isLine){
++count;
}
}catch(IOException e){
e.printStackTrace();
}finally {
if(is != null){
is.close();
}
if(fis != null){
fis.close();
}
}
LOG.info("count: "+count);
return (count == 0 && !empty) ? 1 : count;
}
Scanner with regex:
public int getLineCount() {
Scanner fileScanner = null;
int lineCount = 0;
Pattern lineEndPattern = Pattern.compile("(?m)$");
try {
fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
while (fileScanner.hasNext()) {
fileScanner.next();
++lineCount;
}
}catch(FileNotFoundException e) {
e.printStackTrace();
return lineCount;
}
fileScanner.close();
return lineCount;
}
Haven't clocked it.
If you use this:
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
you can't count files with more lines than an int can hold (Integer.MAX_VALUE, about 2.1 billion), because reader.getLineNumber() returns an int. You need a long to count more rows than that.
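One way to sidestep the int limit is to keep your own long counter instead of relying on getLineNumber(). A minimal sketch, where the class name LongLineCounter is hypothetical and lines are delimited the way BufferedReader.readLine() defines them (LF, CR, or CRLF):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class LongLineCounter {
    /** Counts lines with a long counter; closes the reader when done. */
    static long countLines(Reader source) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            // readLine() returns null only at end of stream.
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```

For a file you would pass something like new FileReader(filename) as the source.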
