Java: Read last n lines of a HUGE file

I want to read the last n lines of a very big file without reading the whole file into any buffer/memory area using Java.
I looked around the JDK APIs and Apache Commons I/O and am not able to locate one which is suitable for this purpose.
I was thinking of the way tail or less does it in UNIX. I don't think they load the entire file and then show the last few lines of the file. There should be a similar way to do the same in Java too.

I found that the simplest way is to use ReversedLinesFileReader from the Apache Commons IO API.
It gives you the lines from the bottom of the file to the top, and you can set n_lines to choose how many lines to read.
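import org.apache.commons.io.input.ReversedLinesFileReader;
import java.nio.charset.StandardCharsets;
File file = new File("D:\\file_name.xml");
int n_lines = 10;
int counter = 0;
// try-with-resources closes the reader; UTF-8 is an assumption, use the file's real charset
try (ReversedLinesFileReader object = new ReversedLinesFileReader(file, StandardCharsets.UTF_8)) {
String line;
// readLine() returns null once the start of the file is reached
while (counter < n_lines && (line = object.readLine()) != null) {
System.out.println(line);
counter++;
}
}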

If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.
If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth last line begins, you can seek to there and just read-and-print.
An initial best-guess assumption can be made based on your data properties. For example, if it's a text file, it's possible the line lengths won't exceed an average of 132 so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that - example: if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
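The guess-and-retry idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name TailByGuess, the method lastLines and the parameter guessedLineLength are made up for this example, lines are assumed to end in LF (optionally preceded by CR), and the file is assumed to be in an encoding such as ASCII or UTF-8 where the byte 0x0A always means LF. For simplicity the window is doubled on each retry rather than adjusted from the observed line lengths.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TailByGuess {
    // Hypothetical helper: returns up to n last lines of the file.
    public static List<String> lastLines(Path path, int n, int guessedLineLength) throws IOException {
        List<String> lines = new ArrayList<>();
        if (n <= 0) return lines;
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            long length = raf.length();
            if (length == 0) return lines;
            // First guess: n lines of the assumed average length
            long window = Math.max(1L, (long) n * guessedLineLength);
            while (true) {
                long start = Math.max(0L, length - window);
                byte[] buf = new byte[Math.toIntExact(length - start)];
                raf.seek(start);
                raf.readFully(buf);
                int end = buf.length;
                // A newline at the very end terminates the last line; it starts no new one
                if (buf[end - 1] == '\n') end--;
                int from = -1;
                int found = 0;
                for (int i = end - 1; i >= 0; i--) {
                    if (buf[i] == '\n' && ++found == n) {
                        from = i + 1; // first byte of the n-th last line
                        break;
                    }
                }
                if (from < 0 && start > 0) {
                    window *= 2; // not enough lines in the window: back up and retry
                    continue;
                }
                if (from < 0) from = 0; // whole file has fewer than n lines
                String text = new String(buf, from, end - from, StandardCharsets.UTF_8);
                lines.addAll(Arrays.asList(text.split("\r?\n", -1)));
                return lines;
            }
        }
    }
}
```

Only the final window is held in memory, so the cost depends on the size of the last n lines rather than on the size of the file.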

RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.
If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work in any circumstances. (It reads a string that is preceded by a two-byte length and encoded in modified UTF-8 ...)
Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed length encodings (e.g. flavors of UTF-16 or UTF-32) you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.
In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a second / third byte, or an illegal UTF-8 sequence. See The Unicode Standard, Version 5.2, Chapter 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream will represent a LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8) if we can assume that the other kinds of Unicode line separator (U+2028, U+2029 and U+0085) are not used. If you can't assume that, the code will be more complicated.
Having identified a proper character boundary, you can then just call new String(...) passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
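As a rough illustration of the boundary check and the lastIndexOf counting described above, here is a small sketch. The class and method names (Utf8Boundary, isLeadByte, alignToCharacter, countLineFeeds) are invented for this example, and it assumes the byte array already holds a chunk read from the end of a UTF-8 file.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // True when b is not a 10xxxxxx continuation byte, i.e. it can start a character.
    // Note: this also accepts bytes that are illegal in UTF-8, which is fine for a sketch.
    public static boolean isLeadByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // Moves offset forward until it points at a byte that starts a character.
    public static int alignToCharacter(byte[] buf, int offset) {
        while (offset < buf.length && !isLeadByte(buf[offset])) {
            offset++;
        }
        return offset;
    }

    // Decodes the chunk from a proper character boundary and counts LF characters
    // by calling lastIndexOf repeatedly, walking from the end towards the start.
    public static int countLineFeeds(byte[] buf, int offset) {
        int start = alignToCharacter(buf, offset);
        String text = new String(buf, start, buf.length - start, StandardCharsets.UTF_8);
        int count = 0;
        for (int i = text.lastIndexOf('\n'); i >= 0; i = text.lastIndexOf('\n', i - 1)) {
            count++;
        }
        return count;
    }
}
```

If the count is smaller than the number of lines wanted, the caller would read a larger chunk from further back and repeat.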

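The ReversedLinesFileReader can be found in the Apache Commons IO Java library.
int n_lines = 1000;
StringBuilder result = new StringBuilder();
// try-with-resources closes the reader when done
try (ReversedLinesFileReader object = new ReversedLinesFileReader(new File(path))) {
for (int i = 0; i < n_lines; i++) {
String line = object.readLine();
if (line == null)
break;
// Lines arrive last-to-first; keep a separator so they do not run together
result.append(line).append(System.lineSeparator());
}
}
return result.toString();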

I found RandomAccessFile and the buffered reader classes too slow for me. Nothing I tried was faster than tail -n <#lines>, so this was the best solution for me. It only works where a tail command is available.
public String getLastNLogLines(File file, int nLines) {
StringBuilder s = new StringBuilder();
// Passing the arguments separately keeps paths that contain spaces intact
ProcessBuilder pb = new ProcessBuilder("tail", "-n", String.valueOf(nLines), file.getPath());
try {
Process p = pb.start();
try (java.io.BufferedReader input = new java.io.BufferedReader(new java.io.InputStreamReader(p.getInputStream()))) {
String line;
// Read the next line into the variable line, then check for
// the end of the output, which readLine() signals with null
while ((line = input.readLine()) != null) {
s.append(line).append('\n');
}
}
} catch (java.io.IOException e) {
e.printStackTrace();
}
return s.toString();
}

CircularFifoBuffer from Apache Commons Collections is another option; see the answer to a similar question, "How to read last 5 lines of a .txt file into java".
Note that in Apache Commons Collections 4 this class seems to have been renamed to CircularFifoQueue.
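For readers without Commons Collections, the same circular-buffer idea can be sketched with java.util.ArrayDeque from the JDK. This is a stand-in and not the CircularFifoBuffer API; the class name RingBufferTail and the method lastLines are made up here, and UTF-8 is assumed. Note that this approach still reads the whole file once; it only bounds the memory used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class RingBufferTail {
    // Hypothetical helper: keeps only the most recent n lines while streaming the file.
    public static List<String> lastLines(Path path, int n) throws IOException {
        if (n <= 0) return new ArrayList<>();
        ArrayDeque<String> window = new ArrayDeque<>(n);
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (window.size() == n) {
                    window.removeFirst(); // drop the oldest line once the window is full
                }
                window.addLast(line);
            }
        }
        return new ArrayList<>(window); // oldest retained line first
    }
}
```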

package com.uday;
import java.io.File;
import java.io.RandomAccessFile;
public class TailN {
public static void main(String[] args) throws Exception {
long startTime = System.currentTimeMillis();
TailN tailN = new TailN();
File file = new File("/Users/udakkuma/Documents/workspace/uday_cancel_feature/TestOOPS/src/file.txt");
tailN.readFromLast(file);
System.out.println("Execution Time : " + (System.currentTimeMillis() - startTime));
}
public void readFromLast(File file) throws Exception {
int lines = 3;
int readLines = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
long lastIndex = randomAccessFile.length() - 1;
// Walk backwards from the last byte; an empty file simply skips the loop
for (long pointer = lastIndex; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
// Read one byte at a time; the cast is only correct for single-byte encodings
char c = (char) randomAccessFile.read();
if (c == '\n') {
if (pointer == lastIndex)
continue; // skip the newline that terminates the final line
// Any other newline marks the start of a line we have just read
readLines++;
if (readLines == lines)
break;
}
builder.append(c);
}
// Since line is read from the last so it is in reverse order. Use reverse
// method to make it correct order
builder.reverse();
System.out.println(builder.toString());
}
}
}

A RandomAccessFile allows for seeking (http://download.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html). The File.length method will return the size of the file. The problem is determining number of lines. For this, you can seek to the end of the file and read backwards until you have hit the right number of lines.

I had a similar problem, but I did not understand the other solutions.
I used this instead. I hope the code is simple enough.
// String filePathName = (directory and file name).
File f = new File(filePathName);
long fileLength = f.length(); // Size of the file [bytes].
long fileLength_toRead = 0;
if (fileLength > 2000) {
// My file content is a table, and I know one row has about 100 bytes / characters.
// I start reading 1000 bytes before the end of the file.
// If you don't know the line length, use @paxdiablo's advice.
fileLength_toRead = fileLength - 1000;
}
try (RandomAccessFile raf = new RandomAccessFile(filePathName, "r")) { // Opens the file and closes it automatically.
raf.seek(fileLength_toRead); // Reading starts at this byte.
String rowInFile = raf.readLine();
if (fileLength_toRead > 0) {
// After seeking into the middle, the first line read is usually incomplete, so skip it.
rowInFile = raf.readLine();
}
while (rowInFile != null) {
// Here the lines read (rowInFile) can be added to a String[] array or an ArrayList<String>.
// Later I can work with the rows from the array; the last row is sometimes empty, etc.
rowInFile = raf.readLine();
}
}
catch (IOException e) {
e.printStackTrace();
}

Here is a working version.
private static void printLastNLines(String filePath, int n) {
StringBuilder builder = new StringBuilder();
// try-with-resources closes the file even when reading fails
try (RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r")) {
long last = randomAccessFile.length() - 1;
for (long i = last; i >= 0; i--) {
randomAccessFile.seek(i);
// The cast is only correct for single-byte encodings
char c = (char) randomAccessFile.read();
if (c == '\n') {
if (i == last) {
continue; // skip the newline that terminates the final line
}
n--;
if (n == 0) {
break;
}
}
builder.append(c);
}
builder.reverse();
System.out.println(builder.toString());
} catch (IOException e) {
// FileNotFoundException is a subclass of IOException
e.printStackTrace();
}
}

Here is the best way I've found to do it. It is simple and memory efficient, although it still reads the whole file sequentially.
public static void tail(File src, OutputStream out, int maxLines) throws IOException {
if (maxLines <= 0) {
return;
}
// Ring buffer holding the most recent maxLines lines
String[] lines = new String[maxLines];
long total = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(src))) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
lines[(int) (total % maxLines)] = line;
total++;
}
}
OutputStreamWriter writer = new OutputStreamWriter(out);
// Start at the oldest retained line and write forward in file order
for (long i = Math.max(0, total - maxLines); i < total; i++) {
writer.write(lines[(int) (i % maxLines)]);
writer.write("\n");
}
writer.flush();
}

(See comment)
public String readFromLast(File file, int howMany) throws IOException {
int numLinesRead = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
long lastIndex = randomAccessFile.length() - 1;
// Walk backwards from the last byte; an empty file skips the loop entirely
for (long pointer = lastIndex; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
byte b = (byte) randomAccessFile.read();
if (b == '\n') {
numLinesRead++;
// (Last line often terminated with a line separator)
if (numLinesRead == (howMany + 1))
break;
}
baos.write(b);
}
/*
* Since line is read from the last so it is in reverse order. Use reverse
* method to make it ordered correctly
*/
byte[] a = baos.toByteArray();
int start = 0;
int mid = a.length / 2;
int end = a.length - 1;
while (start < mid) {
byte temp = a[end];
a[end] = a[start];
a[start] = temp;
start++;
end--;
}// End while
return new String(a).trim();
} // End inner try-with-resources
} // End outer try-with-resources
} // End method

I tried RandomAccessFile first, and it was tedious to read the file backwards, repositioning the file pointer upon every read operation. So I tried @Luca's solution and got the last few lines of the file as a string in just two lines of code, within a few minutes.
InputStream inputStream = Runtime.getRuntime().exec(new String[] {"tail", path.toString()}).getInputStream();
String tail = new BufferedReader(new InputStreamReader(inputStream)).lines().collect(Collectors.joining(System.lineSeparator()));

Code is 2 lines only
// Please specify correct Charset
ReversedLinesFileReader rlf = new ReversedLinesFileReader(file, StandardCharsets.UTF_8);
// read last 2 lines
System.out.println(rlf.toString(2));
Gradle:
implementation group: 'commons-io', name: 'commons-io', version: '2.11.0'
Maven:
<dependency>
<groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>2.11.0</version>
</dependency>

Related

Read a specific line from file in Java

I really do not want to do a duplicate question, but none of the answers on SO were implementable in my problem.
The answer in this question:
How to read a file from a certain offset in Java?
uses RandomAccessFile, but the implementations I found need all the file lines to have the same length.
How can I get List<String> lines = readLinesFromLine(file);?
I tried
private static String readRandomAccessFile(String filepath, int lineStart, int lineEnd, int charsPerLine, String delimiter) {
File file = new File(filepath);
List<String> dialogLineRead = new ArrayList<>();
// Every line is assumed to occupy charsPerLine bytes plus two bytes (presumably a CRLF line ending)
int bytesPerLine = charsPerLine+2;
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
for (int i = lineStart; i <lineEnd ; i++) {
// Cast to long so the offset cannot overflow on files larger than 2 GB
randomAccessFile.seek((long) bytesPerLine * i);
dialogLineRead.add(randomAccessFile.readLine());
}
}catch (Exception e){
e.printStackTrace();
}
// Join the lines, appending the delimiter after each one
StringBuilder returnData = new StringBuilder();
for (String line : dialogLineRead) {
returnData.append(line).append(delimiter);
}
return returnData.toString();
}
But like I said, charsPerLine has to be the same for each line.
I tried to count the chars of each line in the file and store them in a list, but with a log file of 2 GB, that takes too much RAM.
Any ideas?
For a standard text file where you don't know the line lengths in advance, there's really no way around reading the whole thing line by line, like in this answer, for example.
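A minimal sketch of that line-by-line approach, which streams the file and keeps only the requested range in memory. The class name LineRange and the method readLines are invented for this example; line numbers are assumed to be zero-based with an exclusive end, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LineRange {
    // Hypothetical helper: returns lines fromLine (inclusive) to toLine (exclusive).
    public static List<String> readLines(Path path, long fromLine, long toLine) throws IOException {
        // Files.lines reads lazily, so lines outside the range are not retained
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.skip(fromLine)
                    .limit(Math.max(0, toLine - fromLine))
                    .collect(Collectors.toList());
        }
    }
}
```

The skipped lines still have to be read and decoded, so the time grows with the starting line number, but memory use does not grow with the file size.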

Java - get line from Random access file based on offsets

I have a very large (11GB) .json file (yeah, whoever thought that a great idea?) that I need to sample (read k random lines).
I'm not very savvy in Java file IO but I have, of course, found this post:
How to get a random line of a text file in Java?
I'm dropping the accepted answer because it's clearly way too slow to read every single line of an 11GB file just to select one (or rather k) out of the about 100k lines.
Fortunately, there is a second suggestion posted there that I think might be of better use to me:
1. Use RandomAccessFile to seek to a random byte position in the file.
2. Seek left and right to the next line terminator. Let L be the line between them.
3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
So far so good, but I was wondering about that "let L be the line between them".
I would have done something like this (untested):
RandomAccessFile raf = ...
long pos = ...
String line = getLine(raf,pos);
...
where
private String getLine(RandomAccessFile raf, long start) throws IOException{
long pos = (start % 2 == 0) ? start : start -1;
if(pos == 0) return raf.readLine();
do{
pos -= 2;
raf.seek(pos);
}while(pos > 0 && raf.readChar() != '\n');
pos = (pos <= 0) ? 0 : pos + 2;
raf.seek(pos);
return raf.readLine();
}
and then operated with line.length(), which forgoes the need to explicitly seek the right end of the line.
So why "seek left and right to the next line terminator"?
Is there a more convenient way to get the line from these two offsets?
It looks like this would do approximately the same - raf.readLine() is seeking right to the next line terminator; it's just doing it for you.
One thing to note is that RandomAccessFile.readLine() doesn't support reading unicode strings from the file:
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
Demo of the incorrect reading:
import java.io.*;
import java.nio.charset.StandardCharsets;
class Demo {
public static void main(String[] args) throws IOException {
try (FileOutputStream fos = new FileOutputStream("output.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
BufferedWriter writer = new BufferedWriter(osw)) {
writer.write("ⵉⵎⴰⵣⵉⵖⵏ");
}
try (RandomAccessFile raf = new RandomAccessFile("output.txt", "r")) {
System.out.println(raf.readLine());
}
}
}
Output:
âµâµâ´°âµ£âµâµâµ
But output.txt does contain the correct data:
$ cat output.txt
ⵉⵎⴰⵣⵉⵖⵏ
As such, you might want to do the seeking yourself, or explicitly convert the result of raf.readLine() to the correct charset:
String line = new String(
raf.readLine().getBytes(StandardCharsets.ISO_8859_1),
StandardCharsets.UTF_8);
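Putting the two points together, a byte-oriented variant of getLine might look like the sketch below. It is an illustration only: the class name LineAtOffset and the method lineAt are made up, the file is assumed to be UTF-8 with LF line terminators, and it does one seek per byte while searching left, which is simple but slow for long lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class LineAtOffset {
    // Hypothetical helper: returns the line that contains byte offset pos, or null at end of file.
    public static String lineAt(RandomAccessFile raf, long pos) throws IOException {
        long start = pos;
        // Seek left: stop when the previous byte is LF or the start of the file is reached
        while (start > 0) {
            raf.seek(start - 1);
            if (raf.read() == '\n') {
                break;
            }
            start--;
        }
        raf.seek(start);
        // readLine() seeks right to the next line terminator for us
        String raw = raf.readLine();
        if (raw == null) {
            return null;
        }
        // readLine() maps each byte to one char, so recover the bytes and decode them as UTF-8
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

Searching for the byte 0x0A is safe in UTF-8 because that byte never occurs inside a multi-byte character.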

Does RandomAccessFile in java read entire file in memory?

I need to read last n lines from a large file (say 2GB). The file is UTF-8 encoded.
I would like to know the most efficient way of doing it. I read about RandomAccessFile in Java, but does the seek() method read the entire file into memory? It uses a native implementation, so I wasn't able to refer to the source code.
RandomAccessFile.seek just sets the current position of the file pointer; no bytes are read into memory.
Since your file is UTF-8 encoded, it is a text file. For reading text files we typically use BufferedReader; Java 7 even added a convenience method, Files.newBufferedReader, to create a BufferedReader that reads text from a file. It may be inefficient for reading the last n lines, but it is easy to implement.
To be efficient, we need RandomAccessFile and must read the file backwards, starting from the end. Here is a basic example:
public static void main(String[] args) throws Exception {
int n = 3;
List<String> lines = new ArrayList<>();
try (RandomAccessFile f = new RandomAccessFile("test", "r")) {
ByteArrayOutputStream bout = new ByteArrayOutputStream();
// p >= 0 so that the first byte of the file is read as well
for (long length = f.length(), p = length - 1; p >= 0 && lines.size() < n; p--) {
f.seek(p);
int b = f.read();
if (b == 10) {
if (p < length - 1) {
lines.add(0, getLine(bout));
bout.reset();
}
} else if (b != 13) {
bout.write(b);
}
}
// Reaching the start of the file ends the first line without an LF before it
if (lines.size() < n && bout.size() > 0) {
lines.add(0, getLine(bout));
}
}
System.out.println(lines);
}
static String getLine(ByteArrayOutputStream bout) {
byte[] a = bout.toByteArray();
// reverse bytes
for (int i = 0, j = a.length - 1; j > i; i++, j--) {
byte tmp = a[j];
a[j] = a[i];
a[i] = tmp;
}
return new String(a);
}
It reads the file byte after byte starting from tail to ByteArrayOutputStream, when LF is reached it reverses the bytes and creates a line.
Two things need to be improved:
buffering
EOL recognition
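One possible way to address both points is sketched below: it reads the file backwards in fixed-size blocks instead of byte by byte, finds the byte offset where the n-th last line starts, and then decodes that tail once as UTF-8, accepting both LF and CRLF. The class name BufferedTail, the method lastLines and the 8 KB block size are assumptions made for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferedTail {
    private static final int BLOCK = 8192; // arbitrary block size for this sketch

    public static List<String> lastLines(Path path, int n) throws IOException {
        if (n <= 0) return new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(path.toFile(), "r")) {
            long length = f.length();
            long startOfTail = 0; // stays 0 when the file has fewer than n lines
            int seen = 0;
            byte[] block = new byte[BLOCK];
            long pos = length;
            outer:
            while (pos > 0) {
                int size = (int) Math.min(BLOCK, pos);
                pos -= size;
                f.seek(pos);
                f.readFully(block, 0, size);
                for (int i = size - 1; i >= 0; i--) {
                    long offset = pos + i;
                    // An LF that is the last byte of the file only terminates the final line
                    if (block[i] == '\n' && offset != length - 1 && ++seen == n) {
                        startOfTail = offset + 1;
                        break outer;
                    }
                }
            }
            byte[] tail = new byte[Math.toIntExact(length - startOfTail)];
            f.seek(startOfTail);
            f.readFully(tail);
            String text = new String(tail, StandardCharsets.UTF_8);
            if (text.isEmpty()) return new ArrayList<>();
            // Drop the final terminator, then split on LF or CRLF
            text = text.replaceFirst("\r?\n\\z", "");
            return new ArrayList<>(Arrays.asList(text.split("\r?\n", -1)));
        }
    }
}
```

Because 0x0A never appears inside a multi-byte UTF-8 sequence, the offset found this way is always a character boundary.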
If you need random access, you need RandomAccessFile. You can convert the bytes you get from it into UTF-8 text if you know what you are doing.
If you use BufferedReader, you can call skip(n) with a number of characters, but that means it has to read the whole file.
A way to combine the two is to use FileInputStream with skip(), find where you want to read from by reading back N newlines, and then wrap the stream in a BufferedReader to read the lines with UTF-8 encoding.

Binary Search using Java on a UTF-8 encoded text file where line size is not fixed

I have a tab separated UTF-8 file, where the records are sorted on one field. But, the line size is not fixed, so cannot jump into a particular position directly. How can I perform binary search on this?
Example:
line 1: Alfred Brendel /m/011hww /m/0crsgs6,/m/0crvt9h,/m/0cs5n_1,/m/0crtj4t,/m/0crwpnw,/m/0cr_n2s,/m/0crsgyh
line 2: Rupert Sheldrake /m/011ybj /m/0crtszs
You know the number of bytes your whole file contains. Let's say n
-> search-interval [l, r] with l=0, r=n.
Estimate the middle of your search-interval m=l+(r-l)/2. At this location go as many bytes to the left as needed (right would also work) until you find a tab-character (byte==9 (9 is the ASCII and UTF-8 code for a tab)) [let's name this position mReal] and decode the one line starting at that tab.
Determine if you have to take the first 'half' (=> new search-interval is [l, mReal]) or the second 'half' (=> new search-interval is [mReal, r]) for the next search step.
public class YourTokenizer {
public static final String EPF_EOL = "\t";
public static final int READ_SIZE = 4 * 1024 ;
/** The EPF stream buffer. */
private StringBuilder buffer = new StringBuilder();
/** The EPF stream. */
private InputStream stream = null;
public YourTokenizer(final InputStream stream) {
this.stream = stream;
}
public String getNextLine() throws IOException {
int pos = buffer.indexOf(EPF_EOL);
while (pos == -1) {
// end-of-line sequence isn't available yet, read more of the file
final byte[] bytes = new byte[READ_SIZE];
final int readSize = stream.read(bytes, 0, READ_SIZE);
if (readSize == -1) {
// end of the stream: return whatever is left, or null when nothing remains
if (buffer.length() == 0) {
return null;
}
final String rest = buffer.toString();
buffer.setLength(0);
return rest;
}
// append only the bytes that were actually read
buffer.append(new String(bytes, 0, readSize));
pos = buffer.indexOf(EPF_EOL);
}
final String data = buffer.substring(0, pos);
pos += EPF_EOL.length();
buffer = buffer.delete(0, pos);
return data;
}
}
and in main:
final InputStream stream = new FileInputStream(file);
final YourTokenizer tokenizer = new YourTokenizer(stream);
String line = tokenizer.getNextLine();
while(line != null) {
//do something
line = tokenizer.getNextLine();
}
You can jump to the middle of bytes. From there you can find the end of that line and you can read the next line from that point. If you need to search back, take a one quarter point, or three quarters and find the line each time. Eventually you will narrow it down to one line.
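The jump-to-the-middle idea can be sketched as a binary search over byte offsets. The sketch below rests on several assumptions that are not stated in the question: records are separated by LF, the sort key is the first tab-separated field, and the file is sorted in the order given by String.compareTo. The class name SortedFileSearch and the method find are made up for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class SortedFileSearch {
    // Hypothetical helper: returns the record whose first field equals key, or null.
    public static String find(Path path, String key) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            long lo = 0;            // always the first byte of a line
            long hi = raf.length(); // exclusive upper bound
            while (lo < hi) {
                long mid = lo + (hi - lo) / 2;
                // Move left from mid to the start of the line that contains it
                long lineStart = mid;
                while (lineStart > lo) {
                    raf.seek(lineStart - 1);
                    if (raf.read() == '\n') break;
                    lineStart--;
                }
                raf.seek(lineStart);
                String raw = raf.readLine();
                long next = raf.getFilePointer(); // first byte after this line
                // readLine() yields one char per byte; re-decode as UTF-8
                String line = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
                int tab = line.indexOf('\t');
                String lineKey = tab >= 0 ? line.substring(0, tab) : line;
                int cmp = lineKey.compareTo(key);
                if (cmp == 0) return line;
                if (cmp < 0) {
                    lo = next;      // key can only be in a later line
                } else {
                    hi = lineStart; // key can only be in an earlier line
                }
            }
            return null;
        }
    }
}
```

Each step discards at least the line it inspected, so the interval shrinks until the record is found or the interval is empty.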
I think you can guess the line length from the file size.
When you can't even guess the length of the lines, I think it is better to choose the position by generating a random number.

Read a definite number of lines in a text file, using java

I have a text file with data. The file has information from all months. Imagine that the information for January occupies 50 lines. Then February starts and occupies 40 more lines. Then I have March, and so on... Is it possible to read only part of the file? Can I say "read from line X to line Y"? Or is there a better way to accomplish this? I only want to print the data corresponding to one month, not the whole file. Here is my code:
public static void readFile()
{
try
{
DataInputStream inputStream =
new DataInputStream(new FileInputStream("SpreadsheetDatabase2013.txt"));
while(inputStream.available() != 0)
{
System.out.println("AVAILABLE: " + inputStream.available());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readInt());
for (int i = 0; i < 40; i++)
{
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readDouble());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readBoolean());
System.out.println();
}
}// end while
inputStream.close();
}// end try
catch (Exception e)
{
System.out.println("An error has occurred.");
}//end catch
}//end method
Thank you for your time.
My approach to this would be to read the entire contents of the text file and store it in a ArrayList and read only the lines for the requested month.
Example:
Use this function to read the all the lines from the file.
/**
* Read from a file specified by the filePath.
*
* @param filePath
* The path of the file.
* @return List of lines in the file.
* @throws IOException
*/
public static ArrayList<String> readFromFile(String filePath)
throws IOException {
ArrayList<String> temp = new ArrayList<String>();
File file = new File(filePath);
if (file.exists()) {
BufferedReader brin;
brin = new BufferedReader(new FileReader(filePath));
String line = brin.readLine();
while (line != null) {
if (!line.equals(""))
temp.add(line);
line = brin.readLine();
}
brin.close();
}
return temp;
}
Then read only the ones you need from ArrayList temp.
Example:
If you want to read February's data, assuming it is 50 lines of data and starts at the 40th line:
for(int i=40;i<90;i++)
{
System.out.println(temp.get(i));
}
Note: This is only just one way of doing this. I am not certain if there is any other way!
I would use the scanner class.
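Scanner scanner = new Scanner(new File(filename)); // filename is assumed to be a String path; new Scanner(String) would scan the text itself
Use scanner.nextLine() to get each of the lines of the file. If you only want the lines from line x to line y, you can use a for loop to scan past the lines you don't need before going through the scanner for the lines you do need. Be careful to handle or declare the exceptions this can throw (for example FileNotFoundException) rather than silently swallowing them.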
Or you can go through the scanner and for each line, add the String contents of the line to an ArrayList. Good luck.
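A small sketch of the Scanner approach described above. The class name ScannerRange and the method readLines are made up for this example; line numbers are assumed to be one-based and inclusive, matching "read from line X to line Y", and the file is assumed to be UTF-8.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerRange {
    // Hypothetical helper: returns lines firstLine..lastLine (one-based, inclusive).
    public static List<String> readLines(File file, int firstLine, int lastLine) throws FileNotFoundException {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            for (int lineNo = 1; lineNo <= lastLine && scanner.hasNextLine(); lineNo++) {
                String line = scanner.nextLine();
                // Lines before firstLine are read and thrown away
                if (lineNo >= firstLine) {
                    result.add(line);
                }
            }
        }
        return result;
    }
}
```

For the example in the question, January's data would be readLines(file, 1, 50), provided the month really occupies exactly those lines.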
Based on how you said your data was organized, I would suggest doing something like this
ArrayList<String> temp = new ArrayList<String>();
int read = 0;
File file = new File(filePath);
if (file.exists()) {
BufferedReader brin;
brin = new BufferedReader(new FileReader(filePath));
String line = brin.readLine();
while (line != null) {
if (!line.equals("")){
if(line.equals("March"))
read = 1;
else if(line.equals("April"))
break;
else if(read == 1)
temp.add(line);
}
line = brin.readLine();
}
brin.close();
}
Just tried it myself, that'll take in all the data between March and April. You can adjust them as necessary or make them variables. Thanks to ngoa for the foundation code. Credit where credit is due
If you have Java 7, you can use Files.readAllLines(Path path, Charset cs), e.g.
Path path = Paths.get("SpreadsheetDatabase2013.txt");
Charset charset = StandardCharsets.UTF_8; // or whatever charset is used
List<String> allLines = Files.readAllLines(path, charset);
List<String> relevantLines = allLines.subList(x, y);
Where x (inclusive) and y (exclusive) indicates the line numbers that are of interest, see List.subList(int fromIndex, int toIndex).
One benefit of this solution, as stated in the JavaDoc of readAllLines():
This method ensures that the file is closed when all bytes have been read or an I/O error, or other runtime exception, is thrown.
