Java - get line from Random access file based on offsets - java

I have a very large (11GB) .json file (yeah, whoever thought that was a great idea?) that I need to sample (read k random lines).
I'm not very savvy in Java file IO but I have, of course, found this post:
How to get a random line of a text file in Java?
I'm dropping the accepted answer because it's clearly way too slow to read every single line of an 11GB file just to select one (or rather k) out of the about 100k lines.
Fortunately, there is a second suggestion posted there that I think might be of better use to me:
Use RandomAccessFile to seek to a random byte position in the file.
Seek left and right to the next line terminator. Let L be the line between them.
With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
So far so good, but I was wondering about that "let L be the line between them".
I would have done something like this (untested):
RandomAccessFile raf = ...
long pos = ...
String line = getLine(raf,pos);
...
where
private String getLine(RandomAccessFile raf, long start) throws IOException{
long pos = (start % 2 == 0) ? start : start -1;
if(pos == 0) return raf.readLine();
do{
pos -= 2;
raf.seek(pos);
}while(pos > 0 && raf.readChar() != '\n');
pos = (pos <= 0) ? 0 : pos + 2;
raf.seek(pos);
return raf.readLine();
}
and then operated with line.length(), which forgoes the need to explicitly seek the right end of the line.
So why "seek left and right to the next line terminator"?
Is there a more convenient way to get the line from these two offsets?

It looks like this would do approximately the same thing: raf.readLine() seeks right to the next line terminator; it's just doing it for you.
One thing to note is that RandomAccessFile.readLine() doesn't support reading Unicode strings from the file:
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
Demo of the incorrect reading:
import java.io.*;
import java.nio.charset.StandardCharsets;
class Demo {
public static void main(String[] args) throws IOException {
try (FileOutputStream fos = new FileOutputStream("output.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
BufferedWriter writer = new BufferedWriter(osw)) {
writer.write("ⵉⵎⴰⵣⵉⵖⵏ");
}
try (RandomAccessFile raf = new RandomAccessFile("output.txt", "r")) {
System.out.println(raf.readLine());
}
}
}
Output:
âµâµâ´°âµ£âµâµâµ
But output.txt does contain the correct data:
$ cat output.txt
ⵉⵎⴰⵣⵉⵖⵏ
As such, you might want to do the seeking yourself, or explicitly convert the result of raf.readLine() to the correct charset:
String line = new String(
raf.readLine().getBytes(StandardCharsets.ISO_8859_1),
StandardCharsets.UTF_8);
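As a rough illustration of "doing the seeking yourself", here is a minimal sketch that reads raw bytes around a given offset and decodes them as UTF-8 in one go. The class and method names (LineAtOffset, lineAt) are made up for this example, and it assumes the file is UTF-8 with "\n" or "\r\n" terminators. It reads one byte at a time for clarity, not for speed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class LineAtOffset {
    /** Returns the line containing byte offset pos (0 <= pos < file length), decoded as UTF-8. */
    static String lineAt(RandomAccessFile raf, long pos) throws IOException {
        // Seek left: move back until the previous byte is '\n' or the file start is reached.
        long start = pos;
        while (start > 0) {
            raf.seek(start - 1);
            if (raf.read() == '\n') {
                break;
            }
            start--;
        }
        // Seek right: collect the raw bytes up to the next '\n' or the end of the file.
        raf.seek(start);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int b = raf.read(); b != -1 && b != '\n'; b = raf.read()) {
            bytes.write(b);
        }
        byte[] raw = bytes.toByteArray();
        int len = raw.length;
        if (len > 0 && raw[len - 1] == '\r') {
            len--; // drop the '\r' of a "\r\n" terminator
        }
        // Decode the whole line at once so multi-byte characters stay intact.
        return new String(raw, 0, len, StandardCharsets.UTF_8);
    }
}
```

Calling lineAt(raf, (long) (Math.random() * raf.length())) would pick the line under a random byte. Long lines are hit more often that way, which is what the rejection step of the recipe quoted in the question compensates for.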

Related

Removing n lines in a file [duplicate]

I'm trying to delete a line of text from a text file without copying to a temporary file. I am trying to do this by using a PrintWriter and a Scanner and having them traverse the file at the same time, the writer writing what the Scanner reads and overwriting each line with the same thing, until it gets to the line that I wish to delete. Then, I advance the Scanner but not the writer, and continue as before. Here is the code:
But first, the parameters: My file names are numbers, so this would read 1.txt or 2.txt, etc, and so f specifies the file name. I convert it to a String in the constructor for a file. Int n is the index of the line that I want to delete.
public void deleteLine(int f, int n){
try{
Scanner reader = new Scanner(new File(f+".txt"));
PrintWriter writer = new PrintWriter(new FileWriter(new File(f+".txt")),false);
for(int w=0; w<n; w++)
writer.write(reader.nextLine());
reader.nextLine();
while(reader.hasNextLine())
writer.write(reader.nextLine());
} catch(Exception e){
System.err.println("Enjoy the stack trace!");
e.printStackTrace();
}
}
It gives me strange errors. It says "NoSuchElementException" and "no line found" in the stack trace. It points to different lines; it seems that any of the nextLine() calls can do this. Is it possible to delete a line this way? If so, what am I doing wrong? If not, why? (BTW, just in case you'd want this, the text file is about 500 lines. I don't know if that counts as large or even matters, though.)
As others have pointed out, you might be better off using a temporary file if there's the slightest risk that your program crashes midway:
public static void removeNthLine(String f, int toRemove) throws IOException {
File tmp = File.createTempFile("tmp", "");
BufferedReader br = new BufferedReader(new FileReader(f));
BufferedWriter bw = new BufferedWriter(new FileWriter(tmp));
for (int i = 0; i < toRemove; i++)
bw.write(String.format("%s%n", br.readLine()));
br.readLine();
String l;
while (null != (l = br.readLine()))
bw.write(String.format("%s%n", l));
br.close();
bw.close();
File oldFile = new File(f);
if (oldFile.delete())
tmp.renameTo(oldFile);
}
(Beware of the sloppy treatment of encodings, new-line characters and exception handling.)
However, I don't like answering questions with "I won't tell you how, because you shouldn't do it anyway.". (In some other situation for instance, you may be working with a file that's larger than half your hard drive!) So here goes:
You need to use a RandomAccessFile instead. Using this class you can both read and write to the file using the same object:
public static void removeNthLine(String f, int toRemove) throws IOException {
RandomAccessFile raf = new RandomAccessFile(f, "rw");
// Leave the n first lines unchanged.
for (int i = 0; i < toRemove; i++)
raf.readLine();
// Shift remaining lines upwards.
long writePos = raf.getFilePointer();
raf.readLine();
long readPos = raf.getFilePointer();
byte[] buf = new byte[1024];
int n;
while (-1 != (n = raf.read(buf))) {
raf.seek(writePos);
raf.write(buf, 0, n);
readPos += n;
writePos += n;
raf.seek(readPos);
}
raf.setLength(writePos);
raf.close();
}
You cannot do it this way. A FileWriter either truncates the file when it is opened or, with the append flag set to true, writes at the end of it; it cannot write in the middle of a file. You need a RandomAccessFile if you want to write in the middle. What happens now is that the file is overwritten (emptied) as soon as the writer is opened, which is why the Scanner finds no lines and you get the exception.
I'd really recommend writing to a new file and then renaming it at the end.
@shelley: you can't do what you are trying to do and, what's more, you shouldn't. You should read the file and write to a temporary file for several reasons: for one, it's possible to do it this way (as opposed to what you're trying to do), and for another, if the process gets corrupted, you could bail out without loss of the original file. Now, you could update a specific location of a file using a RandomAccessFile, but this is usually done (in my experience) when you are dealing with fixed-size records rather than typical text files.

Reading a String that has n length from InputStream or Reader

I know that I can do this, but I also want to know whether there is a shorter way. For example, why is there no method with the prototype public String readString(int len); in the Reader class hierarchy that would do what I want with a single call?
InputStream in = new FileInputStream("abc.txt");
InputStreamReader inReader = new InputStreamReader(in);
char[] foo = new char[5];
inReader.read(foo);
System.out.println(new String(foo));
// I think this way is too long
// for reading a string that has only 5 character
// from InputStream or Reader
In Python 3, I can do this very easily for UTF-8 and other files. Consider the following code.
fl = open("abc.txt", mode="r", encoding="utf-8")
fl.read(1) # returns string that has 1 character
fl.read(3) # returns string that has 3 characters
How can I do it in Java ?
Thanks.
How can I do it in Java ?
The way you're already doing it.
I'd recommend doing it in a reusable helper method, e.g.
final class IOUtil {
public static String read(Reader in, int len) throws IOException {
char[] buf = new char[len];
int charsRead = in.read(buf);
return (charsRead == -1 ? null : new String(buf, 0, charsRead));
}
}
Then use it like this:
try (Reader in = Files.newBufferedReader(Paths.get("abc.txt"), StandardCharsets.UTF_8)) {
System.out.println(IOUtil.read(in, 5));
}
If you want to make a best effort to read as many as the specified number of characters, you may use
int len = 4;
String result;
try(Reader r = new FileReader("abc.txt")) {
CharBuffer b = CharBuffer.allocate(len);
do {} while(b.hasRemaining() && r.read(b) > 0);
result = b.flip().toString();
}
System.out.println(result);
While the Reader may read fewer than the specified number of characters (depending on the underlying stream), it will read at least one character before returning, or return -1 to signal the end of the stream. So the code above will loop until it has either read the requested number of characters or reached the end of the stream.
That said, a FileReader will usually read all requested characters in one go and return fewer only when reaching the end of the file.

Does RandomAccessFile in java read entire file in memory?

I need to read last n lines from a large file (say 2GB). The file is UTF-8 encoded.
I would like to know the most efficient way of doing it. I read about RandomAccessFile in Java, but does the seek() method read the entire file into memory? It uses a native implementation, so I wasn't able to refer to the source code.
RandomAccessFile.seek just sets the file-pointer current position, no bytes are read into memory.
Since your file is UTF-8 encoded, it is a text file. For reading text files we typically use BufferedReader; Java 7 even added a convenience method, Files.newBufferedReader, to create a BufferedReader that reads text from a file. That is easy to implement, though it may be inefficient for reading the last n lines.
To be efficient, we need RandomAccessFile and have to read the file backwards starting from the end. Here is a basic example:
public static void main(String[] args) throws Exception {
int n = 3;
List<String> lines = new ArrayList<>();
try (RandomAccessFile f = new RandomAccessFile("test", "r")) {
ByteArrayOutputStream bout = new ByteArrayOutputStream();
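// p >= 0 so that the very first byte of the file is examined as well
for (long length = f.length(), p = length - 1; p >= 0 && lines.size() < n; p--) {
f.seek(p);
int b = f.read();
if (b == 10) {
if (p < length - 1) {
lines.add(0, getLine(bout));
bout.reset();
}
} else if (b != 13) {
bout.write(b);
}
}
// the start of the file was reached: what is still buffered is the first line
if (lines.size() < n && bout.size() > 0) {
lines.add(0, getLine(bout));
}
}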
System.out.println(lines);
}
static String getLine(ByteArrayOutputStream bout) {
byte[] a = bout.toByteArray();
// reverse bytes
for (int i = 0, j = a.length - 1; j > i; i++, j--) {
byte tmp = a[j];
a[j] = a[i];
a[i] = tmp;
}
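// decode explicitly as UTF-8 instead of relying on the platform default charset
return new String(a, java.nio.charset.StandardCharsets.UTF_8);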
}
It reads the file byte by byte, starting from the tail, into a ByteArrayOutputStream; when an LF is reached, it reverses the bytes and creates a line.
Two things need to be improved:
buffering
EOL recognition
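One way those two improvements could look is sketched below: the file is scanned backwards in fixed-size blocks instead of single bytes, and both "\n" and "\r\n" are accepted as terminators. The class name BufferedTail and the block size are arbitrary choices for this sketch, and it assumes the last n lines together are small enough to fit in one byte array.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
final class BufferedTail {
    private static final int BLOCK_SIZE = 8192; // arbitrary block size

    static List<String> lastLines(String path, int n) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long end = f.length();
            if (n <= 0 || end == 0) {
                return new ArrayList<>();
            }
            // Ignore the terminator of the final line ("\n" or "\r\n").
            if (byteAt(f, end - 1) == '\n') {
                end--;
                if (end > 0 && byteAt(f, end - 1) == '\r') {
                    end--;
                }
            }
            byte[] block = new byte[BLOCK_SIZE];
            long start = end; // becomes the offset where the last n lines begin
            int newlines = 0;
            scan:
            while (start > 0) {
                // Read the block that ends where the previous one started.
                int size = (int) Math.min(BLOCK_SIZE, start);
                long blockStart = start - size;
                f.seek(blockStart);
                f.readFully(block, 0, size);
                for (int i = size - 1; i >= 0; i--) {
                    if (block[i] == '\n' && ++newlines == n) {
                        start = blockStart + i + 1; // first byte after the n-th newline from the end
                        break scan;
                    }
                }
                start = blockStart;
            }
            // Decode the tail once, so multi-byte UTF-8 characters are never split.
            byte[] tail = new byte[(int) (end - start)];
            f.seek(start);
            f.readFully(tail);
            String text = new String(tail, StandardCharsets.UTF_8);
            return new ArrayList<>(Arrays.asList(text.split("\r?\n", -1)));
        }
    }

    private static int byteAt(RandomAccessFile f, long pos) throws IOException {
        f.seek(pos);
        return f.read();
    }
}
```

Searching for the byte 0x0A is safe in UTF-8 because that value never occurs inside a multi-byte character.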
If you need random access, you need RandomAccessFile. You can convert the bytes you get from it into UTF-8 text if you know what you are doing.
If you use BufferedReader, you can use skip(n) to skip a number of characters, but that means it has to read the whole file.
A way to combine the two is to use FileInputStream with skip(), find where you want to read from by reading back N newlines, and then wrap the stream in a BufferedReader to read the lines with UTF-8 encoding.

Binary Search using Java on a UTF-8 encoded text file where line size is not fixed

I have a tab separated UTF-8 file, where the records are sorted on one field. But, the line size is not fixed, so cannot jump into a particular position directly. How can I perform binary search on this?
Example:
line 1: Alfred Brendel /m/011hww /m/0crsgs6,/m/0crvt9h,/m/0cs5n_1,/m/0crtj4t,/m/0crwpnw,/m/0cr_n2s,/m/0crsgyh
line 2: Rupert Sheldrake /m/011ybj /m/0crtszs
You know the number of bytes your whole file contains. Let's say n.
-> search interval [l, r] with l=0, r=n.
Estimate the middle of your search interval, m = l + (r-l)/2. From this location, go as many bytes to the left (right would also work) as needed until you find a tab character (byte == 9; 9 is the ASCII and UTF-8 code for a tab) [let's name this position mReal] and decode the one line starting at that tab.
Determine whether you have to take the first 'half' (=> new search interval is [l, mReal]) or the second 'half' (=> new search interval is [mReal, r]) for the next search step.
public class YourTokenizer {
public static final String EPF_EOL = "\t";
public static final int READ_SIZE = 4 * 1024 ;
/** The EPF stream buffer. */
private StringBuilder buffer = new StringBuilder();
/** The EPF stream. */
private InputStream stream = null;
public YourTokenizer(final InputStream stream) {
this.stream = stream;
}
public String getNextLine() throws IOException {
int pos = buffer.indexOf(EPF_EOL);
while (pos == -1) {
// end-of-line sequence isn't available yet, read more of the file
final byte[] bytes = new byte[READ_SIZE];
final int readSize = stream.read(bytes, 0, READ_SIZE);
if (readSize == -1) {
// end of the stream: hand back what is left, or null when nothing remains
if (buffer.length() == 0) {
return null;
}
final String rest = buffer.toString();
buffer.setLength(0);
return rest;
}
// append only the bytes that were actually read; note that a multi-byte
// UTF-8 character split across two reads is not handled here
buffer.append(new String(bytes, 0, readSize, java.nio.charset.StandardCharsets.UTF_8));
pos = buffer.indexOf(EPF_EOL);
}
final String data = buffer.substring(0, pos);
pos += EPF_EOL.length();
buffer = buffer.delete(0, pos);
return data;
}
}
and in main:
final InputStream stream = new FileInputStream(file);
final YourTokenizer tokenizer = new YourTokenizer(stream);
String line = tokenizer.getNextLine();
while (line != null) {
//do something
line = tokenizer.getNextLine();
}
You can jump to the middle of the file's bytes. From there you can find the end of that line and read the next line from that point. If you need to search back, take the one-quarter point or the three-quarter point and find the line each time. Eventually you will narrow it down to one line.
I think you can guess the line length from the file size.
When you can't even guess the length of the lines, I think it is better to choose a position by generating a random number.
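The byte-offset bisection described in these answers can be sketched as follows. This is an illustration under several assumptions that are not stated in the question: the sort key is the first tab-separated field, lines end with "\n", and the file is sorted in the same order that String.compareTo produces. The names SortedFileSearch and find are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class SortedFileSearch {
    /** Returns a line whose first tab-separated field equals key, or null if there is none. */
    static String find(RandomAccessFile raf, String key) throws IOException {
        long lo = 0;            // always the start offset of a line
        long hi = raf.length(); // exclusive bound: a line start or the end of the file
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            // Walk left from mid to the start of the line that contains it.
            long lineStart = mid;
            while (lineStart > lo) {
                raf.seek(lineStart - 1);
                if (raf.read() == '\n') {
                    break;
                }
                lineStart--;
            }
            // Read that line as raw bytes and decode it as UTF-8.
            raf.seek(lineStart);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            for (int b = raf.read(); b != -1 && b != '\n'; b = raf.read()) {
                bytes.write(b);
            }
            long nextLineStart = raf.getFilePointer();
            String line = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
            int tab = line.indexOf('\t');
            String field = tab < 0 ? line : line.substring(0, tab);
            int cmp = field.compareTo(key);
            if (cmp == 0) {
                return line;
            }
            if (cmp < 0) {
                lo = nextLineStart; // the key can only be in a later line
            } else {
                hi = lineStart;     // the key can only be in an earlier line
            }
        }
        return null;
    }
}
```

Each step discards at least the line that was just examined, so the loop always terminates. A production version would read blocks rather than single bytes.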

Java : Read last n lines of a HUGE file

I want to read the last n lines of a very big file without reading the whole file into any buffer/memory area using Java.
I looked around the JDK APIs and Apache Commons I/O and am not able to locate one which is suitable for this purpose.
I was thinking of the way tail or less does it in UNIX. I don't think they load the entire file and then show the last few lines of the file. There should be similar way to do the same in Java too.
I found that the simplest way to do this is to use ReversedLinesFileReader from the Apache Commons IO API.
It gives you the lines of a file from bottom to top, and you can set the n_lines value to specify the number of lines.
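import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;
File file = new File("D:\\file_name.xml");
int n_lines = 10;
int counter = 0;
// try-with-resources closes the reader; the charset is assumed to be UTF-8
try (ReversedLinesFileReader object = new ReversedLinesFileReader(file, StandardCharsets.UTF_8)) {
String line;
// stop early when the file has fewer than n_lines lines (readLine() returns null)
while (counter < n_lines && (line = object.readLine()) != null) {
System.out.println(line);
counter++;
}
}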
If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.
If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth last line begins, you can seek to there and just read-and-print.
An initial best-guess assumption can be made based on your data properties. For example, if it's a text file, it's possible the line lengths won't exceed an average of 132 so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that - example: if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
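A minimal sketch of this guess-and-retry idea follows. It simply doubles the window on each failed attempt rather than using the refined estimate described above, and it assumes UTF-8 text whose tail fits in memory. GuessTail and guessedLineLength are names invented for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
final class GuessTail {
    static List<String> lastLines(String path, int n, int guessedLineLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (n <= 0 || length == 0) {
                return new ArrayList<>();
            }
            // First guess: n lines of the guessed length, measured back from the end.
            long window = Math.max(1L, (long) n * guessedLineLength);
            while (true) {
                long start = Math.max(0L, length - window);
                byte[] chunk = new byte[(int) (length - start)];
                raf.seek(start);
                raf.readFully(chunk);
                String text = new String(chunk, StandardCharsets.UTF_8);
                // split() without a limit drops the empty piece after the final terminator.
                List<String> lines = new ArrayList<>(Arrays.asList(text.split("\r?\n")));
                if (start > 0 && !lines.isEmpty()) {
                    lines.remove(0); // probably a partial line (or a broken character), so discard it
                }
                if (lines.size() >= n || start == 0) {
                    int from = Math.max(0, lines.size() - n);
                    return new ArrayList<>(lines.subList(from, lines.size()));
                }
                window *= 2; // not enough lines: back up further and try again
            }
        }
    }
}
```

Because the first piece of every chunk is thrown away, it does not matter if the seek lands in the middle of a multi-byte character. Trailing blank lines are dropped by split(), which may or may not be what you want.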
RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.
If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work in any circumstances. (It reads a string in modified UTF-8 that is preceded by a two-byte length ...)
Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed length encodings (e.g. flavors of UTF-16 or UTF-32) you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.
In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a second / third byte, or an illegal UTF-8 sequence. See The Unicode Standard, Version 5.2, Chapter 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream will represent a LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8) if we can assume that the other kinds of Unicode line separator (U+2028, U+2029 and U+0085) are not used. If you can't assume that, then the code would be more complicated.
Having identified a proper character boundary, you can then just call new String(...) passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
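To make the boundary rule concrete, here is a small sketch of the two building blocks: a test for the four lead-byte patterns listed above, and a lastIndexOf loop that finds where the last n lines start in an already decoded string. Utf8Boundary and its method names are invented for this example, and only '\n' is treated as a line terminator.

```java
// Hypothetical helper, not part of any library.
final class Utf8Boundary {
    /** True if b matches 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx, i.e. it can start a character. */
    static boolean isLeadByte(byte b) {
        return (b & 0x80) == 0x00
            || (b & 0xE0) == 0xC0
            || (b & 0xF0) == 0xE0
            || (b & 0xF8) == 0xF0;
    }

    /** Moves offset forward to the next byte that can start a character (or to buf.length). */
    static int alignForward(byte[] buf, int offset) {
        while (offset < buf.length && !isLeadByte(buf[offset])) {
            offset++;
        }
        return offset;
    }

    /** Returns the index in text at which the last n lines begin (0 if there are fewer). */
    static int startOfLastLines(String text, int n) {
        if (n <= 0) {
            return text.length();
        }
        int from = text.length() - 1;
        if (from >= 0 && text.charAt(from) == '\n') {
            from--; // the terminator of the final line does not start a new line
        }
        int newline = -1;
        for (int i = 0; i < n; i++) {
            newline = text.lastIndexOf('\n', from);
            if (newline < 0) {
                return 0; // fewer than n lines available
            }
            from = newline - 1;
        }
        return newline + 1;
    }
}
```

A caller would read a block of bytes near the end of the file, pass it through alignForward, decode from that offset with new String(buf, offset, count, StandardCharsets.UTF_8), and then use startOfLastLines on the result.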
The ReversedLinesFileReader can be found in the Apache Commons IO java library.
int n_lines = 1000;
ReversedLinesFileReader object = new ReversedLinesFileReader(new File(path));
String result="";
for(int i=0;i<n_lines;i++){
String line=object.readLine();
if(line==null)
break;
result+=line;
}
return result;
I found RandomAccessFile and the buffered reader classes too slow for me. Nothing I tried was faster than tail -<#lines>, so this was the best solution for me.
public String getLastNLogLines(File file, int nLines) {
StringBuilder s = new StringBuilder();
try {
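// pass the arguments as an array so that a path containing spaces is not split
Process p = Runtime.getRuntime().exec(new String[] {"tail", "-n", String.valueOf(nLines), file.getPath()});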
java.io.BufferedReader input = new java.io.BufferedReader(new java.io.InputStreamReader(p.getInputStream()));
String line = null;
//Here we first read the next line into the variable
//line and then check for the EOF condition, which
//is the return value of null
while((line = input.readLine()) != null){
s.append(line+'\n');
}
} catch (java.io.IOException e) {
e.printStackTrace();
}
return s.toString();
}
Use CircularFifoBuffer from Apache Commons Collections; see the answer to a similar question at How to read last 5 lines of a .txt file into java
Note that in Apache Commons Collections 4 this class seems to have been renamed to CircularFifoQueue
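The same bounded-buffer idea can be sketched without the Commons Collections dependency. The example below uses java.util.ArrayDeque as a stand-in, so it does not show the CircularFifoQueue API itself. The class name DequeTail is made up, and the file is assumed to be UTF-8. Note that this approach still reads the whole file sequentially; it only bounds the memory used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of any library.
final class DequeTail {
    static List<String> lastLines(Path file, int n) throws IOException {
        Deque<String> window = new ArrayDeque<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                window.addLast(line);
                if (window.size() > n) {
                    window.removeFirst(); // evict the oldest line once the window is full
                }
            }
        }
        return new ArrayList<>(window); // oldest retained line first
    }
}
```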
package com.uday;
import java.io.File;
import java.io.RandomAccessFile;
public class TailN {
public static void main(String[] args) throws Exception {
long startTime = System.currentTimeMillis();
TailN tailN = new TailN();
File file = new File("/Users/udakkuma/Documents/workspace/uday_cancel_feature/TestOOPS/src/file.txt");
tailN.readFromLast(file);
System.out.println("Execution Time : " + (System.currentTimeMillis() - startTime));
}
public void readFromLast(File file) throws Exception {
int lines = 3;
int readLines = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
long fileLength = file.length() - 1;
// Set the pointer at the last of the file
randomAccessFile.seek(fileLength);
for (long pointer = fileLength; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
char c;
// read from the last, one char at the time
c = (char) randomAccessFile.read();
// break when end of the line
if (c == '\n') {
readLines++;
if (readLines == lines)
break;
}
builder.append(c);
}
// Since line is read from the last so it is in reverse order. Use reverse
// method to make it correct order
builder.reverse();
System.out.println(builder.toString());
}
}
}
A RandomAccessFile allows for seeking (http://download.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html). The File.length method will return the size of the file. The problem is determining number of lines. For this, you can seek to the end of the file and read backwards until you have hit the right number of lines.
I had a similar problem, but I didn't understand the other solutions.
I used this. I hope it's simple code.
// String filePathName = (directory and file name).
File f = new File(filePathName);
long fileLength = f.length(); // Size of the file [bytes].
long fileLength_toRead = 0;
if (fileLength > 2000) {
// My file content is a table; I know one row has about e.g. 100 bytes / characters.
// I start reading 1000 bytes before the end of the file.
// If you don't know the line length, use @paxdiablo's advice.
fileLength_toRead = fileLength - 1000;
}
try (RandomAccessFile raf = new RandomAccessFile(filePathName, "r")) { // This line manages opening and closing the file.
raf.seek(fileLength_toRead); // Reading will begin at this byte.
String rowInFile = raf.readLine(); // The first line read is usually incomplete; I don't need it.
rowInFile = raf.readLine();
while (rowInFile != null) {
// Here I can add the lines read (rowInFile) to a String[] array or an ArrayList<String>.
// Later I can work with the rows from the array - the last row is sometimes empty, etc.
rowInFile = raf.readLine();
}
}
catch (IOException e) {
//
}
Here is working code for this.
private static void printLastNLines(String filePath, int n) {
File file = new File(filePath);
StringBuilder builder = new StringBuilder();
try {
RandomAccessFile randomAccessFile = new RandomAccessFile(filePath, "r");
long pos = file.length() - 1;
randomAccessFile.seek(pos);
for (long i = pos - 1; i >= 0; i--) {
randomAccessFile.seek(i);
char c = (char) randomAccessFile.read();
if (c == '\n') {
n--;
if (n == 0) {
break;
}
}
builder.append(c);
}
builder.reverse();
System.out.println(builder.toString());
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
Here is the best way I've found to do it. Simple and pretty fast and memory efficient.
public static void tail(File src, OutputStream out, int maxLines) throws FileNotFoundException, IOException {
String[] lines = new String[maxLines];
long count = 0; // total number of lines read so far
// note: this still reads the whole file; only the memory use is bounded
try (BufferedReader reader = new BufferedReader(new FileReader(src))) {
for (String line=reader.readLine(); line != null; line=reader.readLine()) {
// overwrite the oldest slot of the ring buffer
lines[(int) (count % maxLines)] = line;
count++;
}
}
OutputStreamWriter writer = new OutputStreamWriter(out);
// start at the oldest retained line and write the lines in file order
long kept = Math.min(count, maxLines);
for (long ndx=count-kept; ndx < count; ndx++) {
writer.write(lines[(int) (ndx % maxLines)]);
writer.write("\n");
}
writer.flush();
}
(See comment)
public String readFromLast(File file, int howMany) throws IOException {
int numLinesRead = 0;
StringBuilder builder = new StringBuilder();
try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
long fileLength = file.length() - 1;
/*
* Set the pointer at the end of the file. If the file is empty, an IOException
* will be thrown
*/
randomAccessFile.seek(fileLength);
for (long pointer = fileLength; pointer >= 0; pointer--) {
randomAccessFile.seek(pointer);
byte b = (byte) randomAccessFile.read();
if (b == '\n') {
numLinesRead++;
// (Last line often terminated with a line separator)
if (numLinesRead == (howMany + 1))
break;
}
baos.write(b);
}
/*
* Since line is read from the last so it is in reverse order. Use reverse
* method to make it ordered correctly
*/
byte[] a = baos.toByteArray();
int start = 0;
int mid = a.length / 2;
int end = a.length - 1;
while (start < mid) {
byte temp = a[end];
a[end] = a[start];
a[start] = temp;
start++;
end--;
}// End while
return new String(a).trim();
} // End inner try-with-resources
} // End outer try-with-resources
} // End method
I tried RandomAccessFile first, and it was tedious to read the file backwards, repositioning the file pointer upon every read operation. So I tried @Luca's solution and got the last few lines of the file as a string with just two lines of code, in a few minutes.
InputStream inputStream = Runtime.getRuntime().exec("tail " + path.toFile()).getInputStream();
String tail = new BufferedReader(new InputStreamReader(inputStream)).lines().collect(Collectors.joining(System.lineSeparator()));
Code is 2 lines only
// Please specify correct Charset
ReversedLinesFileReader rlf = new ReversedLinesFileReader(file, StandardCharsets.UTF_8);
// read last 2 lines
System.out.println(rlf.toString(2));
Gradle:
implementation group: 'commons-io', name: 'commons-io', version: '2.11.0'
Maven:
<dependency>
<groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>2.11.0</version>
</dependency>
