I have a large text file that doesn't contain any line breaks: it's just one huge line of ASCII characters. So far everything works fine, as I can read the whole line into memory in Java, but I'm wondering whether there could be a memory issue once the file grows to 5 GB+ and the program can't read the whole file into memory at once. In that case, what would be the best way to read such a file? Can we break the huge line into two parts, or even multiple chunks?
Here's how I read the file:
BufferedReader buf = new BufferedReader(new FileReader("input.txt"));
String line;
while((line = buf.readLine()) != null){
}
A single String can be at most about 2 billion characters long (the Integer.MAX_VALUE limit) and uses two bytes per character, so even if you could read a 5 GB line it would use about 10 GB of memory.
I suggest you read the text in blocks.
Reader reader = new FileReader("input.txt");
try {
char[] chars = new char[8192];
for(int len; (len = reader.read(chars)) > 0;) {
// process chars.
}
} finally {
reader.close();
}
This will use about 16 KB (8,192 chars at two bytes each) regardless of the size of the file.
There won't be any kind of memory leak, as the JVM has its own garbage collector. However, you will probably run out of heap space.
In cases like this, it is always best to read and process the stream in manageable pieces. Read in 64 MB or so and repeat.
You also might find it useful to add the -Xmx parameter to your java call, in order to increase the maximum heap space available within the JVM.
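For example, to give the JVM a 6 GB heap (MyApp is a placeholder for your main class; pick a size that fits your machine):
java -Xmx6g MyApp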
It's better to read the file in chunks and then concatenate the chunks, or do whatever else you want with them, because if you read a big file in one go you will get heap space issues.
An easy way to do it is shown below:
InputStream is = new FileInputStream("input.txt"); // or any other InputStream
byte[] buffer = new byte[1024];
int read;
while ((read = is.read(buffer)) != -1)
{
    // do whatever you need with buffer[0..read)
}
is.close();
In addition to the idea of reading in chunks, you could also look at memory mapping areas of the file using java.nio.MappedByteBuffer. You would still be limited to a maximum buffer size of Integer.MAX_VALUE. This may be better than explicitly reading chunks if you will be making scattered accesses within a chunk.
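A minimal sketch of that approach (the file name and window size are placeholders; each mapped region is capped at Integer.MAX_VALUE bytes):
// uses java.io.RandomAccessFile, java.nio.MappedByteBuffer, java.nio.channels.FileChannel
try (RandomAccessFile raf = new RandomAccessFile("input.txt", "r");
     FileChannel fc = raf.getChannel()) {
    // map a read-only window over the start of the file (at most ~1 GB here)
    MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(fc.size(), 1 << 30));
    while (map.hasRemaining()) {
        byte b = map.get();
        // process b; for scattered access you can also call map.get(index)
    }
}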
To read chunks from a file, or to copy them to another file, something like this could be used:
FileReader in = new FileReader("input.txt");
FileWriter out = new FileWriter("output.txt");
char[] buffer = new char[1024];
int l;
while ((l = in.read(buffer)) > 0) {
    out.write(buffer, 0, l);
}
in.close();
out.close();
You won't run into any memory leak issues, but you may well hit heap space issues. It all depends on how you are currently reading the line; heap problems can be avoided by reading through a fixed-size buffer instead of holding the whole line.
public void readLongString(String superlongString, int size, BufferedReader in) throws IOException {
    char[] buffer = new char[size];
    for (int i = 0; i < superlongString.length(); i += size) {
        in.read(buffer, 0, size); // the offset argument is into the buffer, not the stream
        // do stuff with the chars read
    }
}
Related
I'm using the following two pieces of code to read a large file.
This one uses a FileReader:
File file = new File("/Users/Desktop/shakes.txt");
FileReader reader = new FileReader(file);
int ch;
long start = System.currentTimeMillis();
while ((ch = reader.read()) != -1) {
System.out.print((char) ch);
}
long end = System.currentTimeMillis();
And the following using a BufferedReader:
File file = new File("/Users/Desktop/shakes.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
int ch;
long start = System.currentTimeMillis();
while ((ch = reader.read()) != -1) {
System.out.print((char) ch);
}
long end = System.currentTimeMillis();
Going by the documentation for BufferedReader:
It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
Given this documentation and the default buffer size of 8192 of the BufferedReader class, shouldn't the overall time for reading the file with BufferedReader be quicker? Currently, both pieces of code run in ~3000ms on my machine. However, if I use 'readLine' in the BufferedReader, the performance substantially improves (~200ms).
Thoughts on something that I'm missing? Is it not expected that, even with the read() method, BufferedReader should give better performance than reading from a plain FileReader?
Using BufferedReader is indeed faster than using just FileReader.
I executed your code on my machine, with the following text file https://norvig.com/big.txt (6MB).
The initial results show roughly the same time: about 17 seconds each.
However, this is because System.out.print() inside the loop is the bottleneck. Without the print, BufferedReader is about four times faster: 50 ms vs 200 ms (compare that to the 17 s before).
In other words, don't use System.out.print() when benchmarking.
Example
An improved benchmark could look like this, using a StringBuilder to accumulate the output:
File file = new File("/Users/Desktop/shakes.txt");
FileReader reader = new FileReader(file);
int ch;
StringBuilder sb = new StringBuilder();
long start = System.currentTimeMillis();
while ((ch = reader.read()) != -1) {
//System.out.print((char) ch);
sb.append((char) ch);
}
long end = System.currentTimeMillis();
System.out.println(sb);
The above code provides the same output but performs much faster. It will accurately show the difference in speed when using a BufferedReader.
Thoughts on something that I'm missing?
It should be faster to read a file a character at a time from a BufferedReader than from a FileReader (by orders of magnitude!), so I suspect the problem is in your benchmarks.
Your benchmark is measuring both reading the file and writing it to standard output, so your performance figures will be distorted by the overheads of writing the output. And if your output is being written to a console, those overheads include painting characters to the screen ... and scrolling.
Your benchmark takes no account of JVM startup overheads.
Your benchmark doesn't (obviously) take account of file caching. (The first time a file is read, it will be read from disk. If you read it again soon afterwards, you may be reading a copy of the file cached in memory by the operating system, which will be faster.)
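A rough way to reduce those effects is to warm up and time several runs inside the same JVM, without printing inside the loop (a sketch only; for serious measurements use a benchmarking harness such as JMH):
for (int run = 0; run < 5; run++) { // repeat so later runs see a warm JVM and OS file cache
    long start = System.nanoTime();
    long count = 0;
    try (BufferedReader r = new BufferedReader(new FileReader(file))) {
        while (r.read() != -1) count++; // consume the stream without touching the console
    }
    System.out.println("run " + run + ": " + (System.nanoTime() - start) / 1_000_000 + " ms, " + count + " chars");
}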
I have this code:
public static void main(String[] args) throws IOException {
System.out.println("Reading file...");
String content = readFile(args[0]);
System.out.println("Done reading file.");
}
private static String readFile(String file) throws IOException {
BufferedReader reader = new BufferedReader( new FileReader (file));
String line = null;
StringBuilder stringBuilder = new StringBuilder();
while( ( line = reader.readLine() ) != null ) {
stringBuilder.append( line );
}
return stringBuilder.toString();
}
The readFile method works fine, well, for small files.
The thing I noticed is that it takes too much memory.
If I open Task Manager on Windows (Ctrl-Shift-Esc), I see the java process taking up to 1.8 GB of RAM, while the size of my file is just 550 MB.
Yes, I know loading a file entirely into memory isn't a good idea; I'm doing this just out of curiosity.
The program gets stuck at "Reading file...": the newly created java process starts at a bunch of MB of RAM and climbs up to 1.8 GB.
I also tried using String concatenation instead of StringBuilder, but I got exactly the same result.
Why does it take so much memory? Is the final stringBuilder.toString() call causing this?
You have to remember how these libraries work.
One byte on disk can turn into a two-byte char in memory. A StringBuilder grows by doubling its capacity, so it can be up to twice as large as you really need, and you need both the StringBuilder and the resulting String in memory at the same time.
So take your example: 550 MB on disk can turn into 1100 MB as chars alone. Because the capacity doubles as it grows, the StringBuilder can round up to roughly the next power of two, i.e. it could reach 2 GB, and this is on top of the String, which would be about another 1.1 GB.
Note: the reason it is not actually using this much memory is that you have a bug: you are discarding all the newlines (\r\n), which means you have fewer characters.
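If you wanted to keep the line terminators, you would append one back in the loop, e.g.:
while ((line = reader.readLine()) != null) {
    stringBuilder.append(line).append('\n'); // readLine() strips the terminator, so add it back
}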
When processing a large file that you don't have enough memory to load at once, you are better off processing the data as you read it.
BTW, if you do have plenty of memory, you can read the file faster, and with less memory, this way:
static String readFile(String file) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
        byte[] bytes = new byte[(int) new File(file).length()];
        in.readFully(bytes); // available() is only an estimate, and a plain read() may return before filling the array
        return new String(bytes);
    }
}
In my servlet, I am currently setting the XML file to a variable like this:
String xmlFileAsString = CharStreams.toString(
        new InputStreamReader(request.getInputStream(), "UTF-8"));
Now after this line I can check whether the file size is too large, etc., but by then the entire file has already been streamed and loaded into memory.
Is there a way for me to get the input stream, but abort the streaming if the file size goes above, say, 10 MB?
You can read the stream sequentially and count the number of characters read. First, don't use CharStreams, since it reads the entire file in one go. Create an InputStreamReader object:
InputStreamReader reader = new InputStreamReader(request.getInputStream(), "UTF-8");
A variable to keep track of the char count:
long charCount = 0;
And then the code to read the file:
char[] cbuf = new char[10240]; // size of the read buffer
int charsRead = reader.read(cbuf); // read first set of chars
StringBuilder buffer = new StringBuilder(); // accumulate the data read here
while (charsRead > 0) {
    charCount += charsRead;
    if (charCount > LIMIT) { // define a LIMIT constant with your size limit
        throw new XMLTooLargeException(); // treat the problem with an exception
    }
    buffer.append(cbuf, 0, charsRead);
    charsRead = reader.read(cbuf); // read the next set of chars
}
String xmlFileAsString = buffer.toString(); //if not too large, get the string
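As a cheap pre-check (my addition, not part of the original answer), you could also look at the declared request size before reading anything; the header can be missing or wrong, so the streaming check above is still needed:
int declared = request.getContentLength(); // -1 if the client did not declare a length
if (declared > LIMIT) {
    throw new XMLTooLargeException();
}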
I need to do two processes on a file: first count the number of lines and compare the count with a value, then read through the file line by line and do validations.
Only if the first check passes do I need to do the second process.
I read the same file over FTP, and when I try to create a second input stream, FTP is still busy reading the current file. The second stream would be created like this:
is1 = ftp.getFile(feedFileName);
Below is the rest of the code:
InputStream is = null;
LineNumberReader lin = null;
LineNumberReader lin1 = null;
is = ftp.getFile(feedFileName);
lin = new LineNumberReader(new InputStreamReader(is));
So can I just do the following?
is1 = is;
Will both streams then have the file contents from start to finish, or will the second object see nothing once the first stream has been read?
Or is the only option left to create a new FTP connection to read the stream separately?
It can, but you would need to "rewind" the InputStream: first call the mark() method on it, and later call reset(). Here are the docs: http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html#reset()
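For a stream that supports it, the pattern looks roughly like this (a sketch only; BufferedInputStream adds mark/reset support, ftp.getFile is the API from the question, and the whole first pass must fit inside the read-ahead limit, so this effectively buffers the file in memory):
InputStream in = new BufferedInputStream(ftp.getFile(feedFileName));
in.mark(10 * 1024 * 1024); // read-ahead limit in bytes; the mark is lost if you read past it
LineNumberReader first = new LineNumberReader(new InputStreamReader(in));
while (first.readLine() != null) { } // first pass: count lines via first.getLineNumber()
in.reset(); // rewind the raw stream to the mark
LineNumberReader second = new LineNumberReader(new InputStreamReader(in));
// second pass: read line by line and validate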
After you are done with the LineNumberReader, close the InputStream is, then re-request the file from FTP; it will no longer be busy. You cannot 'just' read from the same InputStream, as it is probably exhausted by the time the LineNumberReader is done. Furthermore, not all InputStreams support the mark() and reset() methods.
However, I'd suggest that doing the second process only when the first one succeeds might not be the right way. As you're streaming the data anyway, why not stream it into a temporary data structure, then count the lines, and then operate on the same data structure? A sketch of that follows.
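Buffering the lines once and then doing both passes in memory might look like this (uses java.util.List/ArrayList; only reasonable if the file fits in memory, and expectedCount is a hypothetical value to compare against):
List<String> lines = new ArrayList<>();
String line;
while ((line = lin.readLine()) != null) {
    lines.add(line);
}
if (lines.size() == expectedCount) { // first process: check the line count
    for (String l : lines) {
        // second process: validate each line
    }
}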
If your file is not big, you can save the data to a String, like:
StringBuilder sb = new StringBuilder();
byte[] buffer = new byte[1024];
int len;
while ((len = is.read(buffer)) != -1)
    sb.append(new String(buffer, 0, len)); // StringBuilder has no append(byte[], int, int) overload
String data = sb.toString();
Then you can do further things with the String, like counting the lines:
int lineNumber = data.split("\n").length;
I've never worked closely with the Java IO API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is, and how hard it can be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):
f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() doesn't produce UTF-8 strings (nor even byte arrays), but very strange strings with broken encoding, and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read() => fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to report even the number of bytes it has already read, let alone the current position in the file.
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set it, but that doesn't seem to work in my case: it looks like it returns the position of the buffer pre-filling done by BufferedReader, or something like that; these counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file reading interface which would:
allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?
import org.apache.commons.io.input.BoundedInputStream;
FileInputStream file = new FileInputStream(filename);
file.skip(pos1); // note: skip() may skip fewer bytes than requested; loop on its return value if that matters
BufferedReader br = new BufferedReader(
    new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), "UTF-8")
);
If you didn't care about pos2, then you wouldn't need Apache Commons IO.
I wrote this code to read UTF-8 using RandomAccessFile:
//File: CyclicBuffer.java
public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);
public CyclicBuffer(FileChannel channel) {
this.channel = channel;
}
private int read() throws IOException {
return channel.read(buffer);
}
/**
* Returns the byte read
*
* @return the byte read, or -1 if end of file is reached
* @throws IOException
*/
public byte get() throws IOException {
if (buffer.hasRemaining()) {
return buffer.get();
} else {
buffer.clear();
int eof = read();
if (eof == -1) {
return (byte) eof;
}
buffer.flip();
return buffer.get();
}
}
}
//File: UTFRandomFileLineReader.java
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;
public UTFRandomFileLineReader(FileChannel channel) {
this.buffer = new CyclicBuffer(channel);
}
public String readLine() throws IOException {
if (eof) {
return null;
}
byte x = 0;
temp.clear();
while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
if (temp.position() == temp.capacity()) {
temp = addCapacity(temp);
}
temp.put(x);
}
if (x == -1) {
eof = true;
}
temp.flip();
if (temp.hasRemaining()) {
return charset.decode(temp).toString();
} else {
return null;
}
}
private ByteBuffer addCapacity(ByteBuffer temp) {
ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
temp.flip();
t.put(temp);
return t;
}
public static void main(String[] args) throws IOException {
RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
"r");
UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
.getChannel());
int i = 1;
while (true) {
String s = reader.readLine();
if (s == null)
break;
System.out.println("\n line " + i++);
s = s + "\n";
for (byte b : s.getBytes(Charset.forName("utf-8"))) {
System.out.printf("%x", b);
}
System.out.printf("\n");
}
}
}
For @Ken Bloom, a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here.
Also note that this isn't using Java 7's new ARM syntax (try-with-resources, which takes care of the exception handling for file-based resources); it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
* Paths uses the default file system, note no exception thrown at this stage if
* file is missing
*/
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
/*
* newByteChannel is a SeekableByteChannel - this is the fun new construct that
* supports asynch file based I/O, e.g. If you declared an AsynchronousFileChannel
* you could read and write to that channel simultaneously with multiple threads.
*/
    fc = (FileChannel) Files.newByteChannel(file, StandardOpenOption.READ); // the final Java 7 API puts newByteChannel on java.nio.file.Files
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        readBuffer.flip(); // limit = bytes just read, position = 0
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        readBuffer.clear(); // ready for the next read (note: a multi-byte char split across reads will decode incorrectly)
    }
}
finally
{
    if (fc != null)
        fc.close();
}
Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
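Putting that together, a sketch might look like this (pos1 and pos2 are the positions from the question):
RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
raf.seek(pos1);
byte[] rawBytes = new byte[(int) (pos2 - pos1)];
raf.readFully(rawBytes); // blocks until the whole range has been read
raf.close();
BufferedReader br = new BufferedReader(
    new InputStreamReader(new ByteArrayInputStream(rawBytes), Charset.forName("UTF-8")));
String line;
while ((line = br.readLine()) != null) {
    // each line is decoded as UTF-8
}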
I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.
UTF-8 doesn't use a fixed number of bytes per character. I'm assuming from your post that you are using single-byte characters, so, for example, 412 bytes would mean 412 characters. But if the string used double-byte characters, you would get only 206 characters.
The original java.io package didn't deal well with this multi-byte confusion, so more classes were added to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out): the stream classes provide direct data I/O without any conversion, while the reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
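A quick illustration of the byte/character mismatch:
String s = "héllo"; // 'é' encodes as two bytes in UTF-8
System.out.println(s.length()); // 5 characters
System.out.println(s.getBytes(Charset.forName("UTF-8")).length); // 6 bytes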
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to skip past X characters and then start reading text. Alternatively, I prefer the overloaded read() method, since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader(new File("x.txt"));
fr.skip(pos1); // skip the first pos1 characters
char[] buffer = new char[pos2 - pos1];
fr.read(buffer, 0, buffer.length); // the offset argument is into the buffer, not the file
...
I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.
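On that last point: FileReader always decodes with the platform default charset, so to be safe with UTF-8 you could build the reader with an explicit charset instead (a sketch):
FileDescriptor fd = raFile.getFD();
BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream(fd), Charset.forName("UTF-8")));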
The Java IO API is very flexible. Unfortunately, sometimes the flexibility makes it verbose. The main idea here is that there are many streams, writers, and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream, and the same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately, some streams, writers, and readers have convenient constructors that simplify coding. If you want to read a file you just have to write:
InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
if (in.markSupported()) {
in.skip(1024);
in.read();
}
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new IO", or NIO. New IO is non-blocking; that is its main advantage. You can search the internet for a "nio java tutorial" and read about it, but it is more complicated than regular IO and is not needed for most applications.