I want to read a large text file (several GB) and process it without loading the whole thing into memory, reading it in chunks instead. (The processing involves counting word occurrences.)
If I'm using a ConcurrentHashMap to process the file in parallel and make it more efficient, is there a way to use NIO or RandomAccessFile to read it in chunks? Would that make it even more efficient?
The current implementation uses a BufferedReader that goes something like this:
while (lines.size() <= numberOfLines && (line = bufferedReader.readLine()) != null) {
    lines.add(line);
}
lines.parallelStream()... // processing logic using ConcurrentHashMap
RandomAccessFile only makes sense if you intend to "jump" around within the file, and your description of what you're doing doesn't sound like that. NIO makes sense when you have to cope with lots of parallel communication going on and want non-blocking operations, e.g. on sockets. That doesn't seem to be your use case either.
So my suggestion is to stick with the simple approach of using a BufferedReader on top of an InputStreamReader wrapping a FileInputStream (don't use FileReader, because it doesn't let you specify the charset/encoding) and go through the data as you showed in your sample code. Leave out the parallelStream at first; only if you see poor performance should you try it.
Always remember: Premature optimization is the root of all evil.
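A minimal sketch of that approach: read a fixed number of lines, hand the chunk to parallelStream, and collect counts in a ConcurrentHashMap. The file name, chunk size, and whitespace tokenization are all placeholder assumptions:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;

    public class ChunkedWordCount {
        public static void main(String[] args) throws IOException {
            ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
            int chunkSize = 100_000; // lines per chunk; tune to your heap

            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new FileInputStream("big.txt"), StandardCharsets.UTF_8))) {
                List<String> chunk = new ArrayList<>(chunkSize);
                String line;
                while ((line = reader.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == chunkSize) { // chunk full: count it, then reuse the list
                        countWords(chunk, counts);
                        chunk.clear();
                    }
                }
                countWords(chunk, counts); // whatever is left over
            }
            System.out.println("distinct words: " + counts.size());
        }

        static void countWords(List<String> lines, ConcurrentHashMap<String, Long> counts) {
            lines.parallelStream()
                 .flatMap(l -> Arrays.stream(l.split("\\s+"))) // naive tokenization
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> counts.merge(w, 1L, Long::sum)); // atomic per-key update
        }
    }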
The obvious Java 8 solution is:
String content = Files.readAllLines(Paths.get("file"), StandardCharsets.UTF_8)
        .stream()
        .reduce("", (a, b) -> a + b);
Honestly I have no idea if it is faster, but be aware that readAllLines materializes the entire file as a List<String> in memory, so for a multi-GB file it does not satisfy the "don't load the whole file" requirement.
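If laziness is the goal, Files.lines (Java 8) streams the file line by line instead of materializing it; a minimal sketch, with the same placeholder file name:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LazyLines {
        public static void main(String[] args) throws IOException {
            // Files.lines is lazy: lines are read as the stream is consumed,
            // so the whole file never sits in memory at once
            try (Stream<String> lines = Files.lines(Paths.get("file"), StandardCharsets.UTF_8)) {
                long nonEmpty = lines.filter(l -> !l.isEmpty()).count();
                System.out.println(nonEmpty);
            }
        }
    }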
In a Spring Boot app, I am reading CSV file data using OpenCSV, and it is possible to hand it either a FileReader or a BufferedReader. However, when I compare the two, I have a dilemma over the following point:
BufferedReader is faster than FileReader, but it uses much more memory.
As I am reading multiple data files (each having hundreds of thousands of records) in the same method (first I read data from one CSV, and then use the retrieved id fields to read the second CSV), I think I shouldn't use BufferedReader, for the sake of lower memory usage. But I am really not sure what the proper approach is.
So, in this situation, should I prefer FileReader over BufferedReader?
Generally speaking, it depends on your constraints. If performance is the issue, allocate more resources and go for the faster solution. If memory is the issue, do the reverse.
With BufferedReader you can also use the (Reader, int) constructor to set a buffer size that suits your needs:

    BufferedReader reader = new BufferedReader(underlyingReader, bufferSize);
Another general rule of thumb: don't do premature optimization, be it for memory or performance. Strive for clean code; if a problem arises, use a profiler to identify the bottlenecks and then deal with them.
As far as I know, the difference lies simply in the buffer size, which by default is 8 KB or 16 KB, so the difference in memory isn't huge. The most important thing is to remember to free the resources when you don't use them anymore by calling close(), and to do so even when exceptions are thrown.
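To illustrate that closing advice: a try-with-resources block closes the reader even when an exception is thrown, so there is no need for a manual finally. The file name and buffer size here are just examples:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SafeClose {
        public static void main(String[] args) throws IOException {
            // the reader is closed automatically, even if readLine() throws
            try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"), 16 * 1024)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process the record here
                }
            }
        }
    }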
I am working on a project in which I have to handle many files. The problem came when I had to support files laid out in different ways, such as:
A file containing one string per line.
A file containing a number of characters per line, e.g.:
1st line: A B 4
2nd line: 6 C A 6 & U #
etc.
A file containing a number of strings per line, e.g.:
1st line: Lion Panther jaguar etc.
I have read about how to handle files efficiently, but I am confused about when to use buffered streams and when unbuffered ones. If I use buffered streams, when should it be BufferedInputStream, and when BufferedReader/BufferedWriter?
Similarly, I am confused by the plain I/O streams, file I/O streams, and ByteArray I/O streams. There are so many options. Can anyone suggest when to use which one, and why? What would efficient handling look like in the different scenarios?
Well, there might not be a direct answer to this, but you don't have to worry if you feel confused. Buffered versus unbuffered has been discussed many times before.
For example, this link: buffered vs non-buffered gives a good hint (check the accepted answer). The reason is that with buffered streams, the data passes through a small area of memory called (surprisingly) a buffer. The same happens to written data: it goes into the buffer before being flushed to storage. This improves performance because it lowers the overhead of the underlying I/O operations (which are OS-dependent). Check the Java docs: Buffered Streams.
So, to make it clear: use buffered streams when you need to improve the performance of your I/O operations. Use unbuffered streams when you want to ensure the output has been written before continuing (because an error might always occur while data is still sitting in the buffer). An example is writing a log that is kept open the whole time, where there is no need for a buffer.
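A small sketch of the difference, assuming a hypothetical input.txt; the unbuffered loop issues a read against the OS for every byte, while the buffered one serves most reads from memory:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class BufferedVsUnbuffered {
        public static void main(String[] args) throws IOException {
            // unbuffered: each read() goes to the OS, one byte at a time
            try (InputStream raw = new FileInputStream("input.txt")) {
                while (raw.read() != -1) { /* slow for large files */ }
            }
            // buffered: reads are served from an in-memory buffer (8 KB by default)
            try (InputStream buffered = new BufferedInputStream(new FileInputStream("input.txt"))) {
                while (buffered.read() != -1) { /* most reads never touch the OS */ }
            }
        }
    }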
I want to constantly write data to disk.
And I want to flush the data to disk frequently (for example, after every 64 MB chunk). What solution can you propose?
I think a standard OutputStream might be a better choice than nio.channels because it is more straightforward.
If you are writing a continuous stream of data, for example appending to the end of a file, a regular OutputStream with flush() called once in a while is just as good as or better than NIO. Where NIO could give you a big advantage is writing many small chunks spread over different regions of a file; in that case you could use a memory-mapped file, which could be an improvement over old-style writes. However, from the question I understand you are dealing with a continuous stream of data, so I suggest you implement the regular solution, which gives you code you find nicer, and only look for alternatives if you find performance lacking. In this case I wouldn't expect NIO to make a noticeable difference.
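A sketch of the regular solution: a BufferedOutputStream with a byte counter that calls flush() once 64 MB have accumulated. The file name and data source are placeholders. Keep in mind that flush() only hands the bytes to the OS; if you need them physically on disk you would have to use something like FileChannel.force() or FileOutputStream.getFD().sync().

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    public class PeriodicFlush {
        private static final long FLUSH_EVERY = 64L * 1024 * 1024; // 64 MB

        public static void main(String[] args) throws IOException {
            byte[] chunk = new byte[8192]; // stand-in for data arriving from elsewhere
            long sinceFlush = 0;
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("output.bin", true))) { // append to the end of the file
                for (int i = 0; i < 100_000; i++) {
                    out.write(chunk);
                    sinceFlush += chunk.length;
                    if (sinceFlush >= FLUSH_EVERY) {
                        out.flush(); // hand the buffered bytes to the OS
                        sinceFlush = 0;
                    }
                }
            } // close() flushes whatever remains
        }
    }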
I am reading from an InputStream and writing what I read into an OutputStream. I also check a few things along the way: for example, if I read an & (ampersand), I need to write &amp; instead.
My code works, but now I wonder whether I have written it in the most efficient way (which I doubt). I read byte by byte (because I need to make these odd modifications). Can somebody who has done this suggest the fastest way? Thanks.
If you are using BufferedInputStream and BufferedOutputStream, then it is hard to make it much faster.
BTW, if you are processing the input as characters as opposed to bytes, you should use readers/writers, i.e. BufferedReader and BufferedWriter.
The code should be reading/writing characters with Readers and Writers. For example, if the stream is in the middle of a UTF-8 sequence, or delivers the second half of a UCS-2 character, and that byte happens to equal the byte value of an ampersand, then byte-level code is going to corrupt the data it is attempting to copy. Code usually lives longer than you would expect it to, and somebody might pick it up later and use it in a situation where this really matters.
As far as being faster or slower, using a BufferedReader will probably help the most. If you're writing to the file system, a BufferedWriter won't make much of a difference, because the operating system will buffer writes for you and it does a good job. If you're writing to a StringWriter, then buffering will make no difference (it may even make things slower), but otherwise buffering your writes ought to help.
You could rewrite it to process arrays, and that might make it faster. You can still do the character-by-character modifications with arrays, but you will have to write more complicated code to handle the boundary conditions. That also needs to be a factor in the decision.
Measure, don't guess, and be wary of opinions from people who aren't informed of all the details. Ultimately, it's up to you to figure out whether it's fast enough for this situation. There is no single answer, because all situations are different.
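For reference, a minimal character-based version of the copy (the class and method names are made up); it reads through a BufferedReader, writes through a BufferedWriter, and substitutes &amp; for each ampersand:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.io.Writer;

    public class EscapingCopy {
        // copies in to out, replacing each '&' with "&amp;"
        static void copyEscaping(Reader in, Writer out) throws IOException {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '&') {
                    writer.write("&amp;");
                } else {
                    writer.write(c);
                }
            }
            writer.flush(); // push buffered output through to the underlying writer
        }

        public static void main(String[] args) throws IOException {
            StringWriter out = new StringWriter();
            copyEscaping(new StringReader("Tom & Jerry"), out);
            System.out.println(out); // Tom &amp; Jerry
        }
    }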
I would prefer to use BufferedReader for reading the input and BufferedWriter for the output. Using regular expressions to match and replace in your input can make your code shorter and easier to follow.
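A line-based sketch of that idea, with placeholder file names; note that String.replace works on literal text, while replaceAll interprets its first argument as a regular expression:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    public class LineEscape {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("in.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(line.replace("&", "&amp;")); // literal replacement, no regex needed
                    out.newLine();
                }
            }
        }
    }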
I have a BufferedReader looping through a file. When I hit a specific case, I would like to continue looping using a different instance of the reader but starting at this point.
Any ideas for a recommended solution? Create a separate reader, use the mark function, etc.?
While waiting for your answer to my comment, I'm stuck with making assumptions.
If it's the linewise input you value, you may be as pleasantly surprised as I was to discover that RandomAccessFile supports a readLine method. RandomAccessFile also gives you fine-grained control over the position via seek() and getFilePointer().
If you want buffered I/O, you may consider wrapping a Reader around a CharBuffer, or a ByteBuffer obtained by mapping the file with the NIO API. This gives you the ability to treat the file as memory, with fine control of the read pointer, and because the data is all in memory, buffering is included free of charge.
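A sketch of the mapped-file idea, assuming a hypothetical data.txt small enough that its decoded contents fit in memory:

    import java.io.IOException;
    import java.nio.CharBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("data.txt"), StandardOpenOption.READ)) {
                MappedByteBuffer bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                CharBuffer chars = StandardCharsets.UTF_8.decode(bytes); // decode the whole mapping
                chars.position(42); // fine-grained control of the read pointer
                // chars now behaves like an in-memory sequence of characters
            }
        }
    }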
Have you looked at BufferedReader's mark method? Used in conjunction with reset, it might meet your needs.
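A sketch of how mark/reset could look, with a made-up marker line; note that mark takes a read-ahead limit, and the mark becomes invalid if you read more characters than that before calling reset:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class MarkResetDemo {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.startsWith("SECTION")) {  // the "specific case" (made up here)
                        reader.mark(1 << 20);          // remember this spot; limit is in chars
                        // ... read ahead, at most 1 << 20 chars ...
                        reader.reset();                // jump back to the marked position
                        break;
                    }
                }
            }
        }
    }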
If you keep track of how many characters you've read so far, you can create a new BufferedReader and use skip.
As Noel has pointed out, you would need to avoid using BufferedReader.readLine(), since readLine() discards the newlines and would make your character count inaccurate. You probably shouldn't count on readLine() never getting called if anyone else will ever have to maintain your code.
If you do decide to use skip, you should write your own buffered Reader that tracks an offset which counts the newlines as well.
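A sketch of the skip-based approach with a made-up stop condition; note that skip may skip fewer characters than requested, so production code should loop until the full offset is consumed:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SkipToOffset {
        public static void main(String[] args) throws IOException {
            long offset = 0;
            // first pass: count every character, including newlines, using read()
            try (BufferedReader first = new BufferedReader(new FileReader("data.txt"))) {
                int c;
                while ((c = first.read()) != -1) {
                    offset++;
                    if (c == '#') break; // the "specific case" (made up here)
                }
            }
            // second reader: jump straight to where the first one stopped
            try (BufferedReader second = new BufferedReader(new FileReader("data.txt"))) {
                long skipped = second.skip(offset);
                System.out.println("resumed at character " + skipped);
            }
        }
    }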