Does anyone happen to know if there is any performance difference between the two methods of reading an input file below?
Thanks.
1) Reading a file with Scanner and File
Scanner input = new Scanner(new File("foo.txt"));
2) Reading a file with InputStreamReader and FileInputStream
InputStreamReader input = new InputStreamReader(new FileInputStream("foo.txt"));
The first point is that neither of those code samples read a file. This may sound fatuous or incorrect, but it is true. What they actually do is open a file for reading. And in terms of what they actually do, there's probably not a huge difference in their respective efficiency.
When it comes to actually reading the file, the best approach to use will depend on what the file contains, what form the data has to be in for your in-memory algorithms, etc. This will determine whether it is better to use Scanner or a raw Reader, from a performance perspective and more importantly from the perspective of making your code reliable and maintainable.
Finally, the chances are that this won't make a significant difference to the overall performance of your code. What I'm saying is that you are optimizing your application prematurely. You are better off ignoring performance for now and choosing the version that will make the rest of your code simpler. When the application is working, profile it with some representative input data. The profiling will tell you how much time is spent reading the file, in absolute terms and relative to the rest of the application. This will tell you whether it is worth the effort to try to optimize the file reading.
The only bit of performance advice I'd give is that character by character reading from an unbuffered input stream or reader is inefficient. If the file needs to be read that way, you should add a BufferedReader to the stack.
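For illustration, a minimal sketch of that stack (the `countChars` helper and the sample input are made up for the demo; in real code the `Reader` would wrap a `FileInputStream`):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BufferedCharRead {
    // Read one character at a time, but through a BufferedReader so each
    // read() is served from an in-memory buffer instead of a system call.
    static long countChars(Reader raw) throws IOException {
        try (BufferedReader in = new BufferedReader(raw)) {
            long n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // In real code the argument would be something like
        // new InputStreamReader(new FileInputStream("foo.txt"))
        System.out.println(countChars(new StringReader("hello"))); // 5
    }
}
```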
In terms of performance, Scanner is definitely the slower one, at least in my experience. It's made for parsing, not for reading huge blocks of data. InputStreamReader, with a large enough buffer, can perform on par with BufferedReader, which in my own comparison was a few times faster than Scanner when reading from a dictionary list.
One difference, and the principal one, I guess, is that with BufferedReader/InputStreamReader you can read the whole document character by character if you want. With Scanner this is not possible, which means InputStreamReader gives you more control over the content of the document.
Related
In a Spring Boot app, I am reading CSV file data using OpenCSV, and it is possible to use either a FileReader or a BufferedReader with it. However, when I compare the two, I have a dilemma over the following point:
BufferedReader is faster than FileReader, but it uses much more memory.
As I am reading multiple data files (each having hundreds of thousands of records) in the same method (first I read data from one CSV and then use the retrieved id fields to read the second CSV), I think I shouldn't use BufferedReader, to keep memory usage lower. But I am really not sure what the most proper way is.
So, in this situation, should I prefer FileReader to BufferedReader?
Generally speaking, it depends on your constraints. If performance is an issue, allocate more resources and go for the faster solution. If memory is an issue, do the reverse.
With BufferedReader you can also use the two-argument constructor, which takes a Reader and an int, to set a buffer size that suits your needs.
BufferedReader reader = new BufferedReader(fileReader, bufferSize);
Another general rule of thumb, don't do premature optimizations, be it memory or performance. Strive for clean code, if a problem arises, use a profiler to identify the bottlenecks and then deal with them.
As far as I know, the difference in size lies simply in the buffer size, which by default is 8k or 16k, so the difference in memory isn't huge. The most important thing is to remember to free the resources when you don't use them anymore by calling close(), and to remember to do that even in case of exceptions.
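A sketch of that closing discipline using try-with-resources, which guarantees close() runs even when an exception is thrown (the `countLines` helper and the sample input are invented for the demo; with a real file the source would be a `FileReader`):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CloseDemo {
    // try-with-resources closes the reader automatically, even if an
    // exception is thrown inside the block.
    static int countLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source, 8192)) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } // reader.close() is called here in all cases
    }

    public static void main(String[] args) throws IOException {
        // In real code: countLines(new FileReader("data.csv"))
        System.out.println(countLines(new StringReader("a,b\nc,d\n"))); // 2
    }
}
```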
So if there is a java.io.StringBufferInputStream, you would think that there would be a StringBufferOutputStream.
Any ideas as to why there isn't??
Likewise, there is also a SequenceInputStream but no SequenceOutputStream.
My guess is that someone never got around to making a StringBufferOutputStream in Java 1.0 since the product was somewhat "rushed to market." By the time Java 1.1 rolled around and people actually understood that readers and writers were for characters, and inputstreams and outputstreams were for bytes, the whole concept of using streams for strings was realized to be wrong, so the StringBufferInputStream was rightly deprecated, with no chance ever of a partner coming along.
A SequenceInputStream is a nice way to read from a bunch of streams all concatenated together, but it doesn't make much sense to write a single stream to multiple streams. Well, I suppose you could make sense of this if you wanted to write a large stream into multiple partitions (reminds me of Hadoop here). It's just not common enough to be in a standard library. A complication here would be that you would need to specify the size of each output partition and would really only make sense for files (which can have names with increasing suffixes, perhaps), and so would not generalize into arbitrary output streams in a nice manner.
StringBufferInputStream is deprecated, because bytes and characters are not the same thing. The correct classes to use for this are StringReader and StringWriter.
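A small sketch of those character-based replacements (the `shout` helper is just for the demo):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

public class StringReaderWriterDemo {
    // StringReader presents a String as a character stream;
    // StringWriter collects characters back into a String.
    static String shout(String s) throws IOException {
        StringReader in = new StringReader(s);
        StringWriter out = new StringWriter();
        int c;
        while ((c = in.read()) != -1) {
            out.write(Character.toUpperCase(c));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(shout("hello")); // HELLO
    }
}
```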
If you think about it, there is no way to make a SequenceOutputStream work. SequenceInputStream reads from the first stream until it is exhausted, then reads from the next stream. Since an OutputStream is never exhausted (unless, say, it happens to be connected to a socket whose peer closes the connection), how would a SequenceOutputStream class know when to move on to the next stream?
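A minimal sketch of that exhaustion behaviour (the `concat` helper is invented for the demo):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class SequenceDemo {
    // SequenceInputStream reads the first stream until it returns -1,
    // then silently continues with the second -- exactly the "exhausted"
    // signal that an OutputStream never produces.
    static String concat(byte[] a, byte[] b) throws IOException {
        InputStream in = new SequenceInputStream(
                new ByteArrayInputStream(a), new ByteArrayInputStream(b));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(concat("foo".getBytes(), "bar".getBytes())); // foobar
    }
}
```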
StringBufferInputStream has long been deprecated.
Use StringReader and StringWriter.
I am attempting to do a problem on Interview Street, my question is not related to the algorithm but to Java. For the challenge there is the need to take a somewhat large number of lines of input (several hundred thousand) from System.in. Each line has an expected pattern of two or three tokens so there is no need to do any validation or parsing (making Scanner ineffective). My own algorithm is correct and accounts for a very small portion of the overall run time (range of 5%-20% depending on the edge case).
Doing some research and testing I found that for this problem that the BufferedReader class is significantly faster than the Scanner class for getting the inputted data for this problem. However BufferedReader is still not quick enough for the purposes of the challenge. Could anyone point me to an article or API where I could research a better way of taking input?
In case it is important, I am using BufferedReader by calling its readLine() method and then String's split() method to separate the tokens.
Without any useful information, the best I can do is provide a generalized answer: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
I can think of a few things (off the top of my head):
try to create your own reader, or even forget about converting to characters if it is not needed
read in whole blocks, not just lines
try to optimize the buffer size
walk through the chars or bytes yourself, trying to find the tokens
optimize the compiler output
pre-compile your classes for a fast startup
use a profiler to check for slow spots in your code
use your brain and think outside the box
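To illustrate the "read whole blocks and walk the bytes yourself" idea, here is a rough sketch. It assumes plain ASCII input, and the `tokens` helper and whitespace rules are made up for the demo, so treat it as a starting point rather than a tuned solution:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class FastTokenizer {
    // Read the stream in large chunks and split on whitespace ourselves,
    // without creating intermediate line Strings.
    static List<String> tokens(InputStream in) throws IOException {
        List<String> out = new ArrayList<>();
        byte[] buf = new byte[1 << 16];        // 64 KB block reads
        StringBuilder cur = new StringBuilder();
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = (char) buf[i];        // assumes ASCII input
                if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                    if (cur.length() > 0) {
                        out.add(cur.toString());
                        cur.setLength(0);
                    }
                } else {
                    cur.append(c);
                }
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("12 34\n56 78 9\n".getBytes());
        System.out.println(tokens(in)); // [12, 34, 56, 78, 9]
    }
}
```

In a real solution you would pass System.in here, and probably parse the digits into ints directly in the inner loop instead of building Strings at all.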
BufferedDataInputStream is supposed to be way faster than BufferedReader.
You can find the jar here: http://www.jarvana.com/jarvana/view/org/apache/tika/tika-app/0.8/tika-app-0.8.jar!/nom/tam/util/BufferedDataInputStream.class?classDetails=ok.
The javadoc http://skyview.gsfc.nasa.gov/jar/javadocs/nom/tam/util/BufferedDataInputStream.html.
This is part of this project http://skyview.gsfc.nasa.gov/help/skyviewjava.html.
Note that I have never tested this...
I am reading from an InputStream and writing what I read into an OutputStream. I also check a few things along the way: for example, if I read an & (ampersand), I need to write the entity &amp; instead.
My code works, but now I wonder if I have written it in the most efficient way (which I doubt). I read byte by byte, because I need to make odd modifications as I go.
Can somebody who's done this suggest the fastest way?
Thanks
If you are using BufferedInputStream and BufferedOutputStream then it is hard to make it faster.
BTW if you are processing the input as characters as opposed to bytes, you should use readers/writers with BufferedReader and BufferedWriter.
The code should be reading/writing characters with Readers and Writers. For example, if it is in the middle of a UTF-8 sequence, or it gets the second half of a UCS-2 character that happens to have the same byte value as an ampersand, then it is going to damage the data that it is attempting to copy. Code usually lives longer than you would expect it to, and somebody might pick it up later and use it in a situation where this could really matter.
As far as being faster or slower, using a BufferedReader will probably help the most. If you're writing to the file system, a BufferedWriter won't make much of a difference, because the operating system will buffer writes for you and it does a good job. If you're writing to a StringWriter, then buffering will make no difference (may even make it slower), but otherwise buffering your writes ought to help.
You could rewrite it to process arrays, and that might make it faster, but you will have to write more complicated code to handle boundary conditions. That also needs to be a factor in the decision.
Measure, don't guess, and be wary of opinions from people who aren't informed of all the details. Ultimately, it's up to you to figure out if it's fast enough for this situation. There is no single answer, because all situations are different.
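For what an array-based version with the ampersand rewrite might look like, here is a rough sketch. The `copyEscaping` helper is invented for the demo, and note that by working on chars rather than bytes it also side-steps the multi-byte encoding problem mentioned above:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class EscapeCopy {
    // Copy reader to writer in array-sized chunks, replacing each
    // '&' with the entity "&amp;".
    static void copyEscaping(Reader in, Writer out) throws IOException {
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            int start = 0;
            for (int i = 0; i < n; i++) {
                if (buf[i] == '&') {
                    out.write(buf, start, i - start); // flush up to the '&'
                    out.write("&amp;");
                    start = i + 1;
                }
            }
            out.write(buf, start, n - start);         // flush the tail
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copyEscaping(new StringReader("a & b"), out);
        System.out.println(out); // a &amp; b
    }
}
```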
I would prefer to use BufferedReader for reading input and BufferedWriter for output. Using regular expressions to match your input can make your code shorter and may also speed it up.
I have a BufferedReader looping through a file. When I hit a specific case, I would like to continue looping using a different instance of the reader but starting at this point.
Any ideas for a recommended solution? Create a separate reader, use the mark function, etc.?
While waiting for your answer to my comment, I'm stuck with making assumptions.
If it's the linewise input you value, you may be as pleasantly surprised as I was to discover that RandomAccessFile now (since 1.4 or 1.5) supports the readLine method. Of course RandomAccessFile gives you fine-grained control over position.
If you want buffered I/O, you may consider wrapping a reader around a CharBuffer, or maybe a ByteBuffer wrapped around a file mapped using the NIO API. This gives you the ability to treat a file as memory, with fine control of the read pointer. And because the data is all in memory, buffering comes free of charge.
Have you looked at BufferedReader's mark method? Used in conjunction with reset, it might meet your needs.
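A quick sketch of mark/reset in action (the mark limit of 64K here is arbitrary; reset() fails if you read further than that past the mark):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        BufferedReader in =
                new BufferedReader(new StringReader("line1\nline2\nline3\n"));
        System.out.println(in.readLine()); // line1
        in.mark(1 << 16);                  // remember this position; the argument
                                           // is how far we may read ahead before
                                           // the mark becomes invalid
        System.out.println(in.readLine()); // line2
        in.reset();                        // jump back to the mark
        System.out.println(in.readLine()); // line2 again
    }
}
```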
If you keep track of how many characters you've read so far, you can create a new BufferedReader and use skip.
As Noel has pointed out, you would need to avoid using BufferedReader.readLine(), since readLine() will discard newlines and make your character count inaccurate. You probably shouldn't count on readLine() never getting called if anyone else will ever have to maintain your code.
If you do decide to use skip, you should write your own buffered Reader which will give you an offset counting the newlines.
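A bare-bones sketch of the skip approach, counting characters with read() rather than readLine() (the in-memory data and the resume condition are invented; with a file, the second reader would be a fresh FileReader over the same path):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class SkipRestart {
    public static void main(String[] args) throws IOException {
        String data = "abcdefghij";

        // First pass: count every character consumed until we hit
        // the "specific case" where we want to switch readers.
        long consumed = 0;
        BufferedReader first = new BufferedReader(new StringReader(data));
        int c;
        while ((c = first.read()) != -1) {
            consumed++;
            if (c == 'e') break;          // the point we want to resume after
        }

        // Second pass: a fresh reader skipped to the recorded offset.
        BufferedReader second = new BufferedReader(new StringReader(data));
        second.skip(consumed);
        System.out.println((char) second.read()); // f
    }
}
```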