Is java.util.Scanner that slow? - java

In an Android application I want to use the Scanner class to read a list of floats from a text file (it's a list of vertex coordinates for OpenGL). The exact code is:
Scanner in = new Scanner(new BufferedInputStream(getAssets().open("vertexes.off")));
final float[] vertexes = new float[nrVertexes];
for (int i = 0; i < nrVertexFloats; i++) {
    vertexes[i] = in.nextFloat();
}
It seems, however, that this is incredibly slow (it took 30 minutes to read 10,000 floats!), as tested on the 2.1 emulator. What's going on?
I don't remember Scanner being that slow when I used it on the PC (truth be told, I never read more than 100 values before). Or is it something else, like reading from the asset input stream?
Thanks for the help!

As other posters have stated, it's more efficient to include the data in a binary format. However, for a quick fix I've found that replacing:
scanner.nextFloat();
with
Float.parseFloat(scanner.next());
is almost 7 times faster.
The source of the performance issue with nextFloat() is that it uses a regular expression to search for the next float, which is unnecessary if you know the structure of the data you're reading beforehand.
It turns out that most (if not all) of the next* methods use regular expressions for the same reason, so if you know the structure of your data it's preferable to always call next() and parse the result yourself, i.e. also use Double.parseDouble(scanner.next()) and Integer.parseInt(scanner.next()).
Relevant source:
https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/java/util/Scanner.java
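Applied to the question's loop, the fix looks roughly like this (a minimal sketch, assuming the asset contains nothing but whitespace-separated floats):

Scanner in = new Scanner(new BufferedInputStream(getAssets().open("vertexes.off")));
final float[] vertexes = new float[nrVertexFloats];
for (int i = 0; i < nrVertexFloats; i++) {
    // next() only grabs the next whitespace-delimited token; the expensive
    // regex-driven float matching inside nextFloat() is skipped entirely.
    vertexes[i] = Float.parseFloat(in.next());
}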

I don't know about Android, but at least in Java SE, Scanner is slow.
Internally, Scanner does UTF-8 conversion, which is useless for a file of floats.
Since all you want to do is read floats from a file, you should go with the java.io package.
The folks on SPOJ struggle with I/O speed. It is a Polish programming contest site with very hard problems. What sets it apart is that it accepts a wider range of programming languages than other sites, and in many of its problems the input is so large that if you don't write efficient I/O, your program will exceed the time limit.
Of course, I advise against writing your own float parser, but if you need speed, that's still an option.

For the Spotify Challenge they wrote a small Java utility for faster input parsing: http://spc10.contest.scrool.se/doc/javaio The utility is called Kattio.java and uses BufferedReader, StringTokenizer and Integer.parseInt/Double.parseDouble/Long.parseLong to read numeric values.
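Not the actual Kattio source, but a minimal sketch of the same pattern (BufferedReader plus StringTokenizer plus the parse* methods); the class name and the assumption of whitespace-separated values are mine:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

class FastTokenReader {
    private final BufferedReader reader;
    private StringTokenizer tokenizer = new StringTokenizer("");

    FastTokenReader(InputStream in) {
        reader = new BufferedReader(new InputStreamReader(in));
    }

    // Returns the next whitespace-separated token, refilling from the reader as needed.
    // (No end-of-input handling here; a real version would check readLine() for null.)
    String next() throws IOException {
        while (!tokenizer.hasMoreTokens()) {
            tokenizer = new StringTokenizer(reader.readLine());
        }
        return tokenizer.nextToken();
    }

    int nextInt() throws IOException { return Integer.parseInt(next()); }
    long nextLong() throws IOException { return Long.parseLong(next()); }
    double nextDouble() throws IOException { return Double.parseDouble(next()); }
}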

Very insightful post. When I worked with Java on the PC I always assumed Scanner was fast. When I tried to use it in an AsyncTask on Android, it performed terribly.
I think Android needs to come up with an alternative to Scanner. I was using scanner.nextFloat(), scanner.nextDouble() and scanner.nextInt() all together, which made my life miserable. After tracing my app, I found that this hidden culprit was the cause.
I changed to Float.parseFloat(scanner.next()), and similarly Double.parseDouble(scanner.next()) and Integer.parseInt(scanner.next()), which certainly made my app quite a bit faster, maybe 60% faster.
If anyone has experienced the same, please post here. I'm also looking for an alternative to the Scanner API; anyone with bright ideas on reading file formats, please post them here.

Yes, I'm not seeing anything like this. I can read about 10M floats this way in 4 seconds on the desktop, and the emulator just can't be that much slower.
I'm trying to think of other explanations: is it perhaps blocking while reading the input stream from getAssets()? I would try reading that resource fully, timing that, and then seeing how much additional time is taken to scan.
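One way to separate the two costs (a rough sketch to run inside the Activity; the asset name, buffer size and timing mechanism are just illustrative):

// Time the raw asset read on its own ...
long t0 = System.currentTimeMillis();
InputStream raw = getAssets().open("vertexes.off");
ByteArrayOutputStream copy = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
while ((n = raw.read(chunk)) != -1) {
    copy.write(chunk, 0, n);
}
raw.close();
long readMillis = System.currentTimeMillis() - t0;

// ... then time scanning from the in-memory copy.
t0 = System.currentTimeMillis();
Scanner in = new Scanner(new ByteArrayInputStream(copy.toByteArray()));
while (in.hasNextFloat()) {
    in.nextFloat();
}
long scanMillis = System.currentTimeMillis() - t0;
// If scanMillis dominates, Scanner itself is the bottleneck rather than the asset stream.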

Scanner may be part of the problem, but you need to profile your code to know. Alternatives may be faster. Here is a simple benchmark comparing Scanner and StreamTokenizer.
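A rough sketch of such a benchmark (the file name is a placeholder for any large file of whitespace-separated floats; absolute timings will vary by platform):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.Scanner;

public class FloatReadBenchmark {
    public static void main(String[] args) throws IOException {
        String file = "floats.txt";

        long t0 = System.currentTimeMillis();
        Scanner scanner = new Scanner(new BufferedReader(new FileReader(file)));
        double scannerSum = 0;
        while (scanner.hasNextFloat()) {
            scannerSum += scanner.nextFloat();
        }
        scanner.close();
        System.out.println("Scanner:         " + (System.currentTimeMillis() - t0) + " ms");

        t0 = System.currentTimeMillis();
        StreamTokenizer tok = new StreamTokenizer(new BufferedReader(new FileReader(file)));
        double tokSum = 0;
        while (tok.nextToken() != StreamTokenizer.TT_EOF) {
            if (tok.ttype == StreamTokenizer.TT_NUMBER) {
                tokSum += tok.nval; // StreamTokenizer parses numbers itself (no exponent support)
            }
        }
        System.out.println("StreamTokenizer: " + (System.currentTimeMillis() - t0) + " ms");
    }
}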

I had exactly the same problem. It took 10 minutes to read my 18 KB file. In the end I wrote a desktop application that converts those human-readable numbers into a machine-readable binary format, using DataOutputStream.
The result was astonishing.
Btw, when I traced it, most of the Scanner method calls involved regular expressions, whose implementation is provided by the com.ibm.icu.** packages (the IBM ICU project). It's really overkill.
The same goes for String.format. Avoid it in Android!
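A minimal sketch of that kind of conversion (the file names and passing the value count to the reader are my own assumptions, not the poster's exact format):

import java.io.*;
import java.util.Scanner;

public class FloatConverter {
    // Desktop side: convert whitespace-separated text floats into a compact binary file.
    public static void convert(String textFile, String binFile) throws IOException {
        Scanner in = new Scanner(new BufferedReader(new FileReader(textFile)));
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(binFile)));
        while (in.hasNextFloat()) {
            out.writeFloat(in.nextFloat()); // 4 bytes per value, nothing to parse later
        }
        in.close();
        out.close();
    }

    // Device side: read the binary data back; count must be known (or stored in a header).
    public static float[] read(InputStream stream, int count) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(stream));
        float[] values = new float[count];
        for (int i = 0; i < count; i++) {
            values[i] = in.readFloat();
        }
        in.close();
        return values;
    }
}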

Related

What are some real use cases for the methods skip and reset in BufferedReader?

I'm trying to find out what the methods mark() and reset() of BufferedReader are really useful for.
I understand what they do, but I have never used them for going back and forth in some text; usually I solve this problem by reading either a sequence of chars or the whole line into an array or StringBuilder and going back and forth through that.
I believe there must be some reason why these methods are present in BufferedReader and the other Reader implementations that support them, but I'm unable to guess what it is.
Does the use of mark() and reset() provide some benefit compared to reading the data into our own array and navigating through it?
I've searched through the codebase of one of the large projects I'm working on (mainly a Java backend using Spring Boot), with lots of dependencies on the classpath, and the only thing the mark & reset methods were used for (and only in a very few libraries) was skipping an optional BOM character at the beginning of a text file. Even for this simple use case, I find it a bit contrived to do it that way.
I also searched other tutorials and Stack Overflow (e.g. What are mark and reset in BufferedReader?) and couldn't find any explanation of why to actually solve these kinds of problems with mark & reset. All the code examples only show what the methods do on "hello world" examples (jumping from one position in the stream back to a previous one for no particular reason). Nowhere could I find an explanation of why someone would actually use them over other approaches that sound more elegant and aren't really any worse in performance.
I haven't used them myself, but a case that springs to mind is where you want to copy the data into a structure that needs to be sized correctly.
When reading streams and copying data into a target data structure (perhaps after parsing it), you always have the problem that you don't know how big to make your target in advance. The mark/rewind feature lets you mark, read the stream, parse it quickly to calculate the size, reset, allocate the memory, and then re-parse copying the data this time. There are of course other ways of doing it (e.g., using your own dynamic buffer), but if your code is already centered around the Reader concept then mark/reset lets you stay with that.
That said, even BufferedReader's own readLine method doesn't use this technique (it creates a StringBuffer internally).
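A small sketch of that two-pass idea (the data, the read-ahead limit and the token format are assumptions; mark() only works if the limit covers everything read before reset()):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkResetSizing {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("1.0 2.5\n3.75 4.25 5.0\n"));

        // First pass: mark, count the values, then rewind to the mark.
        reader.mark(64 * 1024);
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                count += line.split("\\s+").length;
            }
        }
        reader.reset();

        // Second pass: the target array can now be sized exactly before copying.
        float[] values = new float[count];
        int i = 0;
        while ((line = reader.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    values[i++] = Float.parseFloat(token);
                }
            }
        }
        System.out.println(i + " values read into an exactly-sized array");
    }
}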

NegativeArraySizeException ANTLRv4

I have a 10 GB file that I need to parse in Java, but the following error arises when I attempt to do so.
java.lang.NegativeArraySizeException
at java.util.Arrays.copyOf(Arrays.java:2894)
at org.antlr.v4.runtime.ANTLRInputStream.load(ANTLRInputStream.java:123)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:86)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:82)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:90)
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
It looks like ANTLR v4 has a pervasive hard-wired limitation that the input stream size must be less than 2^31 characters. Removing this limitation would not be a small task.
Take a look at the source code for the ANTLRInputStream class - here.
As you can see, it attempts to hold the entire stream contents in a single char[]. That ain't going to work ... for huge input files. But simply fixing that by buffering the data in a larger data structure isn't going to be the answer either. If you look further down the file, there are a number of other methods that use int as the type for indexing the stream. They would need to be changed to use long ... and the changes will ripple out.
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
Two approaches spring to mind:
Create your own version of ANTLR that supports large input files. This is a non-trivial project. I expect that the 32 bit assumption reaches into the code that ANTLR generates, etc.
Split your input files into smaller files before you attempt to parse them. Whether this is viable depends on the input syntax.
My recommendation would be the 2nd alternative. The problem with "supporting" huge input files (by in-memory buffering) is that it is going to be inefficient and memory wasteful ... and it ultimately doesn't scale.
You could also create an issue here, or ask on antlr-discussion.
I never stumbled upon this error myself, but I guess your array gets too big and its index overflows (i.e., the integer wraps around and becomes negative). Use another data structure, and most importantly, don't load all of the file at once (use lazy loading instead; that is, load only those parts that are being accessed).
I hope this will help: http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You might want to use some kind of buffer to read big files.
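A rough sketch of that kind of streaming read (the file name is a placeholder; the point is that only one buffered chunk is in memory at a time, never the whole 10 GB):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        // 1 MB buffer; tune to taste. Only this buffer plus the current line live in memory.
        BufferedReader reader = new BufferedReader(new FileReader("huge-input.txt"), 1 << 20);
        String line;
        long lines = 0;
        while ((line = reader.readLine()) != null) {
            // Parse or otherwise process each line here instead of loading the whole file.
            lines++;
        }
        reader.close();
        System.out.println("Processed " + lines + " lines");
    }
}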

The Efficiency of Hard-Coding vs. File Input

I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on the way you fashion your custom data structure to contain your model.
The question, IMHO, is whether reading the file takes long because of I/O or because of computing time (i.e. CPU). If the latter is the case, then tough luck. If your I/O (e.g. hard disk) is the cause, then you can compress the file and decompress it after/while reading. There is (of course) ZIP support in Java (even for streams).
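A small sketch of that stream-level compression (the use of DataOutputStream for the payload and the length header are my own assumptions about the model's layout):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedModelIO {
    public static void write(float[] values, File file) throws IOException {
        DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)));
        out.writeInt(values.length);
        for (float v : values) {
            out.writeFloat(v); // compressed transparently on the way to disk
        }
        out.close();
    }

    public static float[] read(File file) throws IOException {
        DataInputStream in = new DataInputStream(
                new GZIPInputStream(new FileInputStream(file)));
        float[] values = new float[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readFloat();
        }
        in.close();
        return values;
    }
}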
I agree with the answer given above to use a binary input format. Let's try optimising that first. Can you provide some information? ...or have you googled working with binary data? ...buffering it? etc.?
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful of premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything working first and then use a profiler to optimise the really slow sections of the application.

Taking Java Input Quickly

I am attempting to do a problem on Interview Street; my question is not related to the algorithm but to Java. The challenge requires taking a somewhat large number of lines of input (several hundred thousand) from System.in. Each line follows an expected pattern of two or three tokens, so there is no need for any validation or pattern matching (which makes Scanner's extra machinery wasted effort). My own algorithm is correct and accounts for a very small portion of the overall run time (in the range of 5%-20%, depending on the edge case).
Doing some research and testing, I found that for this problem the BufferedReader class is significantly faster than the Scanner class at getting the input data. However, BufferedReader is still not quick enough for the purposes of the challenge. Could anyone point me to an article or API where I could research a better way of taking input?
If it matters, I am using BufferedReader by calling the readLine() method and String's split() method to separate the tokens.
Without any useful information, the best I can do is provide a generalized answer: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
I can think of a few things (off the top of my head):
try to create your own reader, or even skip converting bytes to characters entirely if it is not needed;
read in whole blocks, not just lines;
try to optimize the buffer size;
walk through the chars or bytes yourself to find the tokens (see the sketch after this list);
optimize the compiler output;
pre-compile your classes for a fast startup;
use a profiler to check for slow spots in your code.
Use your brain and think outside the box.
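A rough sketch of the read-whole-blocks-and-walk-the-bytes-yourself idea from the list above (the buffer size and the assumption of non-negative, space-separated integers are purely illustrative):

import java.io.IOException;
import java.io.InputStream;

public class RawIntReader {
    private final InputStream in;
    private final byte[] buf = new byte[1 << 16]; // read whole blocks, not lines
    private int len = 0, pos = 0;

    public RawIntReader(InputStream in) {
        this.in = in;
    }

    private int readByte() throws IOException {
        if (pos == len) {
            len = in.read(buf);
            pos = 0;
            if (len <= 0) return -1;
        }
        return buf[pos++];
    }

    // Walks the bytes directly: skips non-digits, then accumulates the next integer.
    // Returns -1 at end of input.
    public int nextInt() throws IOException {
        int c;
        do {
            c = readByte();
            if (c == -1) return -1;
        } while (c < '0' || c > '9');
        int value = 0;
        while (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            c = readByte();
        }
        return value;
    }
}

Typical use would be something like new RawIntReader(System.in) followed by repeated nextInt() calls; no character decoding or regular expressions are involved.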
BufferedDataInputStream is supposed to be way faster than BufferedReader.
You can find the jar here: http://www.jarvana.com/jarvana/view/org/apache/tika/tika-app/0.8/tika-app-0.8.jar!/nom/tam/util/BufferedDataInputStream.class?classDetails=ok.
The javadoc http://skyview.gsfc.nasa.gov/jar/javadocs/nom/tam/util/BufferedDataInputStream.html.
This is part of this project http://skyview.gsfc.nasa.gov/help/skyviewjava.html.
Note that I have never tested this...

Java - Creating a string by selecting certain numbers from within a text file

I have a .txt file called numbers that has the numbers 1-31 in it. I've set up a scanner like this:
cardCreator = new Scanner(new File("numbers"));
After this, I'm kind of confused on which methods I could use.
I want to set up a conditional statement that will go through numbers, evaluate each line (or each of the numbers 1-31), and include each number in a String called numbersonCard if it meets certain criteria.
I can kind of picture how this would work (maybe using hasNext() or nextLine() or something), but I'm still a little overwhelmed by the API...
Any suggestions as to which methods I might use?
Computer science is all about decomposition. Break your problem up into smaller ones.
First things first: can you read that file and echo it back?
Next, can you break each line into its number pieces?
Then, can you process each number piece as you wish?
You'll make faster progress if you can break the problem down.
I think Java Almanac is one of the best sources of example snippets around. Maybe it can help you with some of the basics.
Building upon duffymo's excellent advice, you might want to break your problem down into high-level pseudocode. The great thing about this is, you can actually write these steps as comments in your code, and then solve each bit as a separate problem. Better still, if you break down the problem right, you can actually place each piece into a separate method.
For example:
// Open the file
// Read in all the numbers
// For each number, does it meet our criteria?
// If so, add it to the list
Now you can address each part of the problem in a somewhat isolated fashion. Furthermore, you can start to see that you might break this down into methods which can open the file (and deal with any errors thrown by the API), read in all the numbers from the file, determine whether a given number meets your criteria, etc. A little clever method naming, and your code will literally read like the pseudocode, making it more maintainable in the future.
To give a little more specific help than the excellent answers duffymo and Rob have given, your instincts are right. You probably want something like this:
cardCreator = new Scanner(new File("numbers"));
while (cardCreator.hasNextInt()) {
    int number = cardCreator.nextInt();
    // do something with number
}
cardCreator.close();
hasNextInt() and nextInt() will save you from getting a String out of the Scanner and having to parse it yourself. I'm not sure whether Scanner's default delimiter will treat a Windows end-of-line CRLF as one delimiter or two, though.
