I am attempting a problem on Interview Street; my question is not about the algorithm but about Java. The challenge requires reading a somewhat large number of lines of input (several hundred thousand) from System.in. Each line follows an expected pattern of two or three tokens, so there is no need for any validation or parsing (which makes Scanner's features unnecessary). My algorithm is correct and accounts for a very small portion of the overall run time (in the range of 5%-20%, depending on the edge case).
Doing some research and testing, I found that BufferedReader is significantly faster than Scanner for reading this input. However, BufferedReader is still not quick enough for the purposes of the challenge. Could anyone point me to an article or API where I could research a better way of taking input?
If it is important: I am using BufferedReader by calling readLine() and then String.split() to separate the tokens.
Without more detailed information, the best I can do is provide a generalized answer: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
I can think of a few things (off the top of my head); a sketch combining the first few follows the list:
- Try to create your own reader, or even skip converting bytes to characters if it isn't needed.
- Read in whole blocks, not just lines.
- Try to optimize the buffer size.
- Walk through the chars or bytes yourself, looking for the tokens.
- Optimize the compiler output.
- Pre-compile your classes for a fast startup.
- Use a profiler to find the slow spots in your code.
- Use your brain and think outside the box.
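For instance, the first four suggestions combined might look something like this minimal sketch. It assumes the input is ASCII and consists of whitespace-separated non-negative integers; the class name and parsing logic are illustrative, so adapt them to your actual token pattern:

import java.io.IOException;
import java.io.InputStream;

public final class FastReader {
    private final InputStream in = System.in;
    private final byte[] buf = new byte[1 << 16];  // a buffer size worth tuning
    private int len = 0, pos = 0;

    private int readByte() throws IOException {
        if (pos == len) {                          // refill the buffer in one block
            len = in.read(buf, 0, buf.length);
            pos = 0;
            if (len <= 0) return -1;               // end of input
        }
        return buf[pos++];
    }

    public int nextInt() throws IOException {
        int c;
        do { c = readByte(); } while (c == ' ' || c == '\n' || c == '\r');
        int value = 0;
        while (c >= '0' && c <= '9') {             // walk the digit bytes directly
            value = value * 10 + (c - '0');
            c = readByte();
        }
        return value;
    }
}

No character conversion, no String allocation per token, and the whole input is consumed in 64 KB reads.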
BufferedDataInputStream is supposed to be way faster than BufferedReader.
You can find the jar here: http://www.jarvana.com/jarvana/view/org/apache/tika/tika-app/0.8/tika-app-0.8.jar!/nom/tam/util/BufferedDataInputStream.class?classDetails=ok.
The javadoc is here: http://skyview.gsfc.nasa.gov/jar/javadocs/nom/tam/util/BufferedDataInputStream.html.
It is part of this project: http://skyview.gsfc.nasa.gov/help/skyviewjava.html.
Note that I have never tested this...
Related
I'm trying to find out what the mark() and reset() methods of BufferedReader are really useful for.
I understand what they do, but I've never needed them for moving back and forth in some text; usually I solve that by reading a sequence of chars, or the whole line, into an array or StringBuilder and moving back and forth through that.
I believe there must be some reason why these methods are present in BufferedReader and the other Reader implementations that support them, but I'm unable to come up with one.
Does using mark() and reset() provide some benefit compared to reading the data into our own array and navigating through it?
I've searched through the codebase of one of the large projects I'm working on (mainly a Java backend using Spring Boot) with lots of dependencies on the classpath, and the only thing mark() and reset() were used for (and in only very few libraries) was skipping an optional BOM character at the beginning of a text file. Even for that simple use case, I find doing it that way a bit contrived.
I've also searched other tutorials and Stack Overflow (e.g., What are mark and reset in BufferedReader?) and couldn't find any explanation of why to actually solve these kinds of problems using mark() and reset(). All the code examples merely show what the methods do in "hello world" fashion (jumping from one position in the stream back to a previous one for no particular reason). Nowhere could I find an explanation of why someone should actually prefer them over other approaches that read more elegantly and aren't really any worse in performance.
I haven't used them myself, but a case that springs to mind is where you want to copy the data into a structure that needs to be sized correctly.
When reading streams and copying data into a target data structure (perhaps after parsing it), you always have the problem that you don't know how big to make the target in advance. The mark/reset feature lets you mark, read the stream, parse it quickly to calculate the size, reset, allocate the memory, and then re-parse, copying the data this time (a sketch follows below). There are of course other ways of doing it (e.g., using your own dynamic buffer), but if your code is already centered around the Reader concept, mark/reset lets you stay with it.
That said, even BufferedReader's own readLine method doesn't use this technique (it creates a StringBuffer internally).
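As a minimal illustration, here is a two-pass sketch assuming line-oriented data that fits within the mark's read-ahead limit (the limit value and method name are my own, not from any library):

import java.io.BufferedReader;
import java.io.IOException;

static String[] readAllLines(BufferedReader reader) throws IOException {
    reader.mark(1 << 20);                        // read-ahead limit must cover everything we re-read
    int count = 0;
    while (reader.readLine() != null) count++;   // first pass: learn the required size
    reader.reset();                              // rewind to the marked position
    String[] lines = new String[count];
    for (int i = 0; i < count; i++) {
        lines[i] = reader.readLine();            // second pass: copy into the exactly-sized array
    }
    return lines;
}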
I have a 10 GB file that I need to parse in Java, and the following error arises when I attempt to do this:
java.lang.NegativeArraySizeException
at java.util.Arrays.copyOf(Arrays.java:2894)
at org.antlr.v4.runtime.ANTLRInputStream.load(ANTLRInputStream.java:123)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:86)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:82)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:90)
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
It looks like ANTLR v4 has a pervasive hard-wired limitation that the input stream size must be less than 2^31 characters. Removing this limitation would not be a small task.
Take a look at the source code for the ANTLRInputStream class - here.
As you can see, it attempts to hold the entire stream contents in a single char[]. That ain't going to work ... for huge input files. But simply fixing that by buffering the data in a larger data structure isn't going to be the answer either. If you look further down the file, there are a number of other methods that use int as the type for indexing the stream. They would need to be changed to use long ... and the changes would ripple out.
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
Two approaches spring to mind:
1. Create your own version of ANTLR that supports large input files. This is a non-trivial project. I expect that the 32-bit assumption reaches into the code that ANTLR generates, etc.
2. Split your input files into smaller files before you attempt to parse them. Whether this is viable depends on the input syntax.
My recommendation would be the 2nd alternative. The problem with "supporting" huge input files via in-memory buffering is that it would be inefficient and memory-wasteful ... and it ultimately doesn't scale. A sketch of a splitter follows.
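For illustration, a rough sketch of the splitting approach, assuming the input syntax allows cutting at line boundaries (the method name and chunking policy are mine):

import java.io.*;

static void splitFile(File input, int maxLines) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(input))) {
        int part = 0;
        int lineCount = 0;
        PrintWriter out = new PrintWriter(new FileWriter(input.getPath() + ".part" + part));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);
            if (++lineCount == maxLines) {       // close this chunk, start the next
                out.close();
                part++;
                lineCount = 0;
                out = new PrintWriter(new FileWriter(input.getPath() + ".part" + part));
            }
        }
        out.close();
    }
}

Each .partN file can then be fed to the parser separately.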
You could also create an issue here, or ask on antlr-discussion.
I never stumbled upon this error, but I guess your array gets too big and its index overflows (i.e., the integer wraps around and becomes negative). Use another data structure, and most importantly, don't load the whole file at once; use lazy loading instead, meaning load only the parts that are being accessed.
I hope this helps: http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You will want some kind of buffering to read big files.
In an Android application I want to use the Scanner class to read a list of floats from a text file (it's a list of vertex coordinates for OpenGL). The exact code is:
Scanner in = new Scanner(new BufferedInputStream(getAssets().open("vertexes.off")));
final float[] vertexes = new float[nrVertexes];
for (int i = 0; i < nrVertexFloats; i++) {
    vertexes[i] = in.nextFloat();
}
It seems, however, that this is incredibly slow (it took 30 minutes to read 10,000 floats!), as tested on the 2.1 emulator. What's going on?
I don't remember Scanner being that slow when I used it on the PC (truth be told, I never read more than 100 values before). Or is it something else, like reading from an asset input stream?
Thanks for the help!
As other posters have stated, it's more efficient to include the data in a binary format. However, for a quick fix I've found that replacing:
scanner.nextFloat();
with
Float.parseFloat(scanner.next());
is almost 7 times faster.
The source of the performance issue with nextFloat() is that it uses a regular expression to search for the next float, which is unnecessary if you already know the structure of the data you're reading.
It turns out most (if not all) of the next* methods use regular expressions for a similar reason, so if you know the structure of your data it's preferable to always use next() and parse the result; i.e., also use Double.parseDouble(scanner.next()) and Integer.parseInt(scanner.next()).
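Applied to the code from the question, the quick fix might look like this (a sketch; nrVertexFloats and the asset name are taken from the question above):

Scanner in = new Scanner(new BufferedInputStream(getAssets().open("vertexes.off")));
final float[] vertexes = new float[nrVertexFloats];
for (int i = 0; i < nrVertexFloats; i++) {
    // next() only tokenizes; parsing ourselves skips nextFloat()'s regex search
    vertexes[i] = Float.parseFloat(in.next());
}
in.close();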
Relevant source:
https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/java/util/Scanner.java
Don't know about Android, but at least in JavaSE, Scanner is slow.
Internally, Scanner does UTF-8 conversion, which is useless in a file with floats.
Since all you want to do is read floats from a file, you should go with the java.io package; see the sketch below.
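A minimal java.io-based sketch, assuming whitespace-separated plain decimal floats (note that StreamTokenizer's built-in number parsing does not handle exponent notation, so this is not a drop-in replacement for every input):

import java.io.*;

static float[] readFloats(InputStream stream, int count) throws IOException {
    StreamTokenizer tok = new StreamTokenizer(
            new BufferedReader(new InputStreamReader(stream)));
    float[] out = new float[count];
    for (int i = 0; i < count; i++) {
        tok.nextToken();               // advances to the next number token
        out[i] = (float) tok.nval;     // nval holds the parsed numeric value
    }
    return out;
}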
The folks on SPOJ struggle with I/O speed. It is a Polish programming contest site with very hard problems. What sets it apart is that it accepts a wider array of programming languages than other sites, and in many of its problems the input is so large that if you don't write efficient I/O, your program will exceed the time limit.
Of course, I advise against writing your own float parser, but if you need speed, that's still a solution.
For the Spotify Challenge they wrote a small Java utility for faster I/O parsing: http://spc10.contest.scrool.se/doc/javaio. The utility is called Kattio.java and uses BufferedReader, StringTokenizer, and Integer.parseInt/Double.parseDouble/Long.parseLong to read numerics.
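The pattern it uses boils down to something like this sketch (the class and method names here are illustrative, not Kattio's actual API):

import java.io.*;
import java.util.StringTokenizer;

class FastIn {
    private final BufferedReader reader =
            new BufferedReader(new InputStreamReader(System.in));
    private StringTokenizer tokens = new StringTokenizer("");

    private String next() throws IOException {
        while (!tokens.hasMoreTokens()) {      // refill from the next input line
            String line = reader.readLine();
            if (line == null) throw new IOException("end of input");
            tokens = new StringTokenizer(line);
        }
        return tokens.nextToken();
    }

    int nextInt() throws IOException { return Integer.parseInt(next()); }
    double nextDouble() throws IOException { return Double.parseDouble(next()); }
}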
Very insightful post. When I worked with Java on the PC, I normally assumed Scanner was the fastest option; when I tried to use it in an AsyncTask on Android, it was the worst.
I think Android must come up with an alternative to Scanner. I was using scanner.nextFloat(), scanner.nextDouble(), and scanner.nextInt() all together, which made my life miserable. After tracing my app, I found the culprit sitting hidden.
I changed to Float.parseFloat(scanner.next()), and similarly Double.parseDouble(scanner.next()) and Integer.parseInt(scanner.next()), which certainly made my app quite a bit faster, maybe 60% faster.
If anyone has experienced the same, please post here. I'm also looking for an alternative to the Scanner API; anyone with bright ideas on reading file formats, please come forward and post here.
Hmm, I'm not seeing anything like this. I can read about 10M floats this way in 4 seconds on the desktop; it just can't be that different.
I'm trying to think of other explanations: is it perhaps blocking while reading the input stream from getAssets()? I would try reading that resource fully, timing that, and then seeing how much additional time the scanning takes.
Scanner may be part of the problem, but you need to profile your code to know. Alternatives may be faster. Here is a simple benchmark comparing Scanner and StreamTokenizer.
I had exactly the same problem. It took 10 minutes to read my 18 KB file. In the end I wrote a desktop application that converts those human-readable numbers into a machine-readable format using DataOutputStream.
The result was astonishing.
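The conversion amounts to something like this sketch (the count-prefixed layout and method names are my own illustration, not the exact tool I wrote):

import java.io.*;

// Desktop side: write the parsed floats in binary form.
static void writeBinary(float[] values, File out) throws IOException {
    try (DataOutputStream dos = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(out)))) {
        dos.writeInt(values.length);             // store the count up front
        for (float v : values) dos.writeFloat(v);
    }
}

// App side: read them back without any text parsing at all.
static float[] readBinary(InputStream in) throws IOException {
    try (DataInputStream dis = new DataInputStream(new BufferedInputStream(in))) {
        float[] values = new float[dis.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = dis.readFloat();
        }
        return values;
    }
}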
By the way, when I traced it, most of the Scanner method calls involved regular expressions, whose implementation is provided by the com.ibm.icu.** packages (the IBM ICU project). It's really overkill.
The same goes for String.format. Avoid it in Android!
Basically I need to take a text file such as:
Fred
Bernie
Henry
and be able to read them from the file in the order of
Henry
Bernie
Fred
The actual file I'm reading from is >30MB, and it would be a less-than-perfect solution to read the whole file, split it into an array, reverse the array, and then go from there. It takes way too long. My specific goal is to find the first occurrence (searching backwards) of a string (in this case "InitGame") and then return the position of the beginning of that line.
I did something like this in Python before. My method was to seek to 1024 bytes before the end of the file, read lines until I reached the end, then seek another 1024 bytes back from my previous starting point and, using tell(), stop when I got back to the previous starting point. That way I read those blocks backwards from the end of the file until I found the text I was looking for.
So far, I'm having a heck of a time doing this in Java. Any help would be greatly appreciated and if you live near Baltimore it may even end up with you getting some fresh baked cookies.
Thanks!
More info:
I need to search backwards because the file I am reading is a logfile for a game that I host a server for (it's the |err| server on Urban Terror; check it out). The log file records every event that happens in the game, and my program will parse each event, process it, and act on it (for example, it keeps track of headshots for people and also automatically kicks people who are being d-bags). I need to search back to the most recent InitGame entry so that I can instantiate all the player objects and take care of whatever else needs to be taken care of since the beginning of that game. There are hundreds of InitGame events in the file, but I want the last one. If there is a better way of doing this that doesn't require searching backwards, please let me know.
Thanks
You can just repeat your Python solution using RandomAccessFile, maybe with a custom subclass of LineNumberReader (or just Reader) on top of it.
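A rough sketch of that approach ported to RandomAccessFile; it assumes a single-byte encoding (typical for ASCII log files), and the method name and block size are illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;

// Scans backwards in fixed-size blocks (with overlap so a match straddling a
// block boundary is not missed) and returns the offset of the last occurrence
// of pattern, or -1 if it is not found.
static long findLast(RandomAccessFile file, byte[] pattern) throws IOException {
    final int block = 1024;
    byte[] buf = new byte[block + pattern.length - 1];
    long pos = file.length();
    while (pos > 0) {
        long start = Math.max(0, pos - block);
        int len = (int) Math.min(buf.length, file.length() - start);
        file.seek(start);
        file.readFully(buf, 0, len);
        for (int i = len - pattern.length; i >= 0; i--) {   // search this block back to front
            int j = 0;
            while (j < pattern.length && buf[i + j] == pattern[j]) j++;
            if (j == pattern.length) return start + i;
        }
        pos = start;
    }
    return -1;
}

Calling findLast(raf, "InitGame".getBytes()) would give you the offset to seek to before reading forward from there.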
Linux has some great text-parsing tools that may be better suited to this than Java.
On searching backwards, two answers come to mind. The first is to search forwards, and keep the last-found InitGame text around for the moment when you reach the end of the file (and overwrite it whenever another InitGame comes along as you are reading the file).
The second solution is to find the file size (using f.length()), divide it into large chunks that overlap by more than the maximum size of an InitGame snippet (to avoid missing an occurrence that straddles a chunk boundary), and start reading from the last chunk, progressing towards the file start (using a Reader's skip() method to jump to your desired reading position; no actual file-splitting is necessary). If you are sure there are no funny multi-byte chars, RandomAccessFile can be useful.
The most efficient solution, of course, is to read the log-file output as it is produced, keeping a reference to the last-found InitGame. That way you never have to re-read the same data. You can even set things up so that your Java program wakes up every few seconds, looks at the file, and reads in the newly added lines.
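A hedged sketch of that follow-the-file idea; the polling interval and method name are illustrative, and it assumes an append-only log in a single-byte encoding:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

static void follow(File log) throws IOException, InterruptedException {
    long offset = log.length();                  // start at the current end of the file
    while (true) {
        if (log.length() > offset) {             // new data has been appended
            try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
                raf.seek(offset);
                String line;
                while ((line = raf.readLine()) != null) {
                    // process the new line here (e.g., watch for InitGame)
                    System.out.println(line);
                }
                offset = raf.getFilePointer();   // remember where we stopped
            }
        }
        Thread.sleep(2000);                      // wake up every couple of seconds
    }
}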
So, TIL that I need to be more verbose when I explain exactly what I'm doing. Basically, I am writing a program that manages a game server that I run. For the program to be in sync with the game, it needs to find the most recent InitGame line and then read from there, so that it can record all the hits, kills, connects, and disconnects it needs from the beginning of the round. Since a logfile can be quite huge (the last time I forgot to clean one up it was more than 500MB of text), rather than searching from the front I want to search from the back. Java has no built-in way to do this. After searching over a good amount of the internets, I came upon this: http://mattfleming.com/node/11. From that I took the BackwardsFileInputStream class and used it; then, in my application, I reverse the chars. Next time I should be able to construct my own method, now that I've seen how it's done and have a better understanding.
So, once the program has read the logfile from the most recent InitGame, it will mimic tail -f and read the logfile as it is written.
I have a .txt file called numbers that has the numbers 1-31 in it. I've set up a scanner like this:
cardCreator = new Scanner(new File("numbers"));
After this, I'm kind of confused on which methods I could use.
I want to set up a conditional statement that will go through numbers, evaluate each line (the numbers 1-31), and include each number in a string called numbersonCard if it meets certain criteria.
I can kind of picture how this would work (maybe using hasNext(), or nextLine() or something), but I'm still a little overwhelmed by the API...
Any suggestions as to which methods I might use?
Computer science is all about decomposition. Break your problem up into smaller ones.
First things first: can you read that file and echo it back?
Next, can you break each line into its number pieces?
Then, can you process each number piece as you wish?
You'll make faster progress if you can break the problem down.
I think Java Almanac is one of the best sources of example snippets around. Maybe it can help you with some of the basics.
Building upon duffymo's excellent advice, you might want to break your problem down into high-level pseudocode. The great thing about this is, you can actually write these steps as comments in your code, and then solve each bit as a separate problem. Better still, if you break down the problem right, you can actually place each piece into a separate method.
For example:
// Open the file
// Read in all the numbers
// For each number, does it meet our criteria?
// If so, add it to the list
Now you can address each part of the problem in a somewhat isolated fashion. Furthermore, you can start to see that you might break this down into methods which can open the file (and deal with any errors thrown by the API), read in all the numbers from the file, determine whether a given number meets your criteria, etc. A little clever method naming, and your code will literally read like the pseudocode, making it more maintainable in the future.
To give a little more specific help than the excellent answers duffymo and Rob have given, your instincts are right. You probably want something like this:
cardCreator = new Scanner(new File("numbers"));
while (cardCreator.hasNextInt()) {
    int number = cardCreator.nextInt();
    // do something with number
}
cardCreator.close();
hasNextInt() and nextInt() will save you from getting a String from the Scanner and having to parse it yourself. I'm not sure whether Scanner's default delimiter treats a Windows CRLF line ending as one delimiter or two, though.