Suppose there is an input file with tons of records; each record is one line and consists of an ID number, the time the record was created, and the record content. What is the best way to read and parse such a file?
For example, the input is:
123-456-789 1:23pm Jan 4, 2014 I AM THE CONTENT!
987-654-321 3:21pm Apr1, 2014 I AM THE CONTENT TOO!
…
To read one line at a time, I believe there is not much difference between Scanner and BufferedReader, because Scanner also has a 1 KB buffer. So may I do:
Scanner scan = new Scanner(new File("filename"))?
Then, after I get one line, should I make another Scanner object to parse the line and extract each field (I can pass the line as the input to the Scanner)? Or is there a better solution?
For experienced programmers, what is the best (fastest, best-performing) way to read and parse such a file with tons of records in the real world? Thank you!
Unless 'tons' means hundreds of millions of lines, it isn't likely to make any significant difference which you use, but you only need one Scanner object for this task, not one per line.
NB: BufferedReader has an 8 KB character buffer by default, so your only stated reason for thinking there is 'not much difference' is out the window. The fact that Scanner is a higher-level API with tokenizing features also seems to have escaped you.
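For illustration, a minimal sketch of the single-Scanner approach. The split into an id and a "rest" field is only an assumption for this example (the timestamp and content both contain spaces), so adjust the parsing to the real field layout:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class RecordReader {
    public static void main(String[] args) throws FileNotFoundException {
        // One Scanner for the whole file; no per-line Scanner objects needed.
        try (Scanner scan = new Scanner(new File("filename"))) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                // Hypothetical parsing: the id is the first whitespace-delimited
                // token, the rest holds the timestamp and the content.
                String[] parts = line.split("\\s+", 2);
                String id = parts[0];
                String rest = parts.length > 1 ? parts[1] : "";
                // ... process id and rest ...
            }
        }
    }
}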
I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering whether something like this is possible using the Stream API (Files)? Or, perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter whether you're using streams, iterators, or a plain BufferedReader).
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
Given that the files are huge, I'd suggest the second option is probably the most realistic. The third will probably give you better performance, but will require much more development effort.
As an alternative, if you can guarantee that both n and m are below a certain value, and that value is a reasonable size, you could also just keep that many lines in a buffer as you process the file, and read back through that buffer when you need to look "backwards", as in the sketch below.
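A minimal sketch of that buffered alternative, assuming an upper bound m on how far back you ever need to look; the file name is made up, and the marker string is taken from the question:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

public class SlidingWindowSearch {
    public static void main(String[] args) throws IOException {
        final int m = 5;                      // assumed maximum look-back distance
        Deque<String> window = new ArrayDeque<>(m);

        try (Stream<String> lines = Files.lines(Paths.get("sessions.log"))) {
            lines.forEach(line -> {
                if (line.contains("=ID: 39487") && window.size() == m) {
                    // Head of the window is the line m positions above the match.
                    System.out.println("Found; " + m + " lines above: " + window.peekFirst());
                }
                if (window.size() == m) {
                    window.removeFirst();     // keep only the last m lines
                }
                window.addLast(line);
            });
        }
    }
}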
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
            . /* then do your work */
}
It doesn't matter how big your file is, as long as n is a small number, not millions.
This is about reading a file faster, not writing it.
I have a 150MB file which has a JSON object inside it. I currently use the following code to read it:
String filename ="/tmp/fileToRead";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));
String decompressedString = reader.readLine();
reader.close();
JSONObject obj = new JSONObject(decompressedString);
JSONArray profileData = obj.getJSONObject("profileData").getJSONArray("children");
....
It is a single-line file, and since it is JSON I can't split it (or at least I think so). Reading the file gives me an OutOfMemoryError or a TLE. The file takes more than 7 seconds to read, and that results in the TLE, since the execution of the whole code cannot go beyond 7 seconds. I get the OOM on decompressedString = reader.readLine();.
Is there a way I can reduce the memory used or the time it takes to be read completely?
You have several problems at hand:
You're preemptively parsing too much.
The error you get happens already when you read the line since you said "I get the OOM on decompressedString = reader.readLine();".
You should never try to read such data line by line. BufferedReader.readLine() will block until it has read the character \r or \n or the sequence \r\n. When processing data of arbitrary length, you're never sure you'll get one of those characters, and you're never sure the ones you do get aren't inside the data itself. So your string may be too long or malformed. Don't pretend to know the format: BufferedReader.readLine() should be used when parsing text, not when acquiring raw data.
You're not using an appropriate library for your use-case
Reading your JSON is important, yes, but you're reading too much at once. When creating your JSON object, build it from a stream (an InputStream, a Reader, or one of NIO's Channels/Buffers).
Currently you're building your JSON from a String. A huge one. So I can safely assume that at some point you'll need twice the memory: once for the String and once for the finalized object.
To reduce that, use an appropriate library to which you can pass one of the streams mentioned above. I mentioned the following in my comments: Gson, JSON.simple and Jackson.
Your file may be too big anyway.
You have your data, but you only want a subset of it (here, everything under {"profileData":{"children": <DATA>}}), and you probably have far too much besides that. How many elements exist at the same level as profileData? How many at the same level as children? Do you know? Probably way too many. Everything that is not under profileData.children is useless. What percentage of your total data is that? 50%? 90%? 99%?
To solve this, you probably want one of two things: you want less data or you want to be able to focus your request.
If you want less data, ask your data provider to give you less: only what you need. Why fetch more than that? It makes no sense. Tell them: "I want less".
If you want focused data, use a library that lets you both parse and reduce the amount of data, i.e. one that lets you say "parse this JSON and return only the profileData.children element". Unfortunately I don't know of a library that does this out of the box; if others do, please add a comment or answer. Apparently, Gson can do it if you use the JsonReader yourself and selectively call skipValue().
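For illustration, a rough sketch of that selective-skipping approach with Gson's streaming JsonReader; the field names come from the question above, and the per-child handling is left as a placeholder:

import com.google.gson.stream.JsonReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SelectiveJsonRead {
    public static void main(String[] args) throws IOException {
        String filename = "/tmp/fileToRead";
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
            reader.beginObject();                          // top-level {
            while (reader.hasNext()) {
                if ("profileData".equals(reader.nextName())) {
                    reader.beginObject();                  // profileData {
                    while (reader.hasNext()) {
                        if ("children".equals(reader.nextName())) {
                            reader.beginArray();           // children [
                            while (reader.hasNext()) {
                                reader.skipValue();        // replace with real per-child handling
                            }
                            reader.endArray();
                        } else {
                            reader.skipValue();            // ignore siblings of "children"
                        }
                    }
                    reader.endObject();
                } else {
                    reader.skipValue();                    // ignore siblings of "profileData"
                }
            }
            reader.endObject();
        }
    }
}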
I understand there are two ways to read big text files in Java: one using a Scanner and one using a BufferedReader.
Scanner reader = new Scanner(new FileInputStream(path));
while (reader.hasNextLine()) {
    String tempString = reader.nextLine();
    System.out.println(java.lang.Runtime.getRuntime().totalMemory() / (1024 * 1024.0));
}
And the number to be printed is always stable around some value.
However, when I use a BufferedReader as per the edit below, the number is not stable: it may suddenly increase (by about 20 MB) on one line and then remain the same for many lines (around 8000), and the pattern repeats.
Anyone knows why?
UPDATE
I typed the second method, using BufferedReader, wrong. Here is what it should be:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path)), 5 * 1024 * 1024);
for (String s = null; (s = reader.readLine()) != null; ) {
    System.out.println(java.lang.Runtime.getRuntime().totalMemory() / (1024 * 1024.0));
}
or using while loop
String s;
while ((s = reader.readLine()) != null) {
    System.out.println(java.lang.Runtime.getRuntime().totalMemory() / (1024 * 1024.0));
}
To be more specific, here is the result of a test case reading a 250 MB file.
Scanner case:
linenumber---totalmemory
5000---117.0
10000---112.5
15000---109.5
20000---109.5
25000---109.5
30000---109.5
35000---109.5
40000---109.5
45000---109.5
50000---109.5
BufferedReader case:
linenumber---totalmemory
5000---123.0
10000---155.5
15000---155.5
20000---220.5
25000---220.5
30000---220.5
35000---220.5
40000---220.5
45000---220.5
50000---211.0
However, the Scanner is slow, and that's why I'm trying to avoid it.
And in the BufferedReader case, the total memory increases suddenly on a single random line.
Just by itself, a Scanner is not particularly good for big text files.
Scanner and BufferedReader are not directly comparable. You can use a BufferedInputStream in a Scanner; then you'll have the same thing, with the Scanner adding a lot more "stream" reading functionality than just lines.
Looking at totalMemory isn't particularly useful. To cite the Javadoc: "Returns the total amount of memory in the Java virtual machine. The value returned by this method may vary over time, depending on the host environment."
Try freeMemory, which is a little more interesting, as it reflects the phases of GC that occur every now and then.
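For example, a small sketch that prints the heap actually in use (total minus free) at a couple of points; where you call it while reading is up to you:

public class MemorySnapshot {
    // Prints the heap actually in use (total minus free), in MB.
    static void printUsedMemory(String label) {
        Runtime rt = Runtime.getRuntime();
        double usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024.0);
        System.out.println(label + ": " + usedMb + " MB in use");
    }

    public static void main(String[] args) {
        printUsedMemory("before reading");
        // ... read some lines here ...
        printUsedMemory("after reading");
    }
}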
Later
A comment on Scanner being slow: reading a line merely requires scanning the characters for the line separator, and that's how BufferedReader does it. Scanner, however, cranks up java.util.regex.Matcher for this task (as it fits better into its overall design). Using a Scanner just to read lines is breaking a butterfly on a wheel.
So I have a file to read in, and I know how the data will be laid out. For example, I know that the first token of each new line is going to be a double.
I had been using a Scanner and simply calling scan.nextDouble() to read each double; however, I was told to use Double.parseDouble(scan.next()) instead, which sped up reading the data from the file from 30 seconds down to ~5 seconds.
The same happened with scan.nextInt() vs. Integer.parseInt(scan.next()).
In the file I was reading, each line went int double int int, for about 40,000 lines.
So what makes it so much faster?
It's because scan.nextDouble() does more than just parse: it first checks that the next token matches a locale-aware, double-like pattern and then handles locale details before converting. For example, with the US locale and the token
s = "1,234.5"
scan.nextDouble() returns 1234.5, but Double.parseDouble(scan.next()) throws a NumberFormatException because of the group separator.
You will find more details in the source code.
The Scanner next<Type> methods do additional work besides simply reading in the next token and calling the appropriate parser. First they check the token against a regular expression to confirm it is valid for that type, then they massage it to deal with locale-specific bits (such as the group separator, decimal separator, etc.), and only then pass it to the parser.
If you are sure that your input is in the exact format you describe and you don't need to account for any potential differences caused by the input coming from a different locale, etc., then by all means use the optimization you were informed of.
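As a small sketch of the difference (the sample line is made up to match the int double int int layout described in the question):

import java.util.Scanner;

public class TokenParsing {
    public static void main(String[] args) {
        String line = "7 3.25 42 1"; // hypothetical record: int double int int

        // Scanner's typed methods: per-token regex validation plus locale handling.
        try (Scanner slow = new Scanner(line)) {
            int a = slow.nextInt();
            double b = slow.nextDouble();
            int c = slow.nextInt();
            int d = slow.nextInt();
        }

        // Plain tokenizing plus direct parsing: skips the extra per-token work,
        // but assumes the input already uses the plain '.' decimal format.
        try (Scanner fast = new Scanner(line)) {
            int a = Integer.parseInt(fast.next());
            double b = Double.parseDouble(fast.next());
            int c = Integer.parseInt(fast.next());
            int d = Integer.parseInt(fast.next());
        }
    }
}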
I am learning the basics of Java I/O and cannot find what I would expect to be covered in any basic discussion of I/O in Java: without getting into subtleties or complexities (unless necessary), what is the basic explanation of when you would choose one versus the other for output to a file (Formatter vs. FileOutputStream)?
I assume the same explanation will hold for Scanner vs. FileInputStream.
You use an OutputStream (possibly a FileOutputStream) to write bytes.
You use a Formatter to write formatted text.
The first is very efficient, but you have to know what bytes to write. The second gives you flexible formatting features, but it is limited in what it can write and is likely to be less efficient than the first.
The Formatter and Scanner constructors that take file specifications as arguments are just a convenience for combining a file output or input stream with a Formatter or Scanner that operates on a stream. Use them whenever you were going to wrap your stream in a Formatter or Scanner anyway and you have no separate need for the stream object.
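A short sketch of both, with made-up file names, just to show where each fits:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Formatter;

public class OutputChoices {
    public static void main(String[] args) throws IOException {
        // Raw bytes: you decide exactly what goes into the file.
        try (FileOutputStream out = new FileOutputStream("raw.bin")) {
            out.write(new byte[] {0x48, 0x69});   // writes the two bytes for "Hi"
        }

        // Formatted text: printf-style convenience over an underlying file stream.
        try (Formatter fmt = new Formatter("report.txt")) {
            fmt.format("record %d at %s: %s%n", 1, "1:23pm", "I AM THE CONTENT!");
        }
    }
}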