I need to parse files that may be quite large, possibly hundreds of megabytes and millions of lines. I have been trying to do this using FlatPack. I would think the way to do this would be to use the buffered parsers and the new stream methods. But, although dataset.next() returns true for the correct number of records, the Optional returned by dataset.getRecord() never contains a value.
I have looked at this example/test but it only counts the number of records and does not actually do anything with the content.
You can use the class BuffReaderParseFactory instead of DefaultParserFactory.
It will read one record from the input file only when you call "next()".
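For reference, a minimal sketch of that usage (the file path, delimiter, qualifier, and column name are placeholders; depending on the FlatPack version the parser interface is Parser or PZParser, so verify the exact signatures against the version you're using):
import java.io.FileReader;
import java.io.Reader;

import net.sf.flatpack.DataSet;
import net.sf.flatpack.Parser;
import net.sf.flatpack.brparse.BuffReaderParseFactory;

public class BufferedFlatPackExample {
    public static void main(String[] args) throws Exception {
        try (Reader data = new FileReader("/tmp/bigfile.csv")) {              // placeholder path
            // The buffered factory pulls one record from the reader each time next() is called
            Parser parser = BuffReaderParseFactory.getInstance()
                    .newDelimitedParser(data, ',', '"');
            DataSet ds = parser.parse();
            while (ds.next()) {
                // Column names are taken from the header row (or from a pzmap, if you use one)
                String value = ds.getString("FIRSTNAME");                     // placeholder column
                // ... process the record here ...
            }
        }
    }
}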
The explanations for both DefaultParserFactory and BuffReaderParseFactory are not exactly helpful. Both are documented as returning a PZParser (from newDelimitedParser), but only one of them returns an actual value from a record. Based on the examples I've seen, I think BuffReaderParseFactory is just for checking performance (hence it should be faster), and DefaultParserFactory, on the other hand, contains all the records.
I have a very long string in a text file. It is basically the below string repeated around 1000 times (as one long string, not 1000 strings). The string has variables which change with each repetition (those in bold). I'd like to extract the variables in an automated way and return the output as either a CSV or a formatted txt file (Random Bank, Random Rate, Random Product). I can do this successfully using https://regex101.com, however it involves a lot of manual copy & paste. I'd like to write a bash script to automate extracting the information, but have had no luck attempting various grep commands. How can this be done? (I'd also consider doing it in Java.)
[{"AccountName":"Random Product","AccountType":"Variable","AccountTypeId":1,"AER":Random Rate,"CanManageByMobileApp":false,"CanManageByPost":true,"CanManageByTelephone":true,"CanManageInBranch":false,"CanManageOnline":true,"CanOpenByMobileApp":false,"CanOpenByPost":false,"CanOpenByTelephone":false,"CanOpenInBranch":false,"CanOpenOnline":true,"Company":"Random Bank","Id":"S9701Monthly","InterestPaidFrequency":"Monthly"
This is JSON-formatted data, which you can't parse with regular expression engines. Get a JSON parser. If this file is larger than, say, 1 GB, find one that lets you 'stream' (which is the term for parsing and dealing with the data as it is read, versus the more usual route, which turns the entire input into an object; if the file is huge, that object would be huge and might run you out of memory, hence the need for streaming).
Here is one tutorial for Jackson-streaming.
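For what it's worth, here is a rough sketch of that streaming approach with Jackson's JsonParser; the field names (Company, AER, AccountName) come from the snippet above, and the file path is a placeholder:
import java.io.File;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class ExtractAccountFields {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser p = factory.createParser(new File("/tmp/accounts.json"))) {   // placeholder path
            String bank = null, rate = null, product = null;
            JsonToken token;
            while ((token = p.nextToken()) != null) {
                if (token == JsonToken.FIELD_NAME) {
                    String name = p.getCurrentName();
                    p.nextToken();                      // advance to the field's value
                    if ("Company".equals(name))     bank = p.getValueAsString();
                    if ("AER".equals(name))         rate = p.getValueAsString();
                    if ("AccountName".equals(name)) product = p.getValueAsString();
                }
                if (bank != null && rate != null && product != null) {
                    System.out.println(bank + "," + rate + "," + product);   // one CSV row per account
                    bank = rate = product = null;
                }
            }
        }
    }
}
Because the parser only ever holds the current token, memory use stays flat no matter how many times the block is repeated in the file.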
I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest that, given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative, if you can guarantee that both n and m are below a certain value, and that value is a reasonable size, you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
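As a rough illustration of that buffering approach (a sketch only: the file path, marker string and window size are placeholders, and the buffer simply holds the last m lines seen):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

public class LookBackBuffer {
    public static void main(String[] args) throws IOException {
        int m = 5;                                        // maximum number of lines you ever look back
        Deque<String> buffer = new ArrayDeque<>(m);

        try (Stream<String> lines = Files.lines(Paths.get("session.log"))) {   // placeholder path
            lines.forEach(line -> {
                if (line.contains("=ID: 39487")) {
                    // buffer holds up to the last m lines before this one, oldest first;
                    // pick whichever of them you need (here: the oldest, i.e. m lines above)
                    System.out.println("ID line: " + line + " | " + m + " lines above: " + buffer.peekFirst());
                }
                if (buffer.size() == m) {
                    buffer.removeFirst();                 // drop the oldest line to keep the window bounded
                }
                buffer.addLast(line);
            });
        }
    }
}
Memory use is bounded by m lines regardless of how large the file is.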
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"));
            // then do your work (add a terminal operation such as forEach)
}
It doesn't matter how big your file is, as long as n is a small number, not millions.
This is about reading a file faster, not writing it.
I have a 150MB file which has a JSON object inside it. I currently use the following code to read it:
String filename = "/tmp/fileToRead";
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));
decompressedString = reader.readLine();
reader.close();
JSONObject obj = new JSONObject(decompressedString);
JSONArray profileData = obj.getJSONObject("profileData").getJSONArray("children");
....
It is a single-line file, and since it is JSON I can't split it (or at least I think so). Reading the file gives me an OutOfMemoryError or a TLE. The file takes more than 7 seconds to be read, and that results in the TLE since the execution of the whole code cannot go beyond 7 seconds. I get the OOM on decompressedString = reader.readLine();.
Is there a way I can reduce the memory used or the time it takes to be read completely?
You have several problems at hand:
You're preemptively parsing too much.
The error you get happens while you are still reading the line, since you said "I get the OOM on decompressedString = reader.readLine();".
You should never try to read data like this line by line. BufferedReader.readLine() will block until you've read the character \r or \n or the sequence \r\n. When processing data of any length, you're never sure you'll get one of those characters. Also, you're never sure you'll get one of those characters only outside of the data itself. So your string may be too long or malformed. So don't ever pretend to know the format: BufferedReader.readLine() must be used when parsing, not when acquiring data.
You're not using an appropriate library for your use-case.
Reading your JSON is important, yes, but you're reading too much at once. When creating your JSON object, you might want to build it from a stream (an InputStream, a Reader, or one of NIO's Channels/Buffers).
Currently you're making your JSON from a String. A huge one. So I can safely assume you're going to need, at some point, twice the memory: once for the String and once for the finalized object.
To reduce that, use an appropriate library to which you can pass one of the streams mentioned above. I mentioned in my comments the following: Gson, JSON.simple and Jackson.
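As an illustration, with Jackson you could hand the parser the stream directly and skip the intermediate String entirely (a sketch; the path is the one from the question, and note this still builds the whole tree in memory, it only removes the extra String copy):
import java.io.FileInputStream;
import java.io.InputStream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadJsonFromStream {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (InputStream in = new FileInputStream("/tmp/fileToRead")) {
            // The mapper reads straight from the stream; no intermediate 150 MB String is ever built
            JsonNode root = mapper.readTree(in);
            JsonNode children = root.path("profileData").path("children");
            System.out.println("children: " + children.size());
        }
    }
}
The next point is about shrinking how much of that tree you actually need.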
Your file may be too big anyways.
You get your data, but you want to acquire only a subset of it (here, you want everything under {"profileData":{"children": <DATA>}}), and you probably have way too much. How many elements exist at the same level as profileData? How many elements exist at the same level as children? Do you know? Probably way too many. Everything that is not under profileData.children is useless. What percentage of your total data is that? 50%? 90%? 99%?
To solve this, you probably want one of two things: you want less data or you want to be able to focus your request.
If you want less data, ask your data provider to give you less: only what you need. Why get more than that? It makes no sense. Tell them so and say "I want less".
If you want focused data, use a library that allows you to both parse and reduce the amount of data. You might want to have a library that lets you say this: "parse this JSON and return only the profileData.children element". Unfortunately I know no library that does it out of the box. If others do, please add a comment or answer. Apparently, Gson is able to do so if you use the JsonReader yourself and selectively call skipValue().
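If you go the Gson route, a rough sketch of that selective approach could look like this, assuming the layout from the question (everything outside profileData.children is skipped and never materialized; the per-child handling is left as a placeholder):
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.google.gson.stream.JsonReader;

public class SelectiveProfileDataRead {
    public static void main(String[] args) throws Exception {
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(new FileInputStream("/tmp/fileToRead"), StandardCharsets.UTF_8))) {
            reader.beginObject();                         // top-level object
            while (reader.hasNext()) {
                if ("profileData".equals(reader.nextName())) {
                    reader.beginObject();
                    while (reader.hasNext()) {
                        if ("children".equals(reader.nextName())) {
                            reader.beginArray();
                            while (reader.hasNext()) {
                                reader.skipValue();       // replace with real per-child handling
                            }
                            reader.endArray();
                        } else {
                            reader.skipValue();           // anything else under profileData
                        }
                    }
                    reader.endObject();
                } else {
                    reader.skipValue();                   // siblings of profileData are never materialized
                }
            }
            reader.endObject();
        }
    }
}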
Hi,
This question is pretty similar to SingleColumnValueFilter not returning proper number of rows.
I use four SingleColumnValueFilters with operator EQUAL and add them to a FilterList with operator MUST_PASS_ONE. The number of results is the same as without setting the FilterList. The value to compare is a byte[] that should be correct, as I just store the values from previous results. (It is an IP address: when retrieving the data I convert the stored byte[] back to an InetAddress, and for the query described I just call InetAddress.getAddress, which returns a byte[].)
Do you have any ideas what might be the problem? Am I using the Filter wrong?
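For reference, the setup being described is roughly the following (a sketch only: the column family and qualifier names are placeholders, and the older CompareOp-based constructor is assumed; adjust it to whatever your HBase version expects):
import java.net.InetAddress;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IpFilterExample {
    public static Scan buildScan(List<InetAddress> ips) {
        // MUST_PASS_ONE = logical OR across the filters (four of them in the question)
        FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        for (InetAddress ip : ips) {
            SingleColumnValueFilter f = new SingleColumnValueFilter(
                    Bytes.toBytes("cf"),        // column family (placeholder)
                    Bytes.toBytes("ip"),        // qualifier (placeholder)
                    CompareOp.EQUAL,
                    ip.getAddress());           // raw 4-byte (or 16-byte) address
            list.addFilter(f);
        }
        Scan scan = new Scan();
        scan.setFilter(list);
        return scan;
    }
}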
EDIT:
I also used the original values retrieved by the query as the value for the SingleColumnValueFilter, and there was no difference in the results, so the byte[] contents can't be the problem.
I think I can give the answer myself; sorry for not debugging and checking all the HBase code before.
I just checked the implementation of the compare algorithm (which is lexicographic), and realized that the length is not taken into account. I thought shorter values would be padded with zeros, but unfortunately they are not.
The only reasonable option would be to create a custom comparator (e.g. see How do you use a custom comparator with SingleColumnValueFilter on HBase?).
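To see concretely that shorter values are not zero-padded during the lexicographic comparison, a tiny demo using HBase's Bytes utility (the byte values are arbitrary):
import org.apache.hadoop.hbase.util.Bytes;

public class LexicographicCompareDemo {
    public static void main(String[] args) {
        byte[] shorter = {10, 0, 0, 1};
        byte[] longer  = {10, 0, 0, 1, 0};
        // Lexicographic comparison: the shorter array is simply "less than" the longer one;
        // it is NOT treated as if it were padded with trailing zeros, so the result is negative, not 0.
        System.out.println(Bytes.compareTo(shorter, longer));
    }
}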
Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a Property file, or a .txt file with a .jar file.
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
The task is to pick the closest name to compare against later. Unfortunately, an identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood by having characters in the same place, but also test whether the determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope I am being clear.
Please share your experience and thoughts on this.
(Whatever approach you suggest should not depend on the number of characters the filename consists of.)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replacing all spaces and underscores with the empty string).
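For reference, a minimal sketch of that suggestion in plain Java (standard dynamic-programming Levenshtein plus the case/underscore normalization; the filenames are the ones from the question):
public class FileNameSimilarity {

    // Lowercase and strip spaces/underscores so "_m_y_f_i_l_e.txt" normalizes towards "myfile.txt"
    static String normalize(String s) {
        return s.toLowerCase().replaceAll("[\\s_]+", "");
    }

    // Classic dynamic-programming edit distance: insertions, deletions and substitutions all cost 1
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String target = normalize("myFile.txt");
        System.out.println(levenshtein(target, normalize("_myFile.txt")));               // 0
        System.out.println(levenshtein(target, normalize("_m_y_f_i_l_e.txt")));          // 0
        System.out.println(levenshtein(target, normalize("somethingReallyClever.txt"))); // large
    }
}
With this normalization, "_m_y_f_i_l_e.txt" collapses to the same string as "myFile.txt" (distance 0), while "somethingReallyClever.txt" stays far away, which matches the intuition in the question.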