extraction of multiple occurrences of variable data from large string

I have a very long string in a text file. It is basically the string below repeated around 1000 times (as one long string, not 1000 strings). The string contains variables that change with each repetition (those in bold): Random Product, Random Rate and Random Bank.
I'd like to extract the variables in an automated way and write the output to either a CSV or a formatted txt file (Random Bank, Random Rate, Random Product). I can do this successfully using https://regex101.com, but it involves a lot of manual copy and paste.
I'd like to write a bash script to automate the extraction, but have had no luck with various grep commands. How can this be done? (I'd also consider doing it in Java.)
[{"AccountName":"Random Product","AccountType":"Variable","AccountTypeId":1,"AER":Random Rate,"CanManageByMobileApp":false,"CanManageByPost":true,"CanManageByTelephone":true,"CanManageInBranch":false,"CanManageOnline":true,"CanOpenByMobileApp":false,"CanOpenByPost":false,"CanOpenByTelephone":false,"CanOpenInBranch":false,"CanOpenOnline":true,"Company":"Random Bank","Id":"S9701Monthly","InterestPaidFrequency":"Monthly"

This is JSON-formatted data, which regular expression engines can't parse reliably. Get a JSON parser. If the file is larger than, say, 1 GB, find one that lets you 'stream', i.e. handle the data as it is parsed, rather than the more usual route of turning the entire input into one object. For a huge file that object would be huge too and you might run out of memory, hence the need for streaming.
Here is one tutorial for Jackson-streaming.

Related

How can I put >1000 strings in an array & make them searchable?

I'm teaching myself to code (Java and Android) and I'm working on an Android application.
It's a language (grammar) application for analysing verbs.
Currently I have a list of 1000 verbs, each with its own set of forms, and I'm stuck on how to store them as strings without using a huge number of string variables or an overly long array.
I was thinking of an array of strings, but I'm not sure an array of 1000 strings is really practical.
I also thought of putting what I need in an Excel file and using it as the storage that the app searches, showing the results in a TextView, but again I'm not sure whether this will work on Android.
Let's say I have the below 3 verbs in infinitive
Akl
ktv
hlk
Now the first verb can come in 2 other forms (Nakl - Hikel), and the other verbs have their own forms too.
What I want is this: when a user types a verb, whether in the past or present (for example Akled "past" or hikeling "present"), the system strips the ending to isolate the verb and then uses what is left (for example: akled ---> akl) to show the other forms. In this case, if the input is "akled", the system will use "akl" and show nakl - hikel.
Example:
The user types (akled) in the text box and presses analyse.
The system will do the following:
extract the verb only (akl)
then, based on this verb, show the other forms, which are (nakl - hikel).
Is this doable with a huge number of verbs? Say each verb has only 2 other forms; then the 1000 verbs have 2000 other forms.

Don't worry about loading too many strings into memory. Strings are internally backed by an array of characters, and a char in Java takes 2 bytes. So if you were to keep 100,000 strings (each 20 characters long), the character data in a String[] of 100,000 elements would occupy about 100,000 * 20 * 2 = 4,000,000 bytes = 4 MB (per-object overhead adds to this, but not by an order of magnitude). A desktop JVM heap is usually measured in gigabytes, so loading this many strings is not a concern. Even 10 times as many, i.e. 1,000,000 strings, would take only about 40 MB. Your 3,000 short verb forms are a tiny fraction of that.
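The lookup described in the question could be sketched with two in-memory maps. This is only an illustration under assumptions: the class name VerbAnalyzer, the ENDINGS list ("ed", "ing") and the idea of mapping every known form back to its base verb are hypothetical choices, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class VerbAnalyzer {
    // Hypothetical endings to strip; a real grammar would need its own list.
    private static final String[] ENDINGS = {"ing", "ed"};

    // Maps every known form (including the base verb) to its base verb.
    private final Map<String, String> baseOf = new HashMap<>();
    // Maps a base verb to all of its forms, base verb first.
    private final Map<String, List<String>> formsOf = new HashMap<>();

    void addVerb(String base, String... otherForms) {
        List<String> all = new ArrayList<>();
        all.add(base);
        all.addAll(Arrays.asList(otherForms));
        formsOf.put(base, all);
        for (String form : all) {
            baseOf.put(form, base);
        }
    }

    // Strips one known ending, then returns the verb's other forms
    // (an empty list if the verb is unknown).
    List<String> analyse(String typed) {
        String word = typed.toLowerCase(Locale.ROOT);
        for (String ending : ENDINGS) {
            if (word.length() > ending.length() && word.endsWith(ending)) {
                word = word.substring(0, word.length() - ending.length());
                break;
            }
        }
        String base = baseOf.get(word);
        if (base == null) {
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>(formsOf.get(base));
        result.remove(word); // do not echo the form the user typed
        return result;
    }
}
```

With addVerb("akl", "nakl", "hikel") registered, analyse("akled") strips "ed" and returns the other two forms. Hash lookups stay fast whether the map holds 3 verbs or 1000.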

You can create your own string resource files in Android and store all the strings used by the app there: https://developer.android.com/guide/topics/resources/string-resource
The link above shows the format to use, where to put the strings file in your Android project, and how to use it.
This link has an example of how to get a string array from resources:
https://www.android-examples.com/get-string-array-from-strings-xml-in-android/

Load a Perl Hash into Java

I have a big .pm file, which consists only of a very big Perl hash with lots of subhashes. I have to load this hash into a Java program, do some work on and make changes to the underlying data, and save it back into a .pm file that should look similar to the one I started with.
So far I have tried converting it line by line with regexes and string matching into an XML document, and later parsing it back element by element into a Perl hash.
This somehow works, but seems quite dodgy. Is there a more reliable way to parse the Perl hash without having a Perl runtime installed?

You're quite right, it's utterly filthy. Regex and string for XML in the first place is a horrible idea, and honestly XML is probably not a good fit for this anyway.
I would suggest that you consider JSON. I would be stunned to find Java can't handle JSON, and it's inherently a hash-and-array oriented data structure.
So you can quite literally:
use JSON;
print to_json ( $data_structure, { pretty => 1 } );
Note - it won't work for serialising objects, but for perl hash/array/scalar type structures it'll work just fine.
You can then import it back into perl using:
use Data::Dumper;
my $new_data = from_json $string;
print Dumper $new_data;
You could Dumper it to a file, but given your requirement is multi-language going forward, just using native JSON as your 'at rest' data is probably a more sensible choice.
But if you're looking at parsing perl code within java, without a perl interpreter? No, that's just insanity.

Java Compress Multiple strings with the same rule

I'm creating an Android application that needs a massive database (70 MB, but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me very much, since compression libraries store the rules needed to decompress the strings as part of the compressed string.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
    DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
    byte[] data = queryDatabase(keyword);
    // decompress the stored bytes, not the keyword
    return decompresser.decompress(data);
}

The "rules", as you call them, are not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, and so is not significant compared to your 2000 byte string.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().
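In Java, the same zlib facility is exposed through setDictionary() on java.util.zip.Deflater and java.util.zip.Inflater. The following is a minimal sketch, assuming a shared dictionary byte array that you build yourself from strings common to your definitions; the class name DictionaryCompression and the buffer size are illustrative choices.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DictionaryCompression {
    // Compresses one definition, using the shared dictionary as preset history.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses one definition; the same dictionary must be supplied.
    static byte[] decompress(byte[] compressed, byte[] dictionary)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (inflater.needsDictionary()) {
                    // The stream header says a preset dictionary is required.
                    inflater.setDictionary(dictionary);
                } else if (!inflater.finished()) {
                    throw new DataFormatException("truncated or corrupt input");
                }
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The dictionary is stored once (for example as an asset next to the database) and only the compressed bytes go into the table. How much this saves depends on how much content the definitions actually share, so it is worth measuring on a sample first.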

best way of loading a large text file in java

I have a text file with a sequence of integers on each line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams from it and do some statistics on them. I do have a memory limitation (8 GB RAM). I tried to minimize the number of objects I create (I have only 1 class with 6 float variables and some methods). Each line of the file generates a number of objects of this class (proportional to the length of the line in terms of number of words). I am starting to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens in that line (e.g. 1457 is one token). So with an average of 10 words per line, each line maps to 9 objects on average, giving 9*3*10^6 objects. The memory needed is therefore: 9*3*10^6*(8 bytes obj header + 6 x 4 byte floats) + (a map(String,Objects) and another map (Integer,ArrayList(Objects))). I need to keep everything in memory, because some mathematical optimization happens afterwards.

Reading/Parsing the file:
The best way to handle large files, in any language, is to try NOT to load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line by line and discarding each line after you have processed it, again to avoid holding the entire file in memory at once.
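The line-by-line approach might look like the sketch below. It assumes that the "statistics" are simply occurrence counts of each 5-gram, which the question does not actually say; the class name FiveGramCounter is likewise made up. The HashMap with String keys is used only for brevity and is not memory-efficient for millions of entries.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class FiveGramCounter {
    static final int N = 5;

    // Reads one line at a time; only the current line and the counts stay in memory.
    static Map<String, Integer> count(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] tokens = trimmed.split("\\s+");
            // Slide a window of N tokens over the line; short lines yield nothing.
            for (int i = 0; i + N <= tokens.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + N));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For a real file, pass in a reader obtained from Files.newBufferedReader(path) inside a try-with-resources block.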
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself: if you can do whatever you need to do without keeping all of them in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know if that's possible.
Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it, or find some pattern in your data that allows you to store it more compactly.
Caching/offload: if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
A note on Java collections, and maps in particular
For small objects, Java collections, and maps in particular, incur a large memory penalty (due mostly to everything being wrapped as Objects and the existence of the Map.Entry inner class instances). At the cost of a slightly less elegant API, you should probably look at GNU Trove collections if memory consumption is an issue.
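As a rough illustration of the flyweight idea mentioned above, the six float fields of every record can live side by side in one float[] and be addressed by record index. The class name RecordStore and its accessors are hypothetical; the field count of 6 comes from the question.

```java
final class RecordStore {
    // The question describes a class with six float fields.
    static final int FIELDS = 6;

    // All records back to back: record r occupies data[r * FIELDS .. r * FIELDS + 5].
    private final float[] data;

    RecordStore(int capacity) {
        this.data = new float[capacity * FIELDS];
    }

    float get(int record, int field) {
        return data[record * FIELDS + field];
    }

    void set(int record, int field, float value) {
        data[record * FIELDS + field] = value;
    }

    int capacity() {
        return data.length / FIELDS;
    }
}
```

For the 9*3*10^6 = 27,000,000 records estimated in the question, the array holds 162,000,000 floats, i.e. 648,000,000 bytes of payload with no per-record object header or reference.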

Optimal would be to hold only integers and line ends.
To that end, one way would be: convert the file to two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int) (integersFile.length() / 4)];
int[] lineEnds = new int[(int) (lineEndsFile.length() / 4)];
Reading can be done with a MappedByteBuffer and its asIntBuffer() method. (You then would not even need the arrays, but it would become a bit COBOL-like verbose.)
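A sketch of the two-file conversion and the mapped read could look like this. The class name IntFileConverter and the file handling are assumptions for illustration. The line-ends file stores, for each line, the index in the integers file where the next line starts.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;

final class IntFileConverter {
    // Converts the text file into two binary files of 4-byte big-endian ints.
    static void convert(Path text, Path integersFile, Path lineEndsFile) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(text);
             DataOutputStream ints = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(integersFile)));
             DataOutputStream ends = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(lineEndsFile)))) {
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                try (Scanner scanner = new Scanner(line)) {
                    while (scanner.hasNextInt()) {
                        ints.writeInt(scanner.nextInt());
                        count++;
                    }
                }
                ends.writeInt(count); // index where the next line starts
            }
        }
    }

    // Maps a binary file and views it as ints without copying it onto the heap.
    static IntBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()).asIntBuffer();
        }
    }
}
```

DataOutputStream writes big-endian values and a freshly mapped buffer reads big-endian by default, so the two sides agree without any extra byte-order setup.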

Java Serialization to transfer data between any language

Question:
Instead of writing my own serialization algorithm; would it be possible to just use the built in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working is as follows: I start up a process that will be a language-specific program, written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float- just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out to do this is as follows: I would start with a 4-byte magic number, just to ensure we are working with the correct data. Following that, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte signifying the data type (one of the above), followed by any crucial information, e.g. the length for a String or Array. Then, the data that follows.
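To make the layout concrete, here is a rough sketch of it using DataOutputStream and DataInputStream. The magic value, the type tags, the UTF-8 string encoding and the class name PacketCodec are all placeholder assumptions, and only String, Integer and Float are covered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PacketCodec {
    static final int MAGIC = 0xCAFED00D; // placeholder magic number
    static final byte T_STRING = 1;      // placeholder type tags
    static final byte T_INT = 2;
    static final byte T_FLOAT = 3;

    // Layout: magic (4) | element count (4) | per element: tag (1) [length (4)] data
    static byte[] encode(Object... elements) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(MAGIC);
        out.writeInt(elements.length);
        for (Object element : elements) {
            if (element instanceof String) {
                byte[] utf8 = ((String) element).getBytes(StandardCharsets.UTF_8);
                out.writeByte(T_STRING);
                out.writeInt(utf8.length);
                out.write(utf8);
            } else if (element instanceof Integer) {
                out.writeByte(T_INT);
                out.writeInt((Integer) element);
            } else if (element instanceof Float) {
                out.writeByte(T_FLOAT);
                out.writeFloat((Float) element);
            } else {
                throw new IllegalArgumentException("unsupported type: " + element.getClass());
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    static List<Object> decode(byte[] packet) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packet));
        if (in.readInt() != MAGIC) {
            throw new IOException("bad magic number");
        }
        int count = in.readInt();
        List<Object> elements = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte type = in.readByte();
            switch (type) {
                case T_STRING:
                    byte[] utf8 = new byte[in.readInt()];
                    in.readFully(utf8);
                    elements.add(new String(utf8, StandardCharsets.UTF_8));
                    break;
                case T_INT:
                    elements.add(in.readInt());
                    break;
                case T_FLOAT:
                    elements.add(in.readFloat());
                    break;
                default:
                    throw new IOException("unknown type tag: " + type);
            }
        }
        return elements;
    }
}
```

DataOutputStream writes multi-byte values in big-endian order, so a reader in another language has to decode them as big-endian as well.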
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Because of this, a text-based format (such as JSON) may produce far larger operation times. Considering that none of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.

I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.

There is a binary JSON format called BSON.
"I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times."
Do not optimize before you have measured.
Premature optimization is the root of all evil.
Can you try it and benchmark the throughput, to see if it fits your needs?

Comparing Thrift, Protobuf, JSON and MessagePack:
Complexity of installation: Thrift >> Protobuf > BSON > MessagePack > JSON
Serialized data size: JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
Time cost: Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack
