I have a number of text files which are in a fixed, repeated format like:
Q 32,0 16
q 27
b 21
I 0
P 1
d 0
m 31,0
Q 48,0 16
q 27
b 2
I 2
P 1
d 0
m 31,0
.
.
.
I want to parse them in Java. What I want to know is the fastest method to parse such a text file. I can change the output format of the text file if that helps with the performance, as the only requirement here is speed of parsing.
I can use external libraries too.
The fastest parsing you will get is with a binary format. I suggest you use native byte order; you should be able to read about 20 million entries per second for this sort of data.
An example of reading and writing binary data with a high throughput AND low latency is here.
https://github.com/peter-lawrey/Java-Chronicle
This format is designed to be read as it is written (with less than one microsecond of latency between processes).
You could use a simpler format than this as I suspect all you need is high throughput. ;)
BTW: The library supports GC-less reading and writing of text, such as long and double values, directly to/from a memory-mapped ByteBuffer. As such it can be used as a fast text logger supporting over one million realistic text messages per second.
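To make the binary idea concrete, here is a minimal, library-free sketch of what reading could look like, assuming you converted each text entry into a fixed-size record of 4-byte ints written in native byte order; the file name and the number of ints per entry are made up for illustration:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class FastBinaryReader {
    // Hypothetical layout: each entry flattened to a fixed number of 4-byte ints.
    static final int INTS_PER_ENTRY = 10;

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("entries.bin", "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.order(ByteOrder.nativeOrder()); // native byte order: no byte swapping on read
            IntBuffer ints = buffer.asIntBuffer();

            long entries = 0;
            int[] entry = new int[INTS_PER_ENTRY];
            while (ints.remaining() >= INTS_PER_ENTRY) {
                ints.get(entry);   // bulk read of one fixed-size record
                entries++;         // real processing would go here
            }
            System.out.println("read " + entries + " entries");
        }
    }
}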
Related
I have a very long string in a text file. It is basically the below string repeated around 1000 times (as one long string, not 1000 strings). The string has variables which change with each repetition (those in bold). I'd like to extract the variables in an automated way, and return the output into either a CSV or formatted txt file (Random Bank, Random Rate, Random Product). I can do this successfully using https://regex101.com, however it involves a lot of manual copy & paste. I'd like to write a bash script to automate extracting the information, but have had no luck attempting various grep commands. How can this be done? (I'd also consider doing it in Java.)
[{"AccountName":"Random Product","AccountType":"Variable","AccountTypeId":1,"AER":Random Rate,"CanManageByMobileApp":false,"CanManageByPost":true,"CanManageByTelephone":true,"CanManageInBranch":false,"CanManageOnline":true,"CanOpenByMobileApp":false,"CanOpenByPost":false,"CanOpenByTelephone":false,"CanOpenInBranch":false,"CanOpenOnline":true,"Company":"Random Bank","Id":"S9701Monthly","InterestPaidFrequency":"Monthly"
This is JSON-formatted data, which you can't parse with regular expression engines. Get a JSON parser. If this file is larger than, say, 1 GB, find one that lets you 'stream' (which is the term for parsing the input and dealing with the data as it parses, versus the more usual route which turns the entire input into an object; if the file is huge, that object would be huge and you might run out of memory, hence the need for streaming).
Here is one tutorial for Jackson-streaming.
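As a rough sketch of what the streaming approach could look like with Jackson's JsonParser: the file name is made up, and the fields pulled out (Company, AER, AccountName) are simply taken from the snippet above.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;

public class AccountExtractor {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        // "accounts.json" is a made-up name for the file holding the repeated objects
        try (JsonParser parser = factory.createParser(new File("accounts.json"))) {
            String bank = null, rate = null, product = null;
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                if (token == JsonToken.FIELD_NAME) {
                    String field = parser.getCurrentName();
                    parser.nextToken(); // advance to the field's value
                    if ("Company".equals(field)) {
                        bank = parser.getText();
                    } else if ("AER".equals(field)) {
                        rate = parser.getText();
                    } else if ("AccountName".equals(field)) {
                        product = parser.getText();
                    }
                } else if (token == JsonToken.END_OBJECT
                        && bank != null && rate != null && product != null) {
                    System.out.println(bank + "," + rate + "," + product); // one CSV row
                    bank = rate = product = null;
                }
            }
        }
    }
}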
I am using a TCP Client/Server to send Cap'n Proto messages from C++ to Java.
Sometimes the receiving buffer may be overfilled or underfilled, and to handle these cases we need to know the message size.
When I check the size of the buffer in Java I get 208 bytes, but calling
MyModel.MyMessage.STRUCT_SIZE.total()
returns 4 (not sure what unit of measure is being used here).
I notice that 4 divides into 208, 52 times. But I don't know of a significant conversion factor using 52.
How do I check the message size in Java?
MyMessage.STRUCT_SIZE represents the constant size of that struct itself (measured in 8-byte words), but if the struct contains non-trivial fields (like Text, Data, List, or other structs) then those take up space too, and the amount of space they take is not constant (e.g. Text will take space according to how long the string is).
Generally you should try to let Cap'n Proto directly write to / read from the appropriate ByteChannels, so that you don't have to keep track of sizes yourself. However, if you really must compute the size of a message ahead of time, you could do so with something like:
ByteBuffer[] segments = message.getSegmentsForOutput();
int total = (segments.length / 2 + 1) * 8; // segment table
for (ByteBuffer segment : segments) {
    total += segment.remaining();
}
// now `total` is the total number of bytes that will be
// written when the message is serialized.
On the C++ side, you can use capnp::computeSerializedSizeInWords() from serialize.h (and multiply by 8).
But again, you really should structure your code to avoid this, by using the methods of org.capnproto.Serialize with streaming I/O.
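For illustration, a minimal sketch of that streaming approach; the SocketChannel is assumed to be the one backing your TCP connection:

import java.io.IOException;
import java.nio.channels.SocketChannel;
import org.capnproto.MessageBuilder;
import org.capnproto.MessageReader;
import org.capnproto.Serialize;

public class CapnpStreaming {
    // Sending side: Serialize.write emits the segment table followed by the
    // segments, so the receiver never needs to know the size in advance.
    static void send(SocketChannel channel, MessageBuilder message) throws IOException {
        Serialize.write(channel, message);
    }

    // Receiving side: Serialize.read consumes exactly one framed message.
    static MessageReader receive(SocketChannel channel) throws IOException {
        return Serialize.read(channel);
    }
}

On the receiving side, reader.getRoot(MyModel.MyMessage.factory) then gives you the root struct.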
Details: read data from EEPROM -> output to Tera Term -> save off a log file -> parse through it with a Java program.
What I have: All EEPROM reads are good. I then take the hex value I read and, using sprintf (in Atmel Studio), turn each byte into its 2 respective ASCII characters. Then I send this out to Tera Term. Output is as follows:
00=00=00=c5=03=76=00=01=00=05=00=cf=00=01=fa=ef=
00=00=00=c6=00=44=00=01=00=05=00=cf=00=00=fe=21=
00=00=00=c8=02=41=00=01=00=05=00=d0=00=01=fc=20=
etc...
I can then parse through it in this manner using a Java program I slightly modified:
Seconds: 0x15150380 Milliseconds: 0x0062 Cycle Count: 0x0001 Assert Code: 0x0005 Parameter: 0x00d1 Data Value: 0x006c Checksum: 0xfa5e
(first 4 bytes are seconds, next 2 are milliseconds, etc.)
Next:
For starters I would just like to read each line (1 log) into a byte array so I can verify packet with checksum at end, etc.
My questions:
1) How do I read that type of output into an array? (A sketch follows below.)
2) Would it be better/easier to output the data to Tera Term in a different manner? If so, any pointers are appreciated.
Completely new to Java so trying to piece through this...
Thanks for the help.
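For question 1, a minimal sketch of reading each log line into a byte array might look like this; the log file name is made up, and it assumes every line is a run of '='-separated two-digit hex values as in the output above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class LogParser {
    // Parse one Tera Term line such as "00=00=00=c5=03=76=...=" into raw bytes.
    static byte[] parseLine(String line) {
        String[] hexPairs = line.split("=");
        byte[] bytes = new byte[hexPairs.length];
        int n = 0;
        for (String hex : hexPairs) {
            if (!hex.trim().isEmpty()) {
                bytes[n++] = (byte) Integer.parseInt(hex.trim(), 16);
            }
        }
        return Arrays.copyOf(bytes, n);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("eeprom_log.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                byte[] record = parseLine(line);
                // record[0..3] = seconds, record[4..5] = milliseconds, etc.
                // checksum verification would go here
                System.out.println(record.length + " bytes read");
            }
        }
    }
}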
I have a text file, with a sequence of integer per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines of this format. I have to load this file into memory, extract 5-grams out of it, and do some statistics on them. I have a memory limitation (8 GB RAM). I tried to minimize the number of objects I create (I only have one class with 6 float variables and some methods). Each line of that file basically generates a number of objects of this class (proportional to the size of the line in terms of number of words). I am starting to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of tokens in that line separated by spaces (e.g. 1457 is one token). Considering an average of 10 words per line, each line gets mapped to 9 objects on average, so there will be 9 * 3 * 10^6 objects. The memory needed is therefore: 9 * 3 * 10^6 * (8-byte object header + 6 x 4-byte floats) + (a map (String, Objects) and another map (Integer, ArrayList(Objects))). I need to keep everything in memory, because there will be some mathematical optimization happening afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself - if you can perform whatever it is you want to perform without keeping all of the objects in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know whether that's possible.
Compression of some sort - switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.
Caching/offload - if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
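As an illustration of the flyweight/primitive-array option above, here is a minimal sketch; the class and field layout are made up (six floats per entry, matching the class described in the question):

public class GramTable {
    private static final int FIELDS = 6;   // six float "fields" per entry
    private final float[] data;            // all entries packed back to back

    public GramTable(int capacity) {
        this.data = new float[capacity * FIELDS];
    }

    public void set(int entry, int field, float value) {
        data[entry * FIELDS + field] = value;
    }

    public float get(int entry, int field) {
        return data[entry * FIELDS + field];
    }
}

This avoids the per-object header and reference overhead of tens of millions of small objects, because only one large array is ever allocated.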
A note on Java collections, and maps in particular
For small objects, Java collections (and maps in particular) incur a large memory penalty, due mostly to everything being wrapped as Objects and to the Map.Entry inner class instances. At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
Optimal would be to hold only integers and line ends.
To that end, one way would be: convert the file to two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int)integersFile.length() / 4];
int[] lineEnds = new int[(int)lineEndsFile.length() / 4];
Reading can then be done with MappedByteBuffer.asIntBuffer(). (You then would not even need the arrays, but the code would become a bit COBOL-like in its verbosity.)
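A rough sketch of the conversion step; the file names are made up. Note that DataOutputStream writes big-endian ints, which matches the default byte order of the buffer you would read back with asIntBuffer():

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Scanner;

public class ToBinary {
    public static void main(String[] args) throws IOException {
        // "integers.bin" holds the 4-byte ints; "lineEnds.bin" holds, for every line,
        // the index in integers.bin where the next line starts.
        try (Scanner in = new Scanner(new File("ngrams.txt"));
             DataOutputStream ints = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("integers.bin")));
             DataOutputStream ends = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("lineEnds.bin")))) {
            int count = 0;
            while (in.hasNextLine()) {
                for (String token : in.nextLine().trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        ints.writeInt(Integer.parseInt(token));
                        count++;
                    }
                }
                ends.writeInt(count); // index where the next line starts
            }
        }
    }
}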
I am processing a file which I need to split based on the separator.
The following code shows the separators defined for the files I am processing
private static final String component = Character.toString((char) 31);
private static final String data = Character.toString((char) 29);
private static final String segment = Character.toString((char) 28);
Can someone please explain the significance of these specific separators?
Looking at the ASCII codes, these separators are file, group and unit separators. I don't really understand what this means.
Found this here. Cool website!
28 – FS – File separator: The file separator FS is an interesting control code, as it gives us insight into the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.
29 – GS – Group separator: Data storage was one of the main reasons for some control codes to get into the ASCII definition. Databases are most of the time set up with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.
30 – RS – Record separator: Within a group (or table) the records are separated with RS, the record separator.
31 – US – Unit separator: The smallest data items to be stored in a database are called units in the ASCII definition. We would call them fields now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited, as in the sixties, this is a good way to preserve valuable space. On the other hand, serial storage is far less efficient than the table-driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...
These ASCII separator control characters occupy codes 28 through 31 (0x1C to 0x1F):
31 Unit Separator
30 Record Separator
29 Group Separator
28 File Separator
Sample invocation:
char record_separator = 0x1E; // ASCII 30 (RS); 0x1F would be the unit separator
String s = "hello" + record_separator + "world";
These characters are control characters. They're not meant to be written or read by humans, but by computers. You should treat them in your program like any other character.
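A minimal example of splitting on these separators, reusing the constants from the question; the record content itself is made up:

public class SeparatorDemo {
    private static final String component = Character.toString((char) 31); // US - unit separator
    private static final String data = Character.toString((char) 29);      // GS - group separator
    private static final String segment = Character.toString((char) 28);   // FS - file separator

    public static void main(String[] args) {
        // Fields within one record are joined with the unit separator
        String record = "ACME Bank" + component + "1.25" + component + "Easy Saver";
        for (String field : record.split(component)) {
            System.out.println(field);
        }
    }
}

Because these code points are not regex metacharacters, they can be passed straight to String.split, just like any other delimiter character.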