I am using proto3 (Java) in my projects. I have some huge protobufs with smaller messages embedded in them. Is there a way I can achieve partial decoding of only the few nested sub-messages that I want to look at? The issue I am having is that I need to join this huge proto-based record data with other records, but my joins are based on very small sub-messages, so I don't want to decode the entire huge protobuf; I'd like to decode only the nested message (a string id) to perform the join, and then decode the entire protobuf only for the joined data.
I tried the [lazy=true] field option, but I don't see any difference in the generated code; I also benchmarked deserialization time with and without the lazy keyword, and it didn't seem to have any effect. Is this feature on by default for all fields? Or is this even possible? I do see there are a few classes like LazyFields.java and test cases in the protobuf GitHub repo, so I assume this feature has been implemented.
For those who happen to look at this conversation later and find it hard to understand, here's what Marc's talking about:
If your object is something like
message MyBigMessage {
  string id = 1;
  int32 sourceType = 2;
  // ... and many other fields here that would be expensive to parse
}
And you get a block of bytes that you have to parse. But you want to only parse messages from a certain source and maybe match a certain id range.
You could first parse those bytes with another message as:
message MyFilterMessage {
  string id = 1;         // field number has to be 1, to match MyBigMessage
  int32 sourceType = 2;  // field number has to be 2, to match MyBigMessage
  // ... and NOTHING ELSE here
}
And then you could look at sourceType and id. If they match whatever you are filtering for, you could go and parse the bytes again, but this time using MyBigMessage to parse the whole thing.
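A minimal sketch of that two-pass approach (assuming both messages were compiled with protoc from the definitions above; WANTED_SOURCE and the id check are placeholders):

// First pass: parse only the cheap filter view; unknown fields are skipped over
static void maybeJoin(byte[] bytes) throws com.google.protobuf.InvalidProtocolBufferException {
    MyFilterMessage filter = MyFilterMessage.parseFrom(bytes);
    if (filter.getSourceType() == WANTED_SOURCE && filter.getId().startsWith("some-prefix")) {
        // Second pass: only matching records pay for the full parse
        MyBigMessage full = MyBigMessage.parseFrom(bytes);
        // ... join 'full' with the other record set here
    }
}

This works because the protobuf wire format lets a parser skip over fields it does not know about, so the filter parse only materializes the two small fields.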
One other thing to know:
FYI: As of 2017, lazy parsing was disabled in Java (except for MessageSet), according to this post:
https://github.com/protocolbuffers/protobuf/issues/3601#issuecomment-341516826
I don't know the current status. Too lazy to try to find out! :-)
I am using the Json Patch library to perform a Patch operation over REST. Now I have the following JSON document:
{
  "id": 1,
  "ref": { "r1": 1, "r2": 2 },        // header level
  "child": [
    {
      "childId": 1,
      "ref": { "cc1": 1, "cc2": 2 }   // line level
    },
    {
      "childId": 2,
      "ref": { "cc3": 2 }             // line level
    }
  ]
}
Now, as per the JSON Patch doc, at the header level we can update the ref r1 using the path /ref/r1.
Now I am trying to perform an operation on the line-level child ref. Since child is an array, I can use the path /child/0/ref/cc1. But as can be seen from the path, I have to specify the index as well, which is 0 in this case.
Asking API consumers to supply the index of the array element is awkward. So is there any way to customize JSON Patch so that we can bypass the index requirement, or what are the other ways to handle this scenario?
I'm not an expert in JSON Patch; I've just read about it.
From what I understood, the most important part is to let the API consumers access your JSON without giving them an index.
I think a HashMap would help in this case: get the index of each element and generate a specific ID for it, then save the pairs in the map, so each index has its own ID.
A sample:
Map<String, Integer> elementIndex = new HashMap<>(); // key: generated UUID, value: element index
You can choose whatever data types you want; they don't have to be String.
In this case the index number doesn't matter; it is all about the fixed UUID.
So the path will be /child/{UUID}/ref/cc1. When you receive the path, you can extract the UUID and replace it with its element index, and you have the correct path, /child/0/ref/cc1.
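A minimal sketch of that lookup-and-replace step (the map contents and helper are made up for illustration):

import java.util.Map;

// Turns "/child/{UUID}/ref/cc1" into "/child/0/ref/cc1" using the UUID-to-index map
static String resolveChildPath(String path, Map<String, Integer> elementIndex) {
    String[] parts = path.split("/"); // ["", "child", "{UUID}", "ref", "cc1"]
    Integer idx = elementIndex.get(parts[2]);
    if (idx != null) {
        parts[2] = String.valueOf(idx); // swap the UUID for the real array index
    }
    return String.join("/", parts);
}

This assumes the UUID always sits in the second path segment; a real implementation would validate the path first.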
And if you want to know how to pass a dynamic value into a JSON object, there are multiple ways to do it;
this question will help:
How to pass dynamic value to a JSON String, -Convert the JSONObject to String before-
NOTE: It is not necessary to replace the UUID with an index; you can do it whatever way you like.
And I believe there are better answers if someone knows more about JSON Patch.
I hope that was helpful, or at least gives you an idea of how to solve it.
I was searching for free translation dictionaries. Freedict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look following:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in python, but java, c, c++ are also ok) to understand how to handle these files.
This comes late; however, I hope it can be useful for others like me.
JGoerzen wrote the dictdlib library. You can see there in more detail how he parses the .index and .dict files.
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py
dictd considers its format of .index and .dict[.dz] as private, to reserve itself the right to change it in the future.
If you want to process it directly anyway: the index contains the headwords, and the .dict[.dz] file contains the definitions. The latter is optionally compressed with a specially modified gzip algorithm providing almost random access, which plain gzip normally does not. The index contains three tab-separated columns per line:
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
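For illustration, a minimal Java sketch of decoding one index line (this assumes dictd's base64 digit alphabet A-Za-z0-9+/ with the most significant digit first, as described in the dict(8) man page; verify against your files):

public class DictIndex {
    // dictd's base64 digits: A=0 ... 9=61, +=62, /=63; most significant digit first
    static final String B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static long decodeB64(String s) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = value * 64 + B64.indexOf(c);
        }
        return value;
    }

    // A line like "a bad risc\tBHa\tu" -> headword, byte offset, length
    public static void main(String[] args) {
        String[] cols = "a bad risc\tBHa\tu".split("\t");
        String headword = cols[0];
        long offset = decodeB64(cols[1]); // where the definition starts in the .dict file
        long length = decodeB64(cols[2]); // how many bytes it occupies
        System.out.println(headword + " @ " + offset + ", " + length + " bytes");
    }
}

With offset and length you can then seek into an uncompressed .dict file (e.g. with RandomAccessFile) and read exactly that one definition.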
For more details, see the dict(8) man page (section Database Format), which you should have found in your research before asking your question. For processing the headwords correctly, you'd have to consider encoding and character collation.
It would probably be better to use an existing library to read dictd databases, but that really depends on whether the library is any good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff and no need to bother parsing anything.
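For illustration, a minimal XPath sketch (the element names entry, orth, and cit/quote are assumptions about the TEI layout FreeDict sources use; check your file's actual structure, and note that a declared default namespace would require a NamespaceContext):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FreedictXPathDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("deu-eng.tei");
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList entries = (NodeList) xp.evaluate("//entry", doc, XPathConstants.NODESET);
        for (int i = 0; i < entries.getLength(); i++) {
            Node entry = entries.item(i);
            String headword = xp.evaluate(".//orth", entry);         // assumed headword element
            String translation = xp.evaluate(".//cit/quote", entry); // assumed translation element
            System.out.println(headword + " -> " + translation);
        }
    }
}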
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...
I'm new to Java programming, and I ran into this problem:
I'm creating a program that reads a .csv file, converts its lines into objects and then manipulate these objects.
To be more specific, the application reads every line, gives it an index, and also reads certain values from those lines and stores them in TRIE trees.
The application then can read indexes from the values stored in the trees and then retrieve the full information of the corresponding line.
My problem is that, even though I've been researching for the last couple of days, I don't know how to write these structures to binary files, nor how to read them back.
I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.
For the tree writing, I was looking for something like this (in C)
fwrite(tree, sizeof(struct TrieTree), 1, file)
For the "binary indexed file", I was thinking on writing objects like the TRIEs, and maybe reading each object until I've read enough to reach the corresponding index, but this probably wouldn't be very efficient.
Recapitulating: I need help with writing and reading objects in binary files, and with how to create an indexed file.
I think you are (for starters) best off trying to do this with serialization.
Here is just one example from stackoverflow: What is object serialization?
(I think copy&paste of the code does not make sense, please follow the link to read)
Admittedly this does not yet solve your index creation problem.
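For that part, one common approach (purely an illustrative sketch, not something from the linked answer) is to remember each record's byte offset while writing and seek back to it later:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class IndexedFileDemo {
    public static void main(String[] args) throws IOException {
        List<Long> offsets = new ArrayList<>(); // this list is your index
        try (RandomAccessFile file = new RandomAccessFile("lines.bin", "rw")) {
            // Write each line as length-prefixed UTF bytes, remembering where it starts
            for (String line : new String[] { "first,line", "second,line" }) {
                offsets.add(file.getFilePointer());
                file.writeUTF(line);
            }
            // Read back only record 1, without touching record 0
            file.seek(offsets.get(1));
            System.out.println(file.readUTF());
        }
    }
}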
Here is an alternative to Java native serialization, Google Protocol Buffers.
I am going to quote the documentation directly for most of this answer, so be sure to follow the link at the end if you are interested in more details.
What is it:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
In other words, you can serialize your structures in Java and deserialize them in .NET, Python, etc. That is something you don't get with Java native serialization.
Performance:
This may vary according to the use case, but in principle GPB should be faster, as it's built with performance and interchangeability in mind.
Here is stack overflow link discussing Java native vs GPB:
High performance serialization: Java vs Google Protocol Buffers vs ...?
How does it work:
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes.
You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
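Reading it back is symmetric; the generated class also has parseFrom methods for streams and byte arrays:

Person john2 = Person.parseFrom(new FileInputStream(args[0]));
System.out.println(john2.getName()); // "John Doe"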
Read all about it here:
https://developers.google.com/protocol-buffers/
You could look at GPB as an alternative format to XSD describing XML structures, just more compact and with faster serialization.
I am trying to implement simple auto-completion for a command-line SQL client. I am using ANTLR to generate a parser in the rest of the application, and I wanted to reuse the grammar for autocompletion. My idea is:
- Parse the incomplete statement when the user asks for completion (for example select a from)
- Get from the parser the list of tokens that were expected when it raised a NoViableAltException
From this list of tokens I then wanted to do:
if (is_reserved_word) { propose it for completion }
else { notify the user that an identifier is expected }
This in principle looked like a sensible idea (to me at least), and I found this:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=11567208 which convinced me it was possible
However, after doing some testing, I realized that not that many tokens were in state.following[state._fsp];
for example, for an input of create it contained only ';',
when my grammar for this part looks like:
root : statement? (SEMICOLON!)? EOF!;
statement : create | ...;
create : CREATE | ( TABLE table_create | USER user_create | ....);
So I was confused and looked at the generated code:
try {
    int alt6 = 16;
    alt6 = dfa6.predict(input);
    switch (alt6) {
        case 1 :
            {
                root_0 = (CommonTree) adaptor.nil();
                pushFollow(FOLLOW_create_in_statement1088);
                create8 = create();
                state._fsp--;
                adaptor.addChild(root_0, create8.getTree());
            }
            break;
        case 2 :
            ...
So then it made sense to me: the parser tries to read the next token and then, from this token, finds (via the switch) the next rule.
In my case the prediction simply fails, as there is no next token.
So from there I understood I would need to hack ANTLR a bit, and I looked in the templates; in Java.stg I found these pieces of code:
/** A (...) subrule with multiple alternatives */
block(alts,decls,decision,enclosingBlockLevel,blockLevel,decisionNumber,maxK,maxAlt,description) ::= <<
// <fileName>:<description>
int alt<decisionNumber>=<maxAlt>;
<decls>
<#predecision()>
<decision>
<#postdecision()>
<#prebranch()>
switch (alt<decisionNumber>) {
<alts:{a | <altSwitchCase(i,a)>}>
}
<#postbranch()>
>>
and
/** A case in a switch that jumps to an alternative given the alternative
* number. A DFA predicts the alternative and then a simple switch
* does the jump to the code that actually matches that alternative.
*/
altSwitchCase(altNum,alt) ::= <<
case <altNum> :
<#prealt()>
<alt>
break;<\n>
>>
From there, all I thought I had to do was write my own function that would put all the altNum values on a stack before the call to predict. So I tried:
/*
Yout }>*/
And I was expecting to get nice little lists of token ids. But not at all; I get really different things.
So I'm really lost, and I would like to know: is there an easier way to provide this autocomplete feature without having to do it by hand, or what am I missing in modifying the template so that a custom stack records the different alternatives in a rule and I can read it when the exception is raised?
Thank you very much
Sorry to say this, but: don't use a parser directly for auto-completion. There are several reasons why this won't work as you expect without massive manual changes in the generated parser (which require intimate knowledge):
You often have incomplete input, and unless you have a very simple language you will often find yourself in an unexpected rule path because of the backtracking nature of the parser. For instance, if you have several alts in a rule where the first alt would match if only one additional token were available, the parser will not fail before it has tried all the other alts, giving you either completely different tokens or many more tokens than are really necessary.
The follow set is only available in an error case. However, there might be no error, or there might be an error at a completely different position than where the caret currently is (and where the user would expect an auto-completion box).
The follow set suffices only for a small part of the info you want to present (namely the keywords). Usually, however, you want to show, say, the possible tables in a database when you are in a FROM clause (assuming an SQL language here). You will not get this type of info from the parser, simply because the parser does not have such context information. What you do get is 'identifier', which can be anything from a table to a function name to a variable or similar.
My current approach for this type of problem is to tokenize the input and apply domain knowledge in a decision tree. That is, I walk the input tokens and decide based on knowledge I have from the grammar what would be the most important stuff to show.
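As a heavily simplified illustration of that idea (the keyword cases and table set are placeholders; a real implementation would reuse the grammar's lexer for tokenizing):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CompletionSketch {
    static List<String> suggest(List<String> tokens, Set<String> tableNames) {
        if (tokens.isEmpty()) {
            return List.of("SELECT", "CREATE", "INSERT", "UPDATE", "DELETE");
        }
        // Decide from the last significant token, using knowledge of the grammar
        String last = tokens.get(tokens.size() - 1).toUpperCase();
        switch (last) {
            case "FROM":
            case "JOIN":
                return new ArrayList<>(tableNames); // domain knowledge: a table name is expected here
            case "SELECT":
                return List.of("*", "DISTINCT");
            default:
                return List.of(); // extend with further grammar-derived cases
        }
    }
}

For input like "select a from" this returns the table names, which is exactly the context info a raw follow set cannot give you.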
Can anyone recommend a JSON library for Java which allows me to give it chunks of data as they come in, in a non-blocking fashion? I have read through A better Java JSON library and similar questions, and haven't found precisely what I'd like.
Essentially, what I'd like is a library which allows me to do something like the following:
String jsonString1 = "{ \"A broken";
String jsonString2 = " json object\" : true }";
JSONParser p = new JSONParser(...);
p.parse(jsonString1);
p.isComplete(); // returns false
p.parse(jsonString2);
p.isComplete(); // returns true
Object o = p.getResult();
Notice the actual key name ("A broken json object") is split between pieces.
The closest I've found is this async-json-library which does almost exactly what I'd like, except it cannot recover objects where actual strings or other data values are split between pieces.
There are a few blocking streaming/incremental JSON parsers (see Is there a streaming API for JSON?), but for async there is nothing yet that I am aware of.
The lib you refer to seems badly named; it does not appear to do real asynchronous processing, but merely allows one to parse a sequence of JSON documents (which multiple other libs allow as well).
If there were people who really wanted this, writing one would not be impossible; for XML there is Aalto, and handling JSON is quite a bit simpler than XML.
For what it is worth, there is actually this feature request to add non-blocking parsing mode for Jackson; but very few users have expressed interest in getting that done (via voting for the feature request).
EDIT (2016-01): While not async, Jackson's ObjectMapper also allows for convenient sub-tree-by-sub-tree binding of parts of the stream; see ObjectReader.readValues() (an ObjectReader created from an ObjectMapper), or the shortcut versions of ObjectMapper.readValues(...). Note the trailing s, which implies a stream of objects, not just a single one.
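A minimal sketch of that usage (the Point class and input are made up):

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadValuesDemo {
    static class Point { public int x, y; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Back-to-back JSON objects in one input
        MappingIterator<Point> it = mapper.readerFor(Point.class)
                .readValues("{\"x\":1,\"y\":2} {\"x\":3,\"y\":4}");
        while (it.hasNextValue()) {
            Point p = it.nextValue(); // binds one sub-tree at a time
            System.out.println(p.x + "," + p.y);
        }
    }
}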
Google Gson can incrementally parse JSON from an InputStream:
https://sites.google.com/site/gson/streaming
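For illustration, a minimal sketch with Gson's streaming JsonReader (reading from a String here instead of an InputStream for brevity; note it still blocks on the underlying Reader, so it is incremental rather than truly non-blocking):

import com.google.gson.stream.JsonReader;
import java.io.StringReader;

public class GsonStreamDemo {
    public static void main(String[] args) throws Exception {
        JsonReader reader = new JsonReader(new StringReader("{\"a broken json object\": true}"));
        reader.beginObject();
        while (reader.hasNext()) {
            String name = reader.nextName();     // tokens are pulled only as needed
            boolean value = reader.nextBoolean();
            System.out.println(name + " = " + value);
        }
        reader.endObject();
        reader.close();
    }
}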
I wrote such a parser: JsonParser.java. See examples of how to use it: JsonParserTest.java.