Read proto partially instead of fully parsing in Java

Say I define a proto file, for example:
option java_package = "proto.data";
message Data {
    repeated string strs = 1;
    repeated int32 ints = 2;
}
I receive this object's InputStream (or bytes) from the network. Then, normally, I do the parsing with Data.parseFrom(stream) or Data.parseFrom(bytes) to get the object.
This means I have to hold the full Data object in memory, while I only need to traverse all the string and integer values in it once. That is bad when the object is big.
What should I do about this issue?

Unfortunately, there is no way to parse just part of a protobuf. If you want to be sure that you've seen all of the strs or all of the ints, you have to parse the entire message, since the values could appear in any order or even interleaved.
If you only care about memory usage, and not CPU time, then you could in theory use a hand-written parser to parse the message and ignore the fields that you don't care about. You still have to do the work of parsing, but you can discard each value immediately rather than keeping it in memory. To do this, though, you'd need to study the Protobuf wire format and write your own parser. You can use Protobuf's CodedInputStream class, but a lot of work still needs to be done manually; the Protobuf library really isn't designed for this.
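For illustration, a minimal sketch of such a scan over the Data message above (assuming the repeated fields are not packed; a packed field would arrive as a single length-delimited block and need extra handling):

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.WireFormat;
import java.io.IOException;
import java.io.InputStream;

public final class DataScanner {
    // Visits every strs (field 1) and ints (field 2) value without
    // ever materializing a full Data object.
    public static void scan(InputStream in) throws IOException {
        CodedInputStream cis = CodedInputStream.newInstance(in);
        int tag;
        while ((tag = cis.readTag()) != 0) {
            switch (WireFormat.getTagFieldNumber(tag)) {
                case 1: // repeated string strs = 1;
                    String s = cis.readString();
                    // ... use s, then let it become garbage ...
                    break;
                case 2: // repeated int32 ints = 2;
                    int i = cis.readInt32();
                    // ... use i ...
                    break;
                default:
                    cis.skipField(tag); // discard fields we don't care about
            }
        }
    }
}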
If you are willing to consider using a different protocol framework, Cap'n Proto is extremely similar in design to Protobufs, but its signature feature is the ability to read only the part of the message you care about. Cap'n Proto incurs no overhead for the fields you don't examine, other than, obviously, the bandwidth and memory to receive the raw message bytes. If you are reading from a file and you use memory mapping (MappedByteBuffer in Java), then only the parts of the message you actually use will be read from disk.
(Disclosure: I am the author of most of Google Protobufs v2 (the version you are probably using) as well as Cap'n Proto.)

Hmm. It appears that it may already be implemented, but not adequately documented.
Have you tested it?
See for discussion:
https://groups.google.com/forum/#!topic/protobuf/7vTGDHe0ZyM
See also, sample test code in google's github:
https://github.com/google/protobuf/blob/4644f99d1af4250dec95339be6a13e149787ab33/java/src/test/java/com/google/protobuf/lazy_fields_lite.proto

Related

Thrift - converting from simple JSON

I created the following Thrift Object:
struct Student {
    1: string id;
    2: string firstName;
    3: string lastName;
}
Now I would like to read this object from JSON. According to this post this is possible
So I wrote the following code:
String json = "{\"id\":\"aaa\",\"firstName\":\"Danny\",\"lastName\":\"Lesnik\"}";
StudentThriftObject s = new StudentThriftObject();
byte[] jsonAsByte = json.getBytes("UTF-8");
TMemoryBuffer memBuffer = new TMemoryBuffer(jsonAsByte.length);
memBuffer.write(jsonAsByte);
TProtocol proto = new TJSONProtocol(memBuffer);
s.read(proto);
What I'm getting is the following exception:
Exception in thread "main" org.apache.thrift.protocol.TProtocolException: Unexpected character:i
at org.apache.thrift.protocol.TJSONProtocol.readJSONSyntaxChar(TJSONProtocol.java:322)
at org.apache.thrift.protocol.TJSONProtocol.readJSONInteger(TJSONProtocol.java:698)
at org.apache.thrift.protocol.TJSONProtocol.readFieldBegin(TJSONProtocol.java:837)
at com.vanilla.thrift.example.entities.StudentThriftObject$StudentThriftObjectStandardScheme.read(StudentThriftObject.java:486)
at com.vanilla.thrift.example.entities.StudentThriftObject$StudentThriftObjectStandardScheme.read(StudentThriftObject.java:479)
at com.vanilla.thrift.example.entities.StudentThriftObject.read(StudentThriftObject.java:413)
at com.vanilla.thrift.controller.Main.main(Main.java:24)
Am I missing something?
You are missing the fact that Thrift's JSON is different from yours. The field names are not written; instead, the assigned field ID numbers are written (and expected). Here's an example of Thrift's JSON protocol:
[1,"MyService",2,1,{"1":{"rec":{"1":{"str":"Error: Process() failed"}}}}]
In other words, Thrift is not intended to parse any kind of JSON. It supports a very specific JSON format as one of the possible transports.
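A quick way to see the exact JSON your generated class expects is to round-trip it through TJSONProtocol yourself (a sketch; the setter names assume Thrift's usual generated-bean style):

import org.apache.thrift.TDeserializer;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TJSONProtocol;

// Serialize the generated object with Thrift's own JSON protocol...
StudentThriftObject s = new StudentThriftObject();
s.setId("aaa");
s.setFirstName("Danny");
s.setLastName("Lesnik");

TSerializer serializer = new TSerializer(new TJSONProtocol.Factory());
String thriftJson = serializer.toString(s);
System.out.println(thriftJson); // field IDs, not names: {"1":{"str":"aaa"},...}

// ...and the same protocol reads it back without complaint.
TDeserializer deserializer = new TDeserializer(new TJSONProtocol.Factory());
StudentThriftObject s2 = new StudentThriftObject();
deserializer.deserialize(s2, thriftJson.getBytes("UTF-8"));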
However, depending on what the origin of your JSON data is, Thrift can possibly still help you out, if you are able to use it on both sides. In that case, write an IDL to describe the data structures, feed it to the Thrift compiler, and integrate both the generated code and the necessary parts of the library with your projects.
If the origin of the JSON lies outside of your reach, or if the JSON format cannot be changed for some reason, you need to find another way.
Format and semantics are different beasts
To some extent, the whole issue can be compared with XML: there is one general XML syntax, which tells us how we have to format things so any standard-conformant XML processor can read them.
But knowing the rules of XML is only half the answer, if we get a certain XML file from someone. Even if our XML parser can read the file successfully, because it is well-formed XML, we need to know the semantics of the data to really make use of what's within that file: Is it a customer data record? Or is it a SOAP envelope? Maybe a configuration file?
That is where DTDs or XML Schema come into play, they exist to describe the contents of the XML data. Without knowing the logical structure you are lost, because there are myriads of possible ways to express things in XML. And exactly the same is true with JSON, except that JSON schema descriptions are less commonly used.
"So you mean, we need just a way to tell Thrift how the JSON is organized?"
No, because the purpose and idea behind Thrift is to have a framework to de/serialize things and/or implement RPC servers and clients as efficiently as possible. It is not intended to have a general purpose file parser. Instead, Thrift reads and speaks only its own set of formats, which are plugged into the architecture as protocols: Thrift Binary, Thrift JSON, Thrift Compact, and a few more.
What you could do: in addition to what I said in the first section of my answer, you may consider writing your own custom Thrift protocol implementation to support your particular JSON format of choice. It is not that hard, and worth a try.

Best file format for standard string and integer data?

For my project, I need to store info about protocols (the data sent, most likely integers, and the order in which it's sent) and info that might be formatted something like this:
'ID' 'STRING' 'ADDITIONAL INTEGER DATA'
This info will be read by a Java program and stored in memory for processing, but I don't know what would be the most sensible format to store this data in?
EDIT: Here's some extra information:
1) I will be using this data in a game server.
2) Even though it is a game server, speed is not the primary concern, since this data will primarily be read and utilized during startup, which shouldn't occur very often.
3) I would like to keep memory consumption to a minimum, however.
4) The second data "example" will be used as a "dictionary" to look up names of specific in-game items, their stats and other integer data (and therefore might become very large, unlike the first data containing the protocol information, where each file will only describe small protocol bits, like a login protocol for instance).
5) And yes, I would like the data to be "human-editable".
EDIT 2: Here's the choices that I've made:
JSON - For the protocol descriptions
CSV - For the dictionaries
There are many factors that could come into play; here are some things that might help you figure this out:
1) Speed/memory usage: If the data needs to load very quickly or is very large, you'll probably want to consider rolling your own binary format.
2) Portability/compatibility: Balanced against #1 is the consideration that you might want to use the data elsewhere, with programs that won't read a custom binary format. In this case, your heavy hitters are probably going to be CSV, dBase, XML, and my personal favorite, JSON.
3) Simplicity: Delimited formats like CSV are easy to read, write, and edit by hand. Either use double-quoting with proper escaping or choose a delimiter that will not appear in the data (a loading sketch follows this list).
If you could post more info about your situation and how important these factors are, we might be able to guide you further.
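As a concrete illustration of point 3, a minimal sketch that loads a CSV dictionary, assuming a simple id,name,stat layout with no quoted or embedded commas (file layout and names invented for the example):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvDictionary {
    public static Map<Integer, String> load(String path) throws IOException {
        Map<Integer, String> names = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // id,name,stat -- split only on the first two commas
                String[] parts = line.split(",", 3);
                names.put(Integer.parseInt(parts[0]), parts[1]);
            }
        }
        return names;
    }
}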
How about XML, JSON or CSV?
I've written a similar protocol specification using XML. (Available here.)
I think it is a good match, since it captures the hierarchical nature of specifying messages / network packages / fields etc., and the order of fields is well defined.
I even wrote a code generator in XSLT that generated the message sending / receiving classes, with methods for each message type.
The only drawback as I see it is the verbosity. If the structure of your specification is really simple, I would suggest you use some simple home-brewed format and write a parser for it using a parser generator of your choice.
In addition to the formats suggested by others here (CSV, XML, JSON, etc.) you might consider storing the info in a Java properties file. (See the java.util.Properties class.) The code is already there for you, so all you have to figure out is the properties names (or name prefixes) you want to use.
The Properties class also provides for storing/loading properties in a simple XML format.
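A minimal sketch of that Properties approach; the file name and key scheme here are made up for illustration:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class ItemDictionary {
    public static void main(String[] args) throws IOException {
        // items.properties might contain lines like:
        //   item.1001.name=Bronze Sword
        //   item.1001.damage=7
        Properties props = new Properties();
        try (FileReader in = new FileReader("items.properties")) {
            props.load(in);
        }
        String name = props.getProperty("item.1001.name");
        int damage = Integer.parseInt(props.getProperty("item.1001.damage", "0"));
        System.out.println(name + " deals " + damage);
    }
}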

Serialization vs String format

Can anyone suggest which way is better: storing the object in serialized form, or reading the file content as a String and constructing the object?
Simply put:
1. I have a string (str,str1,str2,str3,....) like this in my file store. I can read this string from the file and construct a Java object (for example, building a LinkedList from the comma-separated values).
2. Or I can retrieve the LinkedList object from the file store using serialization.
Which is the better way: reading the serialized object from the file store, or constructing the object from the string?
I'm using the LinkedList here just as a sample; the real case may differ. From the string I may have to construct JSONObject/JSONArray formats. JSON is not a serialized object; I will handle making it serializable in some other way.
For a lengthy string, which one is best: serializing, or constructing the object from the string?
All of this relates to Java. Please advise.
The advantage of using a text format is that you can read and maintain the data in a simple text editor.
The advantage of using a binary format like Object Serialization is that you don't have to worry about separators, e.g. what happens if a string contains a ','.
Either approach you suggest is likely to be efficient enough (though I would use an ArrayList).
EDIT: If you have multiple strings, a better approach may be to put each one on a separate line. This way you don't need to worry about commas, and you can read/edit/version the file more easily.
List<String> list = FileUtils.readLines(file); // org.apache.commons.io.FileUtils
As you can see, you can read the entire file in one line.
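The write side of the same one-string-per-line idea, also via commons-io (a sketch; the file name is arbitrary):

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class StoreWriter {
    public static void main(String[] args) throws IOException {
        // One string per line: no separator worries, and the file diffs cleanly.
        List<String> list = Arrays.asList("str", "str1", "str2");
        FileUtils.writeLines(new File("store.txt"), list);
    }
}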
It depends on the complexity of the objects you have to store. If they are simple, or if you have the time to write your own Writer and Reader for your objects, I would always go with a custom text format, because it is the easiest to debug.
If you have a server understanding text commands, you could even connect with PuTTY or telnet and test your server!
But if you have to transport complex object structures that might even change during development, I would definitely go with some form of serialization. Please note that Java's default serialization is NOT a good candidate for a communication protocol, because of the large overhead it produces by defining classes over and over again. Better to go with JBossSerialization if you want something API-compatible with Java's built-in classes, or with JSON if you don't have to transport much binary data.
Well, if you care about speed, go for binary serialization. If you want to easily read the serialized objects, go for a string-based format (JSON, for example). And here is a performance test of various serializers:
http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
The advantages of Java ObjectStream serialization are:
You have minimal code to write, and minimal thinking to do when designing your serialization format.
Dealing with complicated (graph-structured) networks of objects is simple.
The end result should be type-safe and bug-free (assuming that you don't implement your own custom read/write object methods, etc)
The main disadvantage of Java ObjectStream serialization is that it is fragile in the face of changes to the classes that you've serialized. Dealing with this can be difficult. (By contrast, a hand-parsed text format is largely immune to this issue ... and problems are easier to fix.)
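For completeness, a minimal sketch of that ObjectStream round trip (file name arbitrary):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        List<String> original = new ArrayList<>(Arrays.asList("str", "str1", "str2"));

        // Write: ArrayList implements Serializable, so this just works.
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream("data.ser"))) {
            out.writeObject(original);
        }

        // Read it back; the cast is unchecked, as always with readObject().
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream("data.ser"))) {
            @SuppressWarnings("unchecked")
            List<String> restored = (List<String>) in.readObject();
            System.out.println(restored);
        }
    }
}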

Java: gathering byte offsets of XML tags using an XMLStreamReader

Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to an input stream that tracks how many bytes have been read (the CountingInputStream provided by Apache commons-io, for example), e.g.:
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");
while (xmlStreamReader.hasNext()) {
int eventCode = xmlStreamReader.next();
switch (eventCode) {
case XMLStreamReader.END_ELEMENT :
System.out.println(xmlStreamReader.getLocalName() + " #" + countingReader.getByteCount()) ;
}
}
xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?
You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is not reliable or precise. And it looks like it gives the endpoint of the tag, not the starting location.
I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.
You could use a wrapper input stream around the actual input stream, simply deferring to the wrapped stream for actual I/O operations but keeping an internal counting mechanism, with an accessor to retrieve the current offset.
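Something like this minimal sketch (hypothetical, and subject to the buffering caveat from the question: the parser may read ahead of the event you are handling):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Counts every byte handed out to the caller (here, the XML parser).
class OffsetTrackingInputStream extends FilterInputStream {
    private long offset = 0;

    OffsetTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) offset++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) offset += n;
        return n;
    }

    public long getOffset() {
        return offset;
    }
}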
Unfortunately, Aalto doesn't implement the LocationInfo interface.
The latest Java VTD-XML XimpleWare implementation, currently 2.11 (on SourceForge or on GitHub), provides some code maintaining a byte offset after each call to the getChar() method of its IReader implementations.
IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
Updating IReader with a getCharOffset() method, implementing it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and incrementing it upon each getChar() and skipChar() call of each IReader implementation, should give you the start of a solution.
I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.
switch (eventCode) {
case XMLStreamReader.END_ELEMENT :
System.out.println(xmlStreamReader.getLocalName() + " end#" + xmlStreamReader.getLocation().getCharacterOffset()) ;
}
This solution also requires that the actual start position of each end tag be calculated manually, but it has the advantage of not needing an external JAR file.
I was not able to track down some minor inconsistencies in the reported offsets (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.
Hope this helps!
I recently worked out a solution for a similar question, How to find character offsets in big XML files using java?. I think it provides a good solution based on an ANTLR-generated XML parser.
I just burned a long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably, I don't think this has gotten much easier in the 10 years since the OP posted this question.
TL;DR Use Woodstox and char offsets
The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. Unfortunately it seems that you have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset, then start extracting. There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.
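A minimal sketch of that read-through extraction, assuming UTF-8 and char offsets already in hand (class and method names made up):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CharRangeExtractor {
    // Returns the content between two char offsets; we must read through
    // the prefix because char offsets don't map directly to byte positions.
    public static String extract(String path, long startChar, long endChar) throws Exception {
        try (Reader r = new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8)) {
            long toSkip = startChar;
            while (toSkip > 0) {
                long n = r.skip(toSkip);
                if (n <= 0) break; // hit EOF before the offset
                toSkip -= n;
            }
            char[] out = new char[(int) (endChar - startChar)];
            int read = 0;
            while (read < out.length) {
                int n = r.read(out, read, out.length - read);
                if (n < 0) break;
                read += n;
            }
            return new String(out, 0, read);
        }
    }
}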
[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
[/edit]
There is another similar question on SO about this (but the accepted answer frightened and confused me), and some people commented about how this whole thing is a bad idea and why would you want to do it? XML is a transport mechanism, you should just import it to a DB and work with the data with more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, having the ability to quickly extract a specific set of items from a massive file and verify not only the contents, but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.

Best way to serialize a C structure to be deserialized by Java, etc

Currently, I'm saving and loading some data in C/C++ structs to files by using fread()/fwrite(). This works just fine when working within this one C app (I can recompile whenever the structure changes to update the sizeof() arguments to fread()/fwrite()), but how can I load this file in other programs without knowing in advance the sizeof()s of the C struct?
In particular, I have written this other Java app that visualizes the data contained in that C struct binary file, but I'd like a general solution as to how read that binary file. (Instead of me having to manually put in the sizeof()s in the Java app source whenever the C structure changes...)
I'm thinking of serializing to text or XML of some sort, but I'm not sure where to start with that (how to serialize in C, then how to deserialize in Java and possibly other languages in the future), and if that is advisable here where one member of the struct is a float array that can go upwards of ~50 MB in binary format (and I have hundreds of these data files to read and write).
The C structure is simple (no severe nesting or pointer references) and looks like the following:
struct MyStructure {
    char *title;
    int id;
    int param1;
    int param2;
    float *data;
};
The parts that are liable to change the most are the param integers.
What are my options here?
If you have control of both code bases, you should consider using Protocol Buffers.
You could use Java's DataInput/DataOutput format that is well described in the javadoc.
Take a look at JSON. http://www.json.org. If you go to or from JavaScript, it's a big help. I don't know how good the Java support is, though.
If your structure isn't going to change (much), and your data is in a pretty consistent format, you could just write the values out to a CSV file, or some other plain format.
This can be easily read in Java, and you won't have to worry about serializing to XML. Sometimes going simple is the easiest route.
Take a look at Resin's Hessian/Burlap services. You may not want the whole service, just part of the API and an understanding of the wire protocol.
If:
your data is essentially a big array of floats;
you are able to test the writing/reading procedure in all the likely environments (=combinations of machines/OS/C compiler) that each end will be running on;
performance is important.
then I would probably just keep writing the data from C in the way that you are doing (maybe with a slight amendment -- see below) and turn the problem into how you read that data from Java.
To read the data back in from Java, use a ByteBuffer. Essentially, pull in slabs of bytes from your data, wrap a ByteBuffer around them, and then use the get(), getFloat(), getInt() etc methods. The NIO package also has "wrapper" buffers, e.g. FloatBuffer, which from tests I've done appear to be about 20% faster for reading large numbers of the same type.
Now, one thing you'll have to be careful about is byte ordering. From Java, you need to call order(ByteOrder.LITTLE_ENDIAN) or order(ByteOrder.BIG_ENDIAN) on your buffer before you start reading the data. To decide which to use, I'd recommend that at the very start of the stream, you write some known 16-bit value (e.g. 255 = 0x00ff). Then from Java, pull out these two bytes and check the order (0xff, 0x00 or 0x00, 0xff) to see whether you have little or big endian.
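Putting those two paragraphs together, a sketch of the Java reading side; the file name and exact layout (16-bit marker, then raw floats) are assumptions for illustration:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FloatFileReader {
    public static float[] read(String path) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) { }
            buf.flip();

            // The C side wrote the 16-bit value 255 (0x00FF) first; if we
            // read it as big-endian and get 0x00FF, the file is big-endian.
            ByteOrder order = (buf.getShort(0) == 0x00FF)
                    ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
            buf.order(order);
            buf.position(2); // skip the marker

            float[] data = new float[buf.remaining() / 4];
            buf.asFloatBuffer().get(data);
            return data;
        }
    }
}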
One possibility is creating small XML files with title, ID, params, etc, and then a reference (by filename) to where the float data is contained. Assuming there's nothing special about the float data, and that Java and C are using the same floating point format, you can read that file in with readFloat() of a DataInputStream.
I like the CSV and "Protocol Buffers" answers (though, at a glance, the protocol buffer thing might be very similar to YAML for all I know).
If you need tightly packed records for high volume data, you might consider this:
Create a textual file header describing the current file structure: record sizes (types?) and field names / sizes. Read and parse the header, then use low-level binary I/O operations to load up each record's fields, er, object's properties, or whatever we are calling it this year.
This gives you the ability to change the structure a bit and have it be self-describing, while still allowing you to pack a high volume into a smaller space than XML would allow.
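A rough sketch of that self-describing layout, writing side only (the field set and file name are invented for the example):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RecordWriter {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("records.dat")))) {
            // One line of text describing the packed record layout...
            out.writeBytes("id:int32,param1:int32,param2:int32\n");
            // ...followed by tightly packed binary records.
            out.writeInt(42);  // id
            out.writeInt(7);   // param1
            out.writeInt(99);  // param2
        }
    }
}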
TMTOWTDI, I guess.
