I am currently writing Java code to check the number of messages in a protocol buffer file (*.pb).
I would like to know if there is metadata or a header that contains the number of messages in the protobuf file.
I am looping through the whole file, and I think there should be a better way to do it.
while ((m = message.getParserForType().parseDelimitedFrom(input)) != null) {
    recordCount++;
}
Thanks
David
There is no header or anything that will tell you the number of messages in the file. That format just consists of a length prefix in varint format, followed by a message payload, repeated for as many messages as you have.
However, you could in principle count the number of messages in a much more efficient way. If you just want to know how many there are, you could read the length prefixes and skip over the actual message payloads without parsing them.
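A minimal sketch of that idea, using only plain java.io rather than the protobuf runtime; the hand-rolled varint decoder below mirrors the base-128 length prefix that writeDelimitedTo produces, and the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;

public class DelimitedCounter {

    // Decodes one base-128 varint; returns -1 on a clean end of stream.
    static int readRawVarint32(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return -1;              // no more messages
        }
        int result = 0;
        int shift = 0;
        while (true) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {  // continuation bit clear: varint complete
                return result;
            }
            shift += 7;
            b = in.read();
            if (b == -1) {
                throw new IOException("truncated varint");
            }
        }
    }

    // Counts length-delimited messages by skipping each payload unparsed.
    public static int countMessages(InputStream in) throws IOException {
        int count = 0;
        int len;
        while ((len = readRawVarint32(in)) != -1) {
            long skipped = 0;
            while (skipped < len) {
                long n = in.skip(len - skipped);
                if (n <= 0) {
                    throw new IOException("truncated message payload");
                }
                skipped += n;
            }
            count++;
        }
        return count;
    }
}
```

Because nothing is parsed, this only touches one varint per message and skips the rest, which is about as cheap as a count over this format can get.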
Related
I'm new to Java programming, and I ran into this problem:
I'm creating a program that reads a .csv file, converts its lines into objects and then manipulates these objects.
Being more specific, the application reads every line giving it an index and also reads certain values from those lines and stores them in TRIE trees.
The application then can read indexes from the values stored in the trees and then retrieve the full information of the corresponding line.
My problem is that, even though I've been researching for the last couple of days, I don't know how to write these structures to binary files, nor how to read them back.
I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.
For the tree writing, I was looking for something like this (in C)
fwrite(tree, sizeof(struct TrieTree), 1, file)
For the "binary indexed file", I was thinking of writing objects like the TRIEs, and maybe reading each object until I've read enough to reach the corresponding index, but this probably wouldn't be very efficient.
Recapitulating, I need help in writing and reading objects in binary files and solutions on how to create an indexed file.
I think you are, for starters, best off trying to do this with serialization.
Here is just one example from stackoverflow: What is object serialization?
(I think copy &amp; paste of the code does not make sense here; please follow the link to read it.)
Admittedly this does not yet solve your index creation problem.
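To give a flavour of it without copying the linked code, here is a minimal sketch; the Record class is a made-up stand-in for one of your parsed CSV line objects:

```java
import java.io.*;

public class SerializationDemo {

    // Hypothetical stand-in for one parsed CSV line.
    static class Record implements Serializable {
        private static final long serialVersionUID = 1L;
        final int index;
        final String line;
        Record(int index, String line) { this.index = index; this.line = line; }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        File file = File.createTempFile("records", ".bin");

        // Write the object graph in Java's binary serialization format.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new Record(42, "a,b,c"));
        }

        // Read it back; the class definition must be on the classpath.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            Record r = (Record) in.readObject();
            System.out.println(r.index + " " + r.line);  // prints "42 a,b,c"
        }
        file.delete();
    }
}
```

Note that ObjectOutputStream writes whole object graphs; it does not by itself give you random access to the Nth record, which is the index problem mentioned below.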
Here is an alternative to Java native serialization, Google Protocol Buffers.
I am going to write direct quotes from documentation mostly in this answer, so be sure to follow the link at the end of answer if you are interested into more details.
What is it:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
In other words, you can serialize your structures in Java and deserialize them in .NET, Python, etc. This is something you don't get with Java's native serialization.
Performance:
This may vary according to use case, but in principle GPB should be faster, as it's built with performance and interoperability in mind.
Here is stack overflow link discussing Java native vs GPB:
High performance serialization: Java vs Google Protocol Buffers vs ...?
How does it work:
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes.
You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
Read all about it here:
https://developers.google.com/protocol-buffers/
You could look at GPB as an alternative format to XSD describing XML structures, just more compact and with faster serialization.
In Java, I am polling a WebSphere MQ message queue, expecting a message of STRING format that is composed entirely of XML. Part of this XML will contain bytes to a file attachment (any format: pdf, image, etc) which will then be converted to a blob for storage in an Oracle Db and later retrieval.
The issue I am having is that the known size of example files being sent over end up in my Db with a different size. I am not adding anything to the bytes (as far as I know), and the size appears to be larger directly after I get the message. I cannot determine if I am somehow adding information at retrieve, conversion from bytes -> String, or if this is happening on the front end when the sender populates the message.
My code at retrieve of the message:
inboundmsg = new MQMessage();
inboundmsg = getMQMsg(FrontIncomingQueue, gmo);
strLen = inboundmsg.getMessageLength();
strData = new byte[strLen];
ibm_id = inboundmsg.messageId;
inboundmsg.readFully(strData);
inboundmsgContents = new String(strData);
I see a file known to have size 21K go to 28K. A coworker has suggested that charset/encoding may be the issue. I do not specify a charset in the constructor call to String above, nor in any of the calls to getBytes when converting back from a string (for other unrelated uses). My default charset is ISO-8859-1. When speaking with the vendor who is initiating the message transfer, I asked her what charset she is using. Her reply:
"I am using the File.WriteAllBytes method in C# - I pass it the path of my file and it writes it to a byte[]. I could not find any documentation on the MSDN about what encoding the function uses. The method creates a byte array and from what I have read online this morning there is no encoding, its just a sequence of 8bit unsigned binary data with no encoding."
Another coworker has suggested that perhaps the MQ charset is the culprit, but my reading of the documentation suggests that MQ charset only affects the behavior of readString, readLine, & writeString.
If I circumvent MQ totally, and populate a byte array using a File Input Stream and a local file, the file size is preserved all the way to Db store, so this definitely appears to be happening at or during message transfer.
The problem is evident in the wording of the question. You describe a payload that contains arbitrary binary data, and at the same time you are trying to process it as a string. These two things are mutually exclusive.
This appears to be complicated by the vendor not providing valid XML. For example, consider the attachment:
<PdfBytes>iVBORw0KGgoAAAANS … AAAAASUVORK5CYII=</PdfBytes>
If the attachment legitimately contains any XML special character such as < or > then the result is invalid XML. If it contains null bytes, some parsers assume they have reached the end of the text and stop parsing there. That is why you normally see any attachment in XML either converted to Base64 for transport or else converted to hexadecimal.
The vendor describes writing raw binary data which suggests that what you are receiving contains non-string characters and therefore should not be sent as string data. If she had described some sort of conversion that would make the attachment XML compliant then string would be appropriate.
Interestingly, a Base64 encoding results in a payload that is 1.33 times larger than the original. Coincidence that 21 KB × 1.33 ≈ 28 KB? One would think that what is received is actually the binary payload in Base64 format. That actually would be parseable as a string and accounts for the difference in file sizes. But it isn't at all what the vendor described doing. She said she's writing "8bit unsigned binary data with no encoding" and not Base64.
So we expect it to fail but not necessarily to result in a larger payload. Consider that WebSphere MQ receiving a message in String format will attempt to convert it. If the CCSID of the message differs from that requested on the GET then MQ will attempt a conversion. If the inbound CCSID is UTF-16 or any double-byte character set, certain characters will be expanded from one to two bytes - assuming the conversion doesn't hit invalid binary characters that cause it to fail.
If the two CCSIDs are the same then no conversion is attempted in the MQ classes but there is still an issue in that something has to parse an XML payload that is by definition not valid and therefore subject to unexpected results. If it happens that the binary payload does not contain any XML special characters and the parser doesn't choke on any embedded null bytes, then the parser is going to rather heroic lengths to forgive the non-compliant payload. If it gets to the </PdfBytes> tag without choking, it may assume that the payload is valid and convert everything between the <PdfBytes>...</PdfBytes> tags itself. Presumably to Base64.
All of this is conjecture, of course. But in a situation where the payload is unambiguously not string data any attempt to parse it as string data will either fail outright or produce unexpected and potentially bizarre results. You are actually unfortunate that it doesn't fail outright because now there's an expectation that the problem is on your end when it clearly appears to be the vendor's fault.
Assuming that the content of the payload remains unchanged, the vendor should be sending bytes messages and you should be receiving them as bytes. That would at least fix the problems MQ is having reconciling the expected format with the actual received format, but it would still be invalid XML. If it works that the vendor sends binary data in a message set to type String with you processing it as bytes then count your blessings and use it that way but don't count on it being reliable. Eventually you'll get a payload with an embedded XML special character and then you will have a very bad day.
Ideally, the vendor should know better than to send binary data in an XML payload without converting it first to string and it is up to them to fix it so that it is compliant with the XML spec and reliable.
Please see this MSDN page: XML, SOAP, and Binary Data
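To illustrate the size arithmetic above, here is a small sketch with java.util.Base64; the random bytes are just a stand-in for the real attachment, but the 4/3 expansion is exactly what would turn ~21 KB of raw PDF bytes into ~28 KB of text inside &lt;PdfBytes&gt;:

```java
import java.util.Base64;
import java.util.Random;

public class AttachmentDemo {
    public static void main(String[] args) {
        byte[] pdfBytes = new byte[21_000];          // stand-in for the real attachment
        new Random(42).nextBytes(pdfBytes);

        // Base64 maps every 3 raw bytes to 4 ASCII characters: ~33% growth.
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        System.out.println(encoded.length());        // 28000

        // The receiver must decode before storing the blob,
        // or the stored size will not match the original file.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(decoded.length == pdfBytes.length);  // true
    }
}
```

If the sizes in your Db consistently show that 4/3 ratio, that is strong evidence the payload you receive is already Base64 text, whatever the vendor believes she is sending.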
I have a multi-threaded client-server application that uses Vector<String> as a queue of messages to send.
I need, however, to send a file using this application. In C++ I would not really worry, but in Java I'm a little confused when converting anything to string.
Java has 2-byte characters. When you look at a Java string in hex, it usually looks like:
00XX 00XX 00XX 00XX
unless some non-ASCII Unicode characters are present.
Java also uses big-endian byte order.
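That 00XX pattern can be checked directly by encoding a string as UTF-16BE, the big-endian two-byte encoding of Java's chars:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharWidthDemo {
    public static void main(String[] args) {
        // Each char encodes to two bytes, high byte first (big-endian),
        // which gives the 00XX pattern for plain ASCII text.
        byte[] bytes = "AB".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(bytes));  // [0, 65, 0, 66]
    }
}
```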
These facts make me unsure, whether - and eventually how - to add the file into the queue. Preferred format of the file would be:
-- Headers --
2 bytes Size of the block (excluding header, which means first four bytes)
2 bytes Data type (text message/file)
-- End of headers --
2 bytes Internal file ID (to avoid referring by filenames)
2 bytes Length of filename
X bytes Filename
X bytes Data
You can see I'm already using 2 bytes for all numbers, to avoid the horrible operations required when packing two numbers into one char.
But I have really no idea how to add the file data correctly. For numbers, I assume this would do:
StringBuilder packetData = new StringBuilder();
packetData.append((char) packetSize);
packetData.append((char) PacketType.BINARY.ordinal()); //Just convert enum constant to number
But file is really a problem. If I have also described anything wrongly regarding the Java data types please correct me - I'm a beginner.
Does it have to send only Strings? I think if it does then you really need to encode it using base64 or similar. The best approach overall would probably be to send it as raw bytes. Depending on how difficult it would be to refactor your code to support byte arrays instead of just Strings, that may be worth doing.
To answer your String question I just saw pop up in the comments, there's a getBytes method on a String.
For the socket question, see:
Java sending and receiving file (byte[]) over sockets
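If you do refactor toward raw bytes, the frame layout from the question could be built with DataOutputStream, whose writeShort is big-endian as the question expects. This is only a sketch under those assumptions; the constants and the method name are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FrameWriter {

    // Hypothetical packet types matching the question's layout.
    static final int TYPE_TEXT = 0;
    static final int TYPE_FILE = 1;

    // Builds one file frame: 2-byte size (excluding the 4-byte header),
    // 2-byte type, 2-byte file id, 2-byte filename length, filename, data.
    static byte[] buildFileFrame(int fileId, String filename, byte[] data) throws IOException {
        byte[] nameBytes = filename.getBytes(StandardCharsets.UTF_8);
        int bodySize = 2 + 2 + nameBytes.length + data.length;  // everything after the header

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(bodySize);            // big-endian, 2 bytes
        out.writeShort(TYPE_FILE);
        out.writeShort(fileId);
        out.writeShort(nameBytes.length);
        out.write(nameBytes);
        out.write(data);                     // raw file bytes, no string conversion
        return buf.toByteArray();
    }
}
```

Keeping the payload as byte[] end to end is what avoids the charset problems that come from stuffing file bytes into a String.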
We have a Java code talking to external system over TCP connections with xml messages encoded in UTF-8.
The messages received begin with '?'. So the XML received is
?<begin>message</begin>
There is a real doubt if the first character is indeed '?'. At the moment, we cannot ask the external system if/what.
The code snippet for reading the stream is as below.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(inputStream, Charset.forName("UTF-8")));
int readByte = reader.read();
if (readByte <= 0) {
    inputStream.close();
}
builder.append((char) readByte);
We are currently trying to log the raw bytes with int readByte = inputStream.read(). The logs will take a few days to be received.
In the meantime, I was wondering how we could ascertain at our end whether it was truly a '?' and not a decoding issue.
I suspect strongly that you have a byte order mark (BOM) at the beginning of your doc. That won't render as a valid character, and consequently could appear as a question mark. Can you dump the raw bytes out and check for that sequence?
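One hedged way to check for that at your end, assuming a UTF-8 BOM (bytes EF BB BF): peek at the first bytes before handing the stream to the reader. The helper name is made up, and this simple version assumes the first read returns all three bytes, which holds for an in-memory stream but may need a loop on a real socket:

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomCheck {

    // Peeks at the first three bytes and reports whether they are the UTF-8 BOM.
    // The bytes are pushed back, so downstream parsing sees the stream unchanged.
    static boolean startsWithUtf8Bom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (n > 0) {
            in.unread(head, 0, n);   // put the bytes back for the real parser
        }
        return bom;
    }
}
```

Note that the PushbackInputStream must be constructed with a pushback buffer of at least 3 bytes, e.g. new PushbackInputStream(inputStream, 3).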
Your question seems to boil down to this:
Can we ascertain the real value of the first byte of the message without actually looking at it.
The answer is "No, you can't". (Obviously!)
...
However, if you could intercept the TCP/IP traffic from the external system with a packet sniffer (aka traffic monitoring tool), then dumping the first byte or bytes of the message would be simple ... requiring no code changes.
Is logging the int returned by inputStream.read() the correct way to analyse the bytes received? Or does the word length of the OS or other environment variables come into the picture?
The InputStream.read() method returns either a single (unsigned) byte of data (in the range 0 to 255 inclusive) or -1 to indicate "end of stream". It is not sensitive to the "word length" or anything else.
In short, provided you treat the results appropriately, calling read() should give you the data you need to see what the bytes in the stream really are.
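A small sketch of that: dumping each read() result as two hex digits, which is enough to spot a BOM or any other unexpected leading bytes. Since read() already returns an unsigned value in 0..255, no masking is needed; the helper name is made up:

```java
import java.io.IOException;
import java.io.InputStream;

public class ByteDump {

    // Renders every byte of the stream as two hex digits, space-separated.
    static String dumpHex(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {      // b is 0..255, or -1 at end of stream
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

A UTF-8 BOM would show up at the front of this dump as "EF BB BF".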
I'm having some trouble to parse a TCP packet from a socket...
In my protocol, my messages are like this:
'A''B''C''D''E'.........0x2300
'A''B''C''D''E' --> start message pattern
0x2300 --> two bytes end message
But due to the Nagle's algorithm, sometimes my messages are concatenated like:
'A''B''C''D''E'.........0x2300'A''B''C''D''E'.........0x2300'A''B''C''D''E'.........0x2300
I already tried to setNoDelay() to true but the problem persists.
I have the message in a byte[].
How could I split my messages to be parsed individually?
PS: For now I am able to get the first message but the others are lost...
Just loop through your received data and check for end-markers. When you find one, set a start index to the next package and continue searching. Something like this:
int packageStart = 0;
for (int i = 0; i < data.length - 1; i++) {
    if (data[i] == 0x23 && data[i + 1] == 0x00) {
        // Found end of package
        i++;                                   // move onto the second marker byte
        processPackage(data, packageStart, i); // bytes packageStart..i are one package
        packageStart = i + 1;                  // next package starts after the marker
    }
}
// At this point: from packageStart till data.length are unprocessed bytes...
As noted, there might be some left over data (if data did not end with the end-marker). You might want to keep it, so you can prepend it to the next batch of received data. And thus preventing data-loss due to chopped up TCP/IP packages.
You have to think of it as parsing a continuous stream of bytes. Your code needs to identify the start and end of a message.
Due to the way packets get sent, you may have a complete message, multiple messages, a partial message, etc. Your code needs to identify when a message has begun and keep reading until it has found the end of a message or, in some instances, until you've read more bytes than your max message size and need to resync.
I've seen some comm managers drop and reestablish the connection (start over) and others throw away data until they can get back in sync. Then you get into the fun of whether you need guaranteed delivery and retransmission.
The best protocols are the simple ones. Create a message header which contains say an SOH byte, a two byte message length (or whatever is appropriate), a 2 byte message type and 1 byte message subtype. You can also end the message with any number of bytes. Look at an ASCII chart, there's a number of Hex bytes 00-1F that are pretty standard since the terminal days.
No point in reinventing the wheel here. Makes it easier, because you know how long this message should be instead of looking for patterns in the data.
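A sketch of the reading side of such a header-based protocol; the SOH value and method name are assumptions, and readFully does the buffering for you, reassembling a payload even when TCP delivers it split across packets:

```java
import java.io.DataInputStream;
import java.io.IOException;

public class FramedReader {

    static final int SOH = 0x01;    // start-of-header byte, per the ASCII control set

    // Reads one framed message: SOH, 2-byte payload length, then the payload.
    // Returns null on end of stream.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int b;
        do {                        // resync: discard bytes until SOH is seen
            b = in.read();
            if (b == -1) {
                return null;
            }
        } while (b != SOH);
        int length = in.readUnsignedShort();  // big-endian, 0..65535
        byte[] payload = new byte[length];
        in.readFully(payload);                // blocks until the whole payload arrives
        return payload;
    }
}
```

Because the length arrives before the data, there is no scanning for marker patterns that might also occur inside the payload, which is the main weakness of the 0x2300 end-marker approach.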
It sounds like you need to treat it like a byte stream and buffer the packets until you see your end-of-message marker 0x2300.