I want to stream protobuf messages to a file.
I have a protobuf message:
message car {
... // some fields
}
My Java code would create multiple objects of this car message.
How should I stream these messages to a file?
As far as I know, there are two ways of going about it:
1. Have another message like cars
message cars {
repeated car c = 1;
}
and make the Java code create a single cars object and then stream it to a file.
2. Just stream the car messages to a single file, one after the other, using the writeDelimitedTo function.
I am wondering which is the more efficient way to go about streaming using protobuf.
When should I use pattern 1 and when should I use pattern 2?
This is what I got from https://developers.google.com/protocol-buffers/docs/techniques#large-data, but I am not clear on what it is trying to say.
Large Data Sets
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages
within a large data set. Usually, large data sets are really just a
collection of small pieces, where each small piece may be a structured
piece of data. Even though Protocol Buffers cannot handle the entire
set at once, using Protocol Buffers to encode each piece greatly
simplifies your problem: now all you need is to handle a set of byte
strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data
sets because different situations call for different solutions.
Sometimes a simple list of records will do while other times you may
want something more like a database. Each solution should be developed
as a separate library, so that only those who need it need to pay the
costs.
Have a look at the previous question. Any difference in size and time will be minimal
(option 1 possibly faster, option 2 smaller).
My advice would be:
Option 2 for big files. You process message by message.
Option 1 if multiple languages are needed. In the past, delimited I/O was not supported in all languages, though this seems to be changing.
Otherwise, personal preference.
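To illustrate option 2, a minimal sketch using the generated Java class (assumed here to be called Car; package, builder, and error handling details omitted) could look like this:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.List;

// Sketch only: assumes the .proto above compiles to a generated class named Car.
public class CarStreamExample {

    // Option 2: append each message to the file with its own length prefix.
    static void writeCars(List<Car> cars, String path) throws Exception {
        try (FileOutputStream out = new FileOutputStream(path)) {
            for (Car car : cars) {
                car.writeDelimitedTo(out); // length-prefixed, one message at a time
            }
        }
    }

    // Read them back one by one; parseDelimitedFrom returns null at end of stream.
    static void readCars(String path) throws Exception {
        try (FileInputStream in = new FileInputStream(path)) {
            Car car;
            while ((car = Car.parseDelimitedFrom(in)) != null) {
                // process each car here
            }
        }
    }
}

Because each message carries its own length prefix, you never have to hold the whole collection in memory, which is why option 2 tends to suit big files.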
Related
I am receiving simple data across a network, name/value pairs in text. I need to write a process to take those pairs, chop them up and send them to a different area - via an event bus - to be processed. Each name represents a different data "type" or transformational operation that I will have to do on the value part of the string.
On the surface this is pretty straightforward, but I am looking for a simple, "correct" architecture to do this such that I can transform 300 different kinds of values. Here is a simple example:
input:
9,19.0,31,0.00,34,0.379579,37,1.319952,55,0.03,66,0.00,81,1.558965,82,1.578690,92,0.03,99,24.57,100,-8144.29,101,0.13,105,2.999,106,148.3,108,-155.8,111,4.263,112,155.0,113,-170.0,167,10.60,205,1.558965,231,-0.040,232,-93.8,237,75.0,238,0.100
Really means:
9,19.0
31,0.00
34,0.379579
etc...
In the case of the "34" event, I have to convert that number to a different value for display to the user. So each event type can, potentially, have a different conversion process. In reality, I think that there might be groups of conversion processes (like 12) across all 300 event types.
Here are my questions.
How do I handle all of the conversion types? Is this like a sort algorithm, where you can inject a comparator into the parser?
How can I be efficient in terms of object creation and GC? These pairs are coming off a network and I do not want to just churn out GC-able objects (I am on Android).
Should I map the various types to Events that go on a bus, and only have consumers sign up for the events they are interested in? Or do I send a generic event to every consumer and have them decide each time? (I am thinking the former, but also dreading the boilerplate.)
Any general thoughts would be much appreciated.
I ended up letting the compiler worry about it for me. I have a HUGE switch statement with many, many cases, one for each specific type of conversion.
I have also combed through the code and pulled out object declarations for things that I will be reusing. I make sure that they are declared above the function that wants to use them, then I set them early in the function to whatever value they need to be.
I have a single event type, for the network data I am collecting. I let each receiver decide if it cares about that event. This just eliminated a ton of boilerplate code that would have to be written to have a factory create each different event type.
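A stripped-down sketch of that approach (the type codes and conversion formulas below are invented placeholders, not the real 300 cases):

// Minimal sketch of the switch-based dispatch; the codes and conversions
// shown are placeholders for illustration only.
public class ValueConverter {

    // Reused field to avoid allocating a new object per call (the GC-friendly trick above).
    private double converted;

    public double convert(int type, double raw) {
        switch (type) {
            case 9:   converted = raw;          break; // pass-through
            case 34:  converted = raw * 100.0;  break; // e.g. scale for display
            case 100: converted = raw / 1000.0; break; // e.g. unit change
            // ... one case per event type, ~300 in total
            default:  converted = raw;          break;
        }
        return converted;
    }
}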
Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (what columns the suppliers send us). This data then needs to be converted to our standard transactions (the standard is constant and defined by us).
The challenge is that we have no control over what columns suppliers send us in their files, while the destination standard is constant. I have been asked to develop a framework through which end users can define their own data transformation rules through a UI (say, field A in the destination transaction equals columnX + columnY, or the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that users should be able to add all these supplier files and convert all their data to our company's format from the front-end UI with minimal code change. Please suggest some frameworks for this (preferably Java-based).
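To make it concrete, I imagine each user-defined rule ending up as something like this behind the UI (the interface and names here are purely illustrative):

import java.util.Map;

// Illustrative only: one user-defined transformation rule maps a row from a
// supplier file (column name -> raw value) to one destination field.
interface TransformationRule {
    String destinationField();
    String apply(Map<String, String> supplierRow);
}

// Example rules of the kind described above:
// field A = columnX + columnY (concatenation here), field B = first 3 characters of columnZ.
class ExampleRules {
    static final TransformationRule FIELD_A = new TransformationRule() {
        public String destinationField() { return "A"; }
        public String apply(Map<String, String> row) {
            return row.get("columnX") + row.get("columnY");
        }
    };
    static final TransformationRule FIELD_B = new TransformationRule() {
        public String destinationField() { return "B"; }
        public String apply(Map<String, String> row) {
            return row.get("columnZ").substring(0, 3);
        }
    };
}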
Worked in a similar field before. Not sure if I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but how about formatting problems in dates, monetary values, and the like? You'd probably need to manually check their creations anyway, or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping at you when you need them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just Google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
What is the best procedure for reading real-time data from a socket and plotting it in a graph? Graphing isn't the issue for me; my question is more related to storing the stream of data, which gets updated via a port. Java needs to be able to read and parse it into a queue, array, or hash map and thus plot the real-time graph.
The complication here is that the app can receive a large amount of data every second. Java needs to parse the data, clean out old records, and add new records continuously, all with high performance. Is there any real-time library I should use, or should I program it from scratch, e.g. create a queue or array to store the data and delete data after the queue/array reaches a certain size?
I would humbly suggest considering a LinkedBlockingQueue (use a thread/executor to read/remove the data from the queue).
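A bare-bones sketch of that setup (the capacity, the String record type, and the processing step are just placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a bounded LinkedBlockingQueue sits between the socket-reading side
// and the plotting/consumer side. Capacity and types are illustrative.
public class SampleQueueExample {
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final ExecutorService consumer = Executors.newSingleThreadExecutor();

    // Called by the socket-reading thread for each parsed record.
    public void offerSample(String record) {
        // offer() drops the record if the queue is full instead of blocking the reader.
        queue.offer(record);
    }

    public void startConsumer() {
        consumer.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String record = queue.take(); // blocks until data is available
                    // parse, trim old records, update the plot here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }
}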
If you do not wish to use these data structures but prefer a messaging server to take on that responsibility for some reason, ActiveMQ/ZMQ/RabbitMQ etc would help out (with either queues or topics depending on your use case - for instance in AMQ, use queue for one consumer, topic for multiple consumers unless you wish to use browsers with durable queues).
If it suits your use case, you could also look into actor libraries such as kilim or akka.
I'm trying to transfer a stream of strings from my C++ program to my Java program in an efficient manner, but I'm not sure how to do this. Can anyone post links or explain the basic idea of how to implement this?
I was thinking of writing my data to a text file and then reading the text file from my Java program, but I'm not sure that this will be fast enough. I need a single string to be transferred in 16 ms, so that we can get around 60 data strings across per second.
A text file with upwards of 60 strings' worth of content can easily be written and read back in merely a few milliseconds.
Some alternatives, if you find that you are running into timing troubles anyway:
Use socket programming. http://beej.us/guide/bgnet/output/html/multipage/index.html.
Sockets should easily be fast enough.
There are other alternatives, such as the TIBCO messaging service, which will be an order of magnitude faster than what you need: http://www.tibco.com/
Another alternative would be to use a MySQL table to pass the data, and potentially just set an environment variable to indicate that the table should be queried for the most recent entries.
Or I suppose you could just use an environment variable itself to convey all of the info -- 60 strings isn't very much.
The first two options are the more respectable solutions though.
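For the socket route, the Java side could be as simple as the sketch below (the port number and newline-delimited framing are assumptions, not from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch: Java receiver that accepts one connection from the C++ sender and
// reads newline-delimited strings. Port and framing are chosen for illustration.
public class StringReceiver {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5000);
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // handle one string; at ~60 strings/second this loop is nowhere near its limit
            }
        }
    }
}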
Serialization:
protobuf or s11n
Pretty much any way you do this will be fast enough. A file is likely to be the slowest, and even that could be around 10 ms total. A socket will be similar if you have to create a new connection as well (it's the connect, not the data, which will take most of the time). Using a socket has the advantage that the sender and receiver know how much data has been produced. If you use a file instead, you need another way to say "the file is complete now, you should read it", e.g. a socket ;)
If the C++ and Java are in the same process, you can use a ByteBuffer to wrap a C array and import it into Java in around a microsecond.
I did some quick searching on the site and couldn't seem to find the answer I was looking for, so: what are some best practices for passing large XML files across a network? My thought is to stream chunks across the network in manageable segments, but I am looking for other approaches and best practices. I realize that "large" is a relative term, so I will let you choose an arbitrary value to be considered large.
In case there is any confusion, the question is: "What are some best practices for sending large XML files across networks?"
Edit:
I am seeing a lot of talk about compression. Is there a particular compression algorithm that could be used, and what about decompressing said files? I do not have much desire to roll my own when I am aware there are proven algorithms out there. Also, I appreciate the responses so far.
Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control over both the client and server sides, is WBXML (WAP Binary XML Spec).
This spec defines how to convert the XML into a binary format which is not only compact, but also easy to parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside of this spec is that an application token table must exist on both sides: a statically defined code table holding binary values for all possible tags and attributes in the application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.
For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:
This feature provides a means to synchronize an object whose size exceeds that which can be transmitted within one message (e.g. the maximum message size – declared in the MaxMsgSize element – that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and by sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will be sent. Every subsequent chunk is sent with a MoreData tag, except for the last one.
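A very rough Java sketch of that chunking idea (the Chunk class and the framing around the MoreData flag are invented for illustration; only the splitting rule comes from the description above):

import java.util.ArrayList;
import java.util.List;

// Rough sketch of the large-object handling described above: split a payload
// into pieces no larger than maxMsgSize, mark every chunk except the last
// with a "more data" flag, and carry the total size with the first chunk.
public class LargeObjectChunker {

    public static class Chunk {
        public final byte[] data;
        public final boolean moreData;   // analogous to the MoreData tag
        public final long totalSize;     // only meaningful on the first chunk
        Chunk(byte[] data, boolean moreData, long totalSize) {
            this.data = data;
            this.moreData = moreData;
            this.totalSize = totalSize;
        }
    }

    public static List<Chunk> split(byte[] payload, int maxMsgSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int offset = 0; offset < payload.length; offset += maxMsgSize) {
            int len = Math.min(maxMsgSize, payload.length - offset);
            byte[] piece = new byte[len];
            System.arraycopy(payload, offset, piece, 0, len);
            boolean more = offset + len < payload.length;
            chunks.add(new Chunk(piece, more, payload.length));
        }
        return chunks;
    }
}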
Depending on how large it is, you might want to consider compressing it first. This, of course, depends on how often the same data is sent and how often it's changed.
To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.
Compression is an obvious approach. This XML bugger will shrink like there is no tomorrow.
If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.
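As a concrete example of the compression step, here is a minimal gzip round-trip; gzip is chosen only because it ships with the JDK, whereas bzip2 would need an extra library such as Commons Compress:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: gzip an XML file before sending it and gunzip it on the other side.
public class XmlCompression {

    public static void compress(String xmlPath, String gzPath) throws Exception {
        try (FileInputStream in = new FileInputStream(xmlPath);
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gzPath))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static byte[] decompress(String gzPath) throws Exception {
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(gzPath));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}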
Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?
For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line-by-line if you can guarantee that the XML will not have any linefeeds within tags.
If you have code that can digest the XML a node at a time, then spit it out a node at a time using something like Transfer-Encoding: chunked: you write the length of the chunk (in hex) followed by the chunk, then another chunk, and "0\n" at the end. To save bandwidth, gzip each chunk.
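For the "digest a node at a time" part, a streaming parser such as StAX avoids waiting for the entire file; a minimal sketch (the element name "record" is just an example, not from the question):

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Sketch: pull-parse the XML as it arrives instead of loading the whole document.
public class StreamingXmlReader {
    public static void read(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {
                // handle one node here, then let it be garbage-collected
            }
        }
        reader.close();
    }
}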