Parse large XML files over a network - java

I did some quick searching on the site and couldn't seem to find the answer I was looking for, so, that being said: what are some best practices for passing large XML files across a network? My thoughts on the matter are to stream chunks across the network in manageable segments, but I am looking for other approaches and best practices. I realize that "large" is a relative term, so I will let you choose an arbitrary value to be considered large.
In case there is any confusion the question is "What are some best practices for sending large xml files across networks?"
Edit:
I am seeing a lot of compression being talked about; is there any particular compression algorithm that could be used, and what about decompressing said files? I don't have much desire to roll my own when I'm aware there are proven algorithms out there. Also, I appreciate the responses so far.

Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control on both the client and server sides, is WBXML (WAP Binary XML Spec).
This spec defines how to convert the XML into a binary format which is not only compact, but also easy to parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside to this spec is that an application token table must exist on both sides: a statically defined code table that holds binary values for all possible tags and attributes in the application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.
For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:
This feature provides a means to synchronize an object whose size exceeds that which can be transmitted within one message (e.g. the maximum message size – declared in the MaxMsgSize element – that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and by sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will be sent. Every subsequent chunk is sent with a MoreData tag, except for the last one.
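As a rough illustration of that idea, here is a minimal Java sketch of the splitting step. The Chunk record and its fields are hypothetical, not part of the SyncML spec; it just shows segments no larger than a maximum message size, with every chunk but the last flagged as having more data:

import java.util.ArrayList;
import java.util.List;

// Hypothetical chunk holder: the overall size travels only with the first chunk,
// and moreData is true for every chunk except the last, as described above.
record Chunk(byte[] payload, long totalSize, boolean moreData) {}

class Chunker {
    static List<Chunk> split(byte[] object, int maxMsgSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int offset = 0; offset < object.length; offset += maxMsgSize) {
            int len = Math.min(maxMsgSize, object.length - offset);
            byte[] part = new byte[len];
            System.arraycopy(object, offset, part, 0, len);
            boolean more = offset + len < object.length;
            long total = (offset == 0) ? object.length : -1; // size only in the first chunk
            chunks.add(new Chunk(part, total, more));
        }
        return chunks;
    }
}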

Depending on how large it is, you might want to consider compressing it first. This, of course, depends on how often the same data is sent and how often it changes.
To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.

Compression is an obvious approach: XML is so repetitive that the bugger will shrink like there is no tomorrow.
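For reference, a minimal sketch of gzipping an XML file with the JDK's built-in java.util.zip support (the file names are placeholders):

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class GzipXml {
    public static void main(String[] args) throws IOException {
        // Compress data.xml into data.xml.gz using the JDK's streaming gzip support.
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.xml"));
             OutputStream out = new GZIPOutputStream(
                     new BufferedOutputStream(new FileOutputStream("data.xml.gz")))) {
            in.transferTo(out); // Java 9+: copy the whole stream
        }
    }
}

The receiver simply wraps its input in a GZIPInputStream to decompress while reading.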

If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.

Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?
For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line-by-line if you can guarantee that the XML will not have any linefeeds within tags.
If you have code that can digest the XML a node-at-a-time, then spit it out a node-at-a-time, using something like Transfer-Encoding: chunked. You write the length of the chunk (in hex) followed by the chunk, then another chunk, or "0\n" at the end. To save bandwidth, gzip each chunk.
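A rough sketch of that length-prefixed framing (real HTTP chunked encoding uses CRLF terminators, as here; gzipping each chunk is left out for brevity):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedWriter {
    private final OutputStream out;

    ChunkedWriter(OutputStream out) { this.out = out; }

    // Write one chunk: the length in hex, CRLF, the bytes, CRLF.
    void writeChunk(byte[] data) throws IOException {
        out.write((Integer.toHexString(data.length) + "\r\n").getBytes(StandardCharsets.US_ASCII));
        out.write(data);
        out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
    }

    // Terminate the stream with a zero-length chunk.
    void finish() throws IOException {
        out.write("0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        out.flush();
    }
}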

Related

Is file compression a legit way to improve the performance of network communication for a large file

I have a large file and I want to send it over the network to multiple consumers via a pub/sub method. For that purpose, I chose to use JeroMQ. The code is working, but I am not happy because I want to optimize it. The file is quite large (over 2 GB). My question now is: if I send a compressed file, for example with gzip, to consumers, will the performance improve, or do the compress/decompress steps introduce additional overhead so that the performance will not improve? What do you think?
In addition to compression, is there any other technique I could use?
For example, using erasure coding: splitting the data into chunks, sending the chunks to consumers, and then having them communicate with each other to reconstruct the original file.
(Maybe my second idea is stupid or I haven't understood something correctly; please give me your directions.)
if I send a compressed file, for example with gzip, to consumers, will the performance improve, or do the compress/decompress steps introduce additional overhead so that the performance will not improve?
Gzip has a relatively good compression ratio (though not the best), but it is slow. In practice, it is so slow that the network interconnect can be faster than compressing and decompressing the data stream; gzip is only fine for relatively slow networks. There are faster compression algorithms, generally with lower compression ratios. LZ4, for example, is very fast at both compressing and decompressing a data stream in real time. There is a catch though: the compression ratio depends strongly on the kind of file being sent. Binary files or already-compressed data will barely shrink, especially with a fast algorithm like LZ4, so compression will not be worth it there. For text-based data, or data with repeated patterns (or a reduced range of byte values), compression can be very useful.
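As a concrete point of reference, here is a minimal sketch using the lz4-java library (package net.jpountz.lz4; this assumes the lz4-java dependency is on the classpath):

import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4Demo {
    public static void main(String[] args) {
        byte[] data = "some repetitive, text-based payload ...".getBytes(StandardCharsets.UTF_8);

        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();

        // Compress: the destination buffer must be sized for the worst case.
        int maxLen = compressor.maxCompressedLength(data.length);
        byte[] compressed = new byte[maxLen];
        int compressedLen = compressor.compress(data, 0, data.length, compressed, 0, maxLen);
        System.out.println(data.length + " bytes -> " + compressedLen + " bytes");

        // Decompress: the original length must be known, so send it alongside the payload.
        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = new byte[data.length];
        decompressor.decompress(compressed, 0, restored, 0, data.length);
    }
}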
For example, using erasure coding: splitting the data into chunks, sending the chunks to consumers, and then having them communicate with each other to reconstruct the original file.
This is a common broadcast algorithm in distributed computing. The method is used in distributed hash table algorithms and also in MPI implementations (massively used on supercomputers). For example, MPI uses a tree-based broadcast method when the number of machines and/or the amount of data is relatively big. Note that this method introduces additional latency overheads, which should be small in your case unless the network has a very high latency. Distributed hash tables use more complex algorithms, since they generally assume a node can fail at any time (so they use dynamic adaptation) and cannot be fully trusted (so they use checks like hashes, and sometimes even fetch data from multiple sources to avoid malicious injection from specific machines). They also make no assumptions about the network structure/speed (resulting in imbalanced download speeds).

design pattern for streaming protoBuf messages

I want to stream protobuf messages onto a file.
I have a protobuf message
message car {
... // some fields
}
My Java code would create multiple objects of this car message.
How should I stream these messages onto a file?
As far as I know there are two ways of going about it.
Option 1: have another message, like cars:
message cars {
  repeated car c = 1;
}
and make the Java code create a single cars object and then stream it to a file.
Option 2: just stream the car messages onto a single file, using the writeDelimitedTo function.
I am wondering which is the more efficient way to go about streaming using protobuf.
When should I use pattern 1 and when should I be using pattern 2?
This is what I got from https://developers.google.com/protocol-buffers/docs/techniques#large-data
I am not clear on what they are trying to say.
Large Data Sets
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages
within a large data set. Usually, large data sets are really just a
collection of small pieces, where each small piece may be a structured
piece of data. Even though Protocol Buffers cannot handle the entire
set at once, using Protocol Buffers to encode each piece greatly
simplifies your problem: now all you need is to handle a set of byte
strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data
sets because different situations call for different solutions.
Sometimes a simple list of records will do while other times you may
want something more like a database. Each solution should be developed
as a separate library, so that only those who need it need to pay the
costs.
Have a look at Previous Question. Any difference in size and time will be minimal
(option 1 faster ??, option 2 smaller).
My advice would be:
Option 2 for big files. You process message by message.
Option 1 if multiple languages are needed. In the past, delimited messages were not supported in all languages; this seems to be changing though.
Otherwise, personal preference.
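A minimal sketch of option 2 with the protobuf Java API, assuming the generated class for the car message is named Car (the file name is a placeholder):

import java.io.*;

public class CarStream {
    // Append each message with a length prefix so it can be read back one at a time.
    static void writeAll(Iterable<Car> cars, File file) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            for (Car car : cars) {
                car.writeDelimitedTo(out);
            }
        }
    }

    // Read the messages back one by one; parseDelimitedFrom returns null at end of stream.
    static void readAll(File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            Car car;
            while ((car = Car.parseDelimitedFrom(in)) != null) {
                // process one car at a time without holding the whole file in memory
            }
        }
    }
}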

Send and receive data over a 'noisy' data stream

My Java program saves its data to a binary file, and (very) occasionally the file becomes corrupt due to a hardware fault. Usually only a few bytes are affected in a file that is several megabytes in size. To cope with this problem, I could write the data twice, but this seems overkill - I would prefer to limit the file size increase to about 20%.
This seems to me to be similar to the problem of sending information over a 'noisy' data stream. Is there a Java library or algorithm that can write redundant information to an output stream so the receiver can recover when noise is introduced?
What you want is Error Correction Codes. Check this code out: http://freshmeat.net/projects/javafec/
As well, the wikipedia article might give you more information:
http://en.wikipedia.org/wiki/Forward_error_correction
Your two possibilities are forward error correction, where you send redundant data, and a system for error detection, where you check a hash value and re-request any data that has become corrupted. If corruption is an expected thing, error correction is the approach to take.
Without knowing the nature of your environment, giving more specific advice isn't really possible, but this should get you started on knowing how to approach this problem.
Error-correcting codes. If I recall correctly, the number of additional bits goes as log n of the block size, so the larger the block, the fewer correction bits relative to the data.
You should choose a mechanism that interleaves the checkbits (probably most convenient as extra characters) in between the normal text. This allows for having repairable holes in your data stream while still being readable.
The problem of noisy communications has already has a great solution: Send a hash/CRC of the data (with the data) which is (re)evaluated by the receiver and re-requested if there was corruption en route.
In other words: use a hash algorithm to check for corruption and retransmit when necessary instead of sending data redundantly.
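A minimal sketch of the detect-and-retransmit idea using the JDK's CRC32 (whether a CRC or a cryptographic hash is appropriate depends on how adversarial the corruption can be):

import java.util.zip.CRC32;

public class BlockCheck {
    // Sender side: compute the checksum to transmit alongside the block.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    // Receiver side: verify the block, and request retransmission on a mismatch.
    static boolean verify(byte[] block, long expected) {
        return checksum(block) == expected;
    }
}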
CRCs and ECCs are the standard answer to detecting and (for ECCs) recovering from data corruption due to noise. However, any scheme can only cope with a certain level of noise; beyond that level you will get undetected and/or uncorrectable errors. The second problem is that these schemes will only work if you can add the ECCs/CRCs before the noise is injected.
But I'm a bit suspicious that you may be trying to address the wrong problem:
If the corruption is occurring when you transmit the file over a comms line, then you should be using comms hardware with built-in ECC etc support.
If the corruption is occurring when you write the file to disc, then you should replace the disc.
You should also consider the possibility that it is your application that is corrupting the data; e.g. due to some synchronization bug in your code.
Sounds antiquated, but funnily enough I just had a similar conversation with someone who wrote "mobile" apps (not PDA/phone, but Oil & Gas drilling-rig-style field applications). Due to the environment, they actually wrote to disk using a modified XMODEM-CRC transfer. Nothing special there, other than to:
Use a RandomAccessFile in "rw" mode: write a block of data (512-4096 bytes), re-read it for a CRC check, re-write if it doesn't match, otherwise move on to the next block. With OS file caching, I'm curious how effective this is?
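A sketch of that write-and-verify loop, assuming a CRC32 per block (the block size and retry policy here are arbitrary choices):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class VerifiedWriter {
    private static final int MAX_RETRIES = 3;

    // Write a block at the given offset, re-read it, and compare CRCs,
    // re-writing a few times if the read-back copy does not match.
    static void writeBlock(RandomAccessFile file, long offset, byte[] block) throws IOException {
        long expected = crc(block);
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            file.seek(offset);
            file.write(block);

            byte[] readBack = new byte[block.length];
            file.seek(offset);
            file.readFully(readBack);
            if (crc(readBack) == expected) {
                return; // block verified, move on to the next one
            }
        }
        throw new IOException("Block at offset " + offset + " failed verification");
    }

    private static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
}

Note that, as the answer above hints, the re-read may be served from the OS page cache rather than the physical medium, which limits what this verification actually proves.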

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node,
storing it into a string object,
running some regexes on the string,
storing the results to the database,
for several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually build a producer/consumer multithreaded version of this (loading the objects on one side, using and discarding them on the other), but dammit, XML is ancient now; are there no efficient libraries to crunch it?
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
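If you try the StAX route, a minimal pull-parsing sketch looks like this (the element name "record" is a hypothetical placeholder; getElementText assumes the element contains only text):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = new BufferedInputStream(new FileInputStream("huge.xml"))) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    String text = reader.getElementText(); // pull just this element's text
                    // run the regexes on `text` and write the results to the database
                }
            }
            reader.close();
        }
    }
}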
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your DB? It sounds like you would be writing to the DB almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and using other standard bulk-processing tricks could also help.
Other than this early comment, we need more info: do you have a profiler handy that can work out what makes things run slowly?
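On the database side, a minimal sketch of batching inserts with a PreparedStatement instead of committing row by row (the table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInsert {
    private static final int BATCH_SIZE = 1_000;

    static void insertAll(Connection conn, List<String> values) throws SQLException {
        conn.setAutoCommit(false); // avoid a commit per row
        try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO results (value) VALUES (?)")) {
            int count = 0;
            for (String v : values) {
                ps.setString(1, v);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // send a batch in one round trip
                }
            }
            ps.executeBatch(); // flush the remainder
            conn.commit();
        }
    }
}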
You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even subclass an ArrayList; then, when x number of objects have been added, perform the regexes all at once (presumably using the method on your object that performs this logic) and save them to the database, before allowing the add method to return once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method along these lines (ArrayList.add returns boolean, so the override must as well):
@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    if (size() >= YOUR_DEFINED_THRESHOLD) {
        flushObjects(); // run the regexes and persist this batch, then clear it
    }
    return added;
}
YOUR_DEFINED_THRESHOLD
is how many objects you want to store in the ArrayList until it has to be flushed out to the database. flushObjects() is simply the method that will perform this logic. That method will block the addition of objects from the XML file until the flush is complete, but this is OK; the overhead of the database will probably be much greater than file reading and parsing anyway.
I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open-source stuff; never tested it myself), and then performing iterative paged queries to process your data in small chunks at a time.
You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.)
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then get to know Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
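A minimal sketch of mapping a file with java.nio (a single MappedByteBuffer window is limited to Integer.MAX_VALUE bytes, so very large files need several maps; the file name is a placeholder):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("data.bin");
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map one read-only window; the OS pages it in on demand.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            while (buffer.hasRemaining()) {
                byte b = buffer.get(); // process a byte (or larger primitives) at a time
            }
        }
    }
}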
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have at least one extra copy of the data if you need to keep the original around.
It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again you can do it faster on subsequent passes.
To answer your questions in order:
Should I load everything into memory all at once?
Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, just do some kind of buffered read through them one by one, storing whatever you need along the way.
If not, what's a good way of loading the data partially?
BufferedReader etc. is simplest, although you could look deeper into FileChannel etc. to use memory-mapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
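A minimal sketch of that buffer-at-a-time reading (the buffer size is an arbitrary choice to tune for your data and hardware):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockReader {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1 << 20]; // 1 MiB at a time; tune as needed
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.txt"))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // process buffer[0..read) here, then reuse the buffer for the next block
            }
        }
    }
}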
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.
I strongly recommend leveraging regular expressions and looking into the "new" I/O (nio) package for faster input. Then it should go as quickly as you can realistically expect gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.
