I am working with a huge XML file (a Wikipedia dump) that certainly can't be read into memory at once, nor would it be practical to do so.
I googled SAX XML tutorials, but they all show quite an ugly low-level approach, where you have to set flags manually and track which element you are currently in.
Actually, the whole dump consists of many relatively small page entries, so a reasonable strategy looks like this:
read the whole page entry into memory;
process it;
dispose of it and move on to the next.
It would require only enough memory to handle a single page entry, while I could still use all the conveniences of a parsed, tree-like XML representation.
My questions are:
Is it possible to implement such a strategy in Java?
Is it possible to do so using Jsoup, as it is my main tool for working with smaller XML files?
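For what it's worth, the strategy above maps naturally onto plain Java: use a StAX cursor to skim the stream, and when a page element starts, hand just that subtree to the JDK's identity Transformer to get a small DOM for it. This is a sketch under assumptions: the `pageTitles` method, the `page`/`title` element names, and the sample document are illustrative (real Wikipedia dumps also declare a namespace, which is sidestepped here by matching on local names).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Element;

public class PageStreamer {

    // Streams the file, materializing one <page> subtree at a time as a DOM.
    static List<String> pageTitles(Reader xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Transformer copy = TransformerFactory.newInstance().newTransformer();
        List<String> titles = new ArrayList<>();
        while (r.hasNext()) {
            if (r.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "page".equals(r.getLocalName())) {
                DOMResult page = new DOMResult();
                copy.transform(new StAXSource(r), page); // consumes exactly this subtree
                Element root = (Element) page.getNode().getFirstChild();
                titles.add(root.getElementsByTagName("title").item(0).getTextContent());
                // the small per-page DOM becomes garbage as soon as we move on
            } else {
                r.next();
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<mediawiki><page><title>A</title></page>"
                + "<page><title>B</title></page></mediawiki>";
        System.out.println(pageTitles(new StringReader(sample)));
    }
}
```

Jsoup has no streaming mode of its own; if you want to keep Jsoup for the per-entry processing, the analogous approach is to collect each page entry's raw markup as a string while streaming and pass that string to Jsoup.parse, so only one entry is ever in memory.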
Related
I've got an XML performance problem that I'm looking to solve.
Specifically, I've got the same small/medium-sized XML file being parsed many hundreds of times.
The functionality is bound to a StAX XML event reader. Its output cannot be cloned or otherwise copied, the only way to reproduce the needed functionality is to run this XML event reader over the XML document again.
For performance I would like to read the XML into a StAX event sequence eagerly, and then replay that event sequence rather than re-parse the XML each time.
I believe the problem is implementation: while this idea is reasonable in principle, "events" are expressed as state changes against the XMLStreamReader, which has a large API surface, a large portion (but not all) of which relates to its "current" event.
Does a system like this already exist?
If I have to build it myself, what might be the best way to ensure correctness?
The usual way to represent an XML document in memory, to avoid parsing it repeatedly, is to use one of the many tree models (JDOM2 and XOM are the best in my view, though many people still use the horrible old DOM model simply because it's packaged in the JDK). So I guess I'm asking why this "obvious" approach doesn't work for you.
There are cases where (internally within Saxon) I use a replayable stream of events instead, simply because storing the events and then replaying them is a bit more efficient than building a tree and then walking the tree. I don't use StAX events for this; I use my own class net.sf.saxon.event.EventBuffer, which holds a list of net.sf.saxon.event.Event objects. Perhaps this event model is a bit better designed for the purpose, being rather simpler than the StAX model. Saxon doesn't have any logic to read an EventBuffer as a StAX event stream, but it would be easy enough to add. It's open source code, so see if you can adapt it.
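There is also a standard-API route in the same spirit: StAX's event-iterator API (XMLEventReader) hands back immutable javax.xml.stream.events.XMLEvent objects, which, unlike the cursor XMLStreamReader's mutable state, can simply be stored in a list and replayed any number of times. A hedged sketch (the class and method names here are mine):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class ReplayBuffer {
    private final List<XMLEvent> events = new ArrayList<>();

    // Parse once; XMLEvent objects are immutable, so keeping them is safe.
    static ReplayBuffer parse(String xml) throws Exception {
        XMLEventReader r = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        ReplayBuffer b = new ReplayBuffer();
        while (r.hasNext()) {
            b.events.add(r.nextEvent());
        }
        return b;
    }

    // Replay as often as needed, with no re-parsing.
    Iterator<XMLEvent> replay() {
        return events.iterator();
    }

    public static void main(String[] args) throws Exception {
        ReplayBuffer b = parse("<a><b/></a>");
        int n = 0;
        for (Iterator<XMLEvent> it = b.replay(); it.hasNext(); it.next()) n++;
        System.out.println(n + " buffered events");
    }
}
```

If the downstream code insists on an XMLEventReader rather than an Iterator, a thin wrapper implementing XMLEventReader over the stored list is straightforward, since most of its methods delegate to the iterator. If the code is bound to the cursor XMLStreamReader API specifically, the events would have to be translated back into cursor state, which is exactly where a simpler event model like Saxon's has the advantage.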
I am storing data for ranking users in XML documents - one row per user - containing a 36 char key, score, rank, and username as attributes.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank="15"></Rank>..
</Ranks>
There are several such documents, which are parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within an HttpServlet which is backing a widget. Every time the widget is loaded, it calls the servlet with a GET request, which then requires one of the documents to be queried. The queries on the documents require the following operations:
Look-up: finding a particular ID
Iterate through each Rank element and get the id attribute
In my test environment the number of users is <100 and everything works well. However, we are soon supposed to be delivering to a system with 200K+ users. I have serious concerns about the scalability of my approach, i.e. an OutOfMemoryError!
I'm stuck for ideas for an implementation that balances performance and memory usage. While DOM is good for find operations, it may choke because of the large size. I don't know much about StAX, but from what I have read it seems it might solve the memory issue but could really slow down the queries, as I will effectively have to iterate through the document to find the element of interest (is that correct?).
Questions:
Is it possible to use StAX for multiple find operations (like getElementById) on large documents quickly enough to serve an HttpRequest?
What is the maximum file size that a DOM Parser can handle?
Is it possible to estimate how much memory per user would be used for an XML document with the above structure?
Thanks
Edit: I am not allowed to use databases.
Edit: Would it be better/neater to use a custom-formatted file instead and use regular expressions to search the file for the required entry?
It sounds like you're using the XML document as a database. I think you'll be much happier using a proper database for this and importing/exporting to XML as needed. Several databases work well, so you might as well use one that's well supported, like MySQL or PostgreSQL, although even SQLite will work better than XML.
In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You're then on your own to manage memory (recording the data you see depending on the state you're in), so you're correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.
One of the big problems here is that DOM is not thread-safe, so even read operations need to be synchronized. From that point of view, using JDOM or XOM would definitely be better.
The other issue is the search strategy used to find the data. You really want the queries to be supported by indexing rather than serial search. In fact, you need a decent query optimizer to generate efficient access paths. So given your constraint of not using a database, this sounds like a case for an in-memory XQuery engine with aggressive optimization, for which the obvious candidate is Saxon-EE. But then I would say that, wouldn't I?
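Since the documents here are only re-parsed when a file changes, the indexing can be as simple as building a HashMap once per parse and answering every request from it: look-ups become O(1), and a map that is fully built before being published is safe for concurrent reads, sidestepping the DOM synchronization issue. A sketch against the Rank structure from the question (`scoresById` is an illustrative name):

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RankIndex {

    // Parse once per file update; build an id -> score index for O(1) look-ups.
    static Map<String, String> scoresById(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> index = new HashMap<>();
        NodeList ranks = doc.getElementsByTagName("Rank");
        for (int i = 0; i < ranks.getLength(); i++) {
            Element rank = (Element) ranks.item(i);
            index.put(rank.getAttribute("id"), rank.getAttribute("score"));
        }
        return index; // discard the DOM; only the map stays in memory
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Ranks><Rank id=\"u1\" score=\"36.0\"/>"
                + "<Rank id=\"u2\" score=\"12.5\"/></Ranks>";
        System.out.println(scoresById(xml).get("u1")); // prints 36.0
    }
}
```

Iterating all ids is just a walk over the map's key set, so both query operations from the question are covered without keeping the DOM around.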
For heavy-duty XML processing, VTD-XML is the most efficient option available; it is far more efficient than JDOM, DOM4j, or DOM. The key is the non-object-oriented approach of its info-set modeling, which also makes it far less likely to cause an out-of-memory error. Read this 2013 paper for a comprehensive comparison/benchmark of the various XML frameworks:
Processing XML with Java – A Performance Benchmark
Currently I'm implementing a REST client which shall parse the XML response messages. It is intended to run later on an Android device, so memory and processing speed are quite an issue. However, there will be only one XML response at a time, so processing or holding multiple XML documents at once is not a concern.
As far as I understand, there are three ways of parsing XML with the Android SDK:
SAX
XmlPullParser
DOM
Reading about these different parsing methods, I gathered that SAX is recommended for large XML files, as it won't hold the complete tree in memory like DOM does.
However, I'm asking myself: what is large in terms of kilobytes, megabytes, ...? Is there a practical size up to which it doesn't really matter whether you use SAX or DOM?
Thanks,
Robert
There are no standard limits set for XML documents or DOM size, so it depends entirely on what the host machine can cope with.
As you are implementing on Android, you should assume a pretty limited amount of memory, and remember that the DOM, the XML parser, your program logic, the display logic, the JVM and Android itself all have to fit in the available memory!
As a rule of thumb, you can expect the DOM to occupy about four times the memory of the source XML document. So assume 512 MB of available memory and aim to take no more than half of that for your DOM: at a 4x expansion that gives 512/8, or a practical maximum of 64 MB for the XML document.
Just to be on the safe side, I would halve that again to a 32 MB maximum. So if you expect many documents of this size, I would switch to SAX parsing!
If you want the app to respond with any speed on large documents, then SAX is the way to go. A SAX parser can start returning results as soon as the first element is read; a DOM parser needs to read the whole document before any output can be sent to your program.
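To make the "results before the document ends" point concrete, here is a minimal SAX handler in plain Java (the `item` element and `name` attribute are made up for illustration): the startElement callback fires as each tag is read, so results accumulate while parsing is still in progress.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FirstResults {

    // startElement fires per tag as the stream is read; no tree is ever built.
    static List<String> itemNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("item".equals(qName)) {
                    names.add(atts.getValue("name"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(itemNames("<r><item name=\"a\"/><item name=\"b\"/></r>"));
    }
}
```

On Android the same DefaultHandler works via android.util.Xml or the platform SAXParserFactory, and memory stays proportional to what you choose to collect, not to the document size.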
Excerpt from this article:
DOM parsers suffer from memory bloat. With smaller XML sets this isn't such an issue, but as the XML size grows, DOM parsers become less and less efficient, making them not very scalable in terms of growing your XML. Push parsers are a happy medium, since they allow you to control parsing, thereby eliminating any kind of complex state management because the state is always known, and they don't suffer from the memory bloat of DOM parsers.
This could be the reason SAX is recommended over DOM: SAX functions as an XML push parser. Also, check out the Wikipedia article for SAX here.
EDIT: To address size specifically you would have to look at your implementation. An example of DOM Document object size in the memory of a Java-based XML parser is here. Java, like a lot of languages, defines some memory-based limitations such as the JVM heap size, and the Android web services/XML DOM API may also define some internal limits at the programmers' discretion (mentioned in part here). There is no one definitive answer as to maximum allowed size.
In my experience, with DOM the memory used is about 2x the file size, but of course that's just an indication. If the XML tree has just one field containing the entire data, the memory used is similar to the file size!
I'm working on a Java utility that generates a bunch of XML documents matching a specific DTD, using slightly randomized layout generation (so, for example, the document might look like <a><b><c /></b></a> or it might look like <a><b/><b><c>text</c></b></a>).
Right now, I've gotten it to the point where I can generate roughly 32,000 documents per second (storing the files in /dev/shm/), and I feel like that's pretty good, but it leaves me wondering if maybe I could do it faster in C++ or maybe some other language with super-fast XML generation. Any contenders?
As for speed: probably not. You are most likely bound by hard disk speed at that point. Make sure you are using a buffered class to write to the disk, but otherwise I don't know if it'll get a lot faster.
You could run different threads/instances if you had two hard drives, but writing two streams to one drive only slows things down.
I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
This is repeated for several million elements. I'm running it on a computer with 16 GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10 GB worth of data from the input file?
I suspect I could manually write a 'producer'/'consumer' multithreaded version of this (loading the objects on one side, using them and discarding them on the other), but damn it, XML is ancient now; are there no efficient libraries to crunch 'em?
Just to cover the bases: is Java able to use your 16 GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely that memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO- or CPU-bound. Most likely it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
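If the regexes do turn out to be the hot spot, the first thing to check is that each Pattern is compiled once and reused, not recompiled per element. A small sketch (the `LINK` pattern and `firstLink` method are illustrative, not from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {

    // Compiled once; calling Pattern.compile inside a per-element loop is a classic hot spot.
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|]+)");

    static String firstLink(String text) {
        Matcher m = LINK.matcher(text); // Matcher creation is cheap, compilation is not
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstLink("see [[Java (language)|Java]]")); // prints Java (language)
    }
}
```

Note that String.matches and String.replaceAll recompile their pattern on every call, so over several million elements the saving from a shared static Pattern can be substantial.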
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by its speed. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding it.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your DB? It sounds like you would be writing to the DB almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could possibly help too.
Other than this early comment, we need more info: do you have a profiler handy that can point out what makes things run slowly?
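The commit-batching idea can be sketched independently of the JDBC details: buffer parsed records and hand each full batch to a single flush callback, which in this setting would do the database work once per batch instead of once per row. The class and names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchBuffer<T> {
    private final int limit;
    private final Consumer<List<T>> flush; // e.g. one batched insert + one commit per call
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int limit, Consumer<List<T>> flush) {
        this.limit = limit;
        this.flush = flush;
    }

    void add(T item) {
        pending.add(item);
        if (pending.size() >= limit) flushNow();
    }

    void flushNow() {
        if (!pending.isEmpty()) {
            flush.accept(new ArrayList<>(pending)); // hand over a copy, then reset
            pending.clear();
        }
    }

    public static void main(String[] args) {
        BatchBuffer<String> buf = new BatchBuffer<>(1000,
                batch -> System.out.println("flushing " + batch.size() + " rows"));
        for (int i = 0; i < 2500; i++) buf.add("row" + i);
        buf.flushNow(); // flushes the final 500
    }
}
```

With JDBC, the flush callback would typically run against a connection with setAutoCommit(false): one PreparedStatement.addBatch call per row, then a single executeBatch and commit per batch of, say, 1,000 rows.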
You can use the Jibx library and bind your XML "nodes" to objects that represent them. You can even overload an ArrayList, and when x objects have been added, perform the regexes all at once (presumably using the method on your object that performs this logic) and then save them to the database, before allowing the "add" method to finish once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (note that ArrayList.add returns boolean, so the override must too):
public boolean add(Object o) {
    boolean added = super.add(o);
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it is flushed to the database. flushObjects() is simply the method that performs this logic. It will block the addition of further objects from the XML file until it completes, but that's OK: the overhead of the database will probably be much greater than file reading and parsing anyway.
I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open-source stuff; never tested it myself), and then performing iterative paged queries to process your data in small chunks at a time.
You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order-independent, can you multi-thread the process to split the file up, or run multiple processes starting at different locations in the file? If you're not I/O-bound, that should help speed it along.