I am parsing an XML document using a SAX parser.
I want to know which is better and faster to work with: DOM, SAX, or XMLPullParser.
It depends on what you are doing. If you have very large files, you should use a SAX parser, since it fires events and releases them again; almost nothing is stored in memory. With a SAX parser, however, you can't access elements randomly; there is no going back! DOM, on the other hand, lets you access any part of the XML file, since it keeps the whole document in memory. Hope this answers your question.
If you want to know which parser is fastest, Xerces is the fastest you'll find, and its SAX parser should give you better performance than DOM.
The SAX XML parser is already available in the Android SDK:
http://developer.android.com/reference/org/xml/sax/XMLReader.html
so it is easy to access.
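For illustration, here is a minimal sketch of the event-based style (the element name "title" and the handler class are my own placeholders, not from the question):

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // reset the buffer for each new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            System.out.println("Title: " + text); // react to the event; nothing else is kept in memory
        }
    }

    public static void parse(InputStream in) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new TitleHandler());
    }
}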
One aspect by which different kinds of parsers can be classified is whether they need to load the entire XML document into memory up front. Parsers based on the Document Object Model (DOM) do that: they parse XML documents into a tree structure, which can then be traversed in-memory to read its contents. This allows you to traverse a document in arbitrary order, and gives rise to some useful APIs that can be slapped on top of DOM, such as XPath, a path query language that has been specifically designed for extracting information from trees. Using DOM alone isn’t much of a benefit because its API is clunky and it’s expensive to always read everything into memory even if you don’t need to. Hence, DOM parsers are, in most cases, not the optimal choice to parse XML on Android.
There is a class of parsers that don't need to load a document up front. These parsers are stream-based, which means they process an XML document while still reading it from the data source (the Web or a disk). This implies that you do not have random access to the XML tree as with DOM, because no internal representation of the document is being maintained. Stream parsers can be further distinguished from each other. There are push parsers that, while streaming the document, call back to your application when encountering a new element; SAX parsers fall into this class. Then there are pull parsers, which are more like iterators or cursors: here the client must explicitly ask for the next element to be retrieved.
Source: Android in Practice.
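To make the book's push/pull distinction concrete, here is a rough pull-style sketch using Android's XmlPullParser (the element name "book" is an illustrative assumption):

import java.io.Reader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullExample {
    public static void countBooks(Reader in) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(in);
        int count = 0;
        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            // the client explicitly asks for the next event; nobody calls back
            if (event == XmlPullParser.START_TAG && "book".equals(parser.getName())) {
                count++;
            }
            event = parser.next();
        }
        System.out.println(count + " book elements");
    }
}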
Related
I have a very large XML document which I receive as input. From this XML I need just a single child element. Parsing the entire XML to retrieve just one element seems like performance overkill. Are there any better approaches to this problem?
One approach would be to use the DocumentBuilder API to parse the XML and then use XPath to retrieve the desired field. But the parse method will still unnecessarily parse the entire XML. Is there an overloaded parse method in any parser implementation that takes the XPath and parses the XML only as far as the XPath requires?
What you need is a SAX parser or a similar fast parser. A SAX parser does not have to process the entire XML document; you can stop it as soon as it reaches the element you are looking for.
You can read about SAX parsers on Wikipedia. Also have a look at the Java docs for the SAX parser.
Although there is no way around parsing for the proper treatment of your XML data, there is definitely a way around building an in-memory representation of the entire document. Java offers SAX parsing, which is event-based. You can implement an event handler for XML events, ignoring everything on the way to the content that you need, and stopping after retrieving the part that you are looking for.
Here is a tutorial from Oracle showing how to use SAX APIs to retrieve counts of individual tags without building a document in memory.
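As a rough sketch of that early-exit technique (the target element "price" and the marker exception are my own illustration; SAX has no built-in "stop" call, so aborting via an exception is the usual trick):

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SingleElementExtractor extends DefaultHandler {
    // Marker exception used purely to abort parsing early.
    static class DoneException extends SAXException {}

    private boolean inTarget;
    private final StringBuilder value = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("price".equals(qName)) inTarget = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTarget) value.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("price".equals(qName)) throw new DoneException(); // got it; skip the rest of the file
    }

    public static String extract(InputStream in) throws Exception {
        SingleElementExtractor h = new SingleElementExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, h);
        } catch (DoneException expected) {
            // parsing was aborted on purpose
        }
        return h.value.toString();
    }
}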
Since most XPath processors work with SAX as well, you could potentially feed events to an XPath processor and look for the desired tag that way, too. However, this may be overkill for a situation where you only need to fetch a single element.
XPath operates over the Document Object Model, so you have to have a DOM in order to evaluate an XPath expression. Otherwise, what would it evaluate against?
So XPath is out if you don't want to parse the whole document. One option is fast SAX parsing, where you ignore all SAX parsing events until you get to the element that you want, extract the text you need, and then abandon the rest of the parsing process.
The other option is to go way simpler: use grep.
Suppose one needs to display certain elements that contain some particular data, and then sort them based on that data.
Which would be the better choice of XML parser, DOM or SAX?
Also, can either of these sort the XML data without the need to store the data first?
Sorting will require you to read the entire XML document into memory, so working with a DOM will probably be easiest. There are good libraries that make working with a DOM easier (a small sorting sketch follows this list):
dom4j
JDOM
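For illustration, a minimal sketch with the standard W3C DOM, sorting elements by a child's text (the element names "item" and "name" are placeholder assumptions):

import java.io.InputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class Sorter {
    public static List<Element> sortedItems(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList nodes = doc.getElementsByTagName("item");
        List<Element> items = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            items.add((Element) nodes.item(i));
        }
        // sort by the text content of each item's <name> child
        items.sort(Comparator.comparing(
                e -> e.getElementsByTagName("name").item(0).getTextContent()));
        return items;
    }
}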
It would be better to use StAX (Streaming API for XML), because it is a universal solution for both tiny and large files; if your XML files aren't too big, though, you could use DOM, because it will be easier. Also, you can run XPath queries when using DOM, which could be helpful for you (see the sketch after the list below). Two well-known StAX implementations:
Woodstox
Aalto XML processor
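To illustrate the XPath-over-DOM option mentioned above (the file name and expression are placeholders), a minimal sketch with the JDK's built-in javax.xml.xpath API:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExample {
    public static void main(String[] args) throws Exception {
        // the whole document is parsed into memory first; that's the DOM trade-off
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("transactions.xml");
        NodeList txns = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//transactions/txn", doc, XPathConstants.NODESET);
        for (int i = 0; i < txns.getLength(); i++) {
            System.out.println(txns.item(i).getTextContent());
        }
    }
}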
The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick each transaction XML element based on the user input and send it to a processing system.
Sample content of the file:
<transactions>
  <txn id="1">
    <name> product 1</name>
    <price>29.99</price>
  </txn>
  <txn id="2">
    <name> product 2</name>
    <price>59.59</price>
  </txn>
</transactions>
The (technical) user is expected to give the input tag name, like <txn>.
We would like this solution to be more generic. The file content might differ, and users may give an XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or on an FTP server
Since the file size is huge, we can't load the entire file into the JVM
Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.
Looking for suggestions. Thanks in advance.
If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this, your 10GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: it would also build a DOM. If you truly need a DOM, you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API, builds its DOM over StAX so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM); this will be true of all XPath implementations, so if you don't truly need XPath, sticking with just the event parser is recommended.
SAX is the old streaming API, with StAX newer and a great deal faster. Whether you use the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked against your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you find the result easier to navigate. But that sounds like overkill if the markup is simple.
This would likely be the simplest, fastest possible approach and avoids the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable), you could make it relatively generic.
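A rough sketch of that idea with the JDK's StAX event API; the file name, the element name <txn>, and the id attribute follow the question's sample, and the hard-coded matching is a simplification of the configurable filter described above:

import java.io.FileInputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class TxnScanner {
    // Prints the text of the <txn> whose id attribute matches wantedId, then stops.
    public static void printTxn(String file, String wantedId) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream(file));
        QName txn = new QName("txn");
        QName id = new QName("id");
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && txn.equals(event.asStartElement().getName())
                    && event.asStartElement().getAttributeByName(id) != null
                    && wantedId.equals(event.asStartElement().getAttributeByName(id).getValue())) {
                int depth = 1; // track nesting until the matching </txn>
                StringBuilder text = new StringBuilder();
                while (depth > 0 && reader.hasNext()) {
                    XMLEvent e = reader.nextEvent();
                    if (e.isStartElement()) depth++;
                    if (e.isEndElement()) depth--;
                    if (e.isCharacters()) text.append(e.asCharacters().getData());
                }
                System.out.println(text.toString().trim());
                return; // the rest of the 10GB file is never read
            }
        }
    }
}

Passing the element and attribute names in as parameters, as suggested above, would make this relatively generic.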
StAX and XPath are very different things. StAX allows you to parse a streaming XML document in a forward direction only, while XPath lets you navigate a document in both directions, which requires an in-memory model. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
We regularly parse complex XML files of 1GB+ by using a SAX parser that does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.
I blogged about it here - it uses a SAX rather than a StAX parser, but it may be worth a look.
It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
  <p:iteration-source select="//transactions/txn"/>
  <!-- your processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc:
<p:wrap-sequence wrapper="transactions"/>
Hope this helps
A fun solution for processing huge XML files (>10GB):
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM-based approach.
Use JAXB to read the parts from their byte positions.
Find details, using the example of Wikipedia dumps (17GB), in this SO answer: https://stackoverflow.com/a/43367629/1485527
Streaming Transformations for XML (STX) might be what you need.
Do you need to process it fast, or do you need fast lookups in the data? These requirements call for different approaches.
For fast reading of the whole data, StAX will be OK.
If you need fast lookups, you may need to load it into some database, e.g. Berkeley DB XML.
I am new to this validation process in Java...
-->XML file named Validation Limits
-->Structure of the XML:
<parameter> </parameter>
<lowerLimit> </lowerLimit>
<upperLimit> </upperLimit>
<enable> </enable>
-->Depending on the enable status, 'true' or 'false', I must perform the validation process for the respective parameter
-->What could be the best possible method to perform this operation?
I have parsed the XML (DOM) [forgot to mention this earlier] and stored the values in arrays, but this is complicated, with a lot of referencing taking place from one array to another. Any better method that could replace the array procedure would be helpful.
Thank you in advance.
Try using a DOM or SAX parser; they will do the parsing for you. You can find some good, free tutorials on the internet.
The difference between DOM and SAX is as follows: DOM loads the XML into a tree structure which you can browse through (i.e. the whole XML is loaded), whereas SAX parses the document and triggers events (calls methods) in the process. Both have advantages and disadvantages, but personally, for reasonably sized XML files, I would use DOM.
So, in your case: use DOM to get a tree of your XML document, locate the enable flag, and read the other elements depending on it.
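As a rough sketch of how the arrays could be replaced by one small value class per parameter (the nesting, the name attribute, and the file name are assumptions I am making about the structure described above):

import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ValidationLimits {
    // One record per <parameter> instead of values scattered across parallel arrays.
    static class Limits {
        final double lower, upper;
        final boolean enabled;
        Limits(double lower, double upper, boolean enabled) {
            this.lower = lower;
            this.upper = upper;
            this.enabled = enabled;
        }
        boolean isValid(double value) {
            return !enabled || (value >= lower && value <= upper);
        }
    }

    public static Map<String, Limits> load(String file) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(file);
        Map<String, Limits> limits = new HashMap<>();
        NodeList params = doc.getElementsByTagName("parameter");
        for (int i = 0; i < params.getLength(); i++) {
            Element p = (Element) params.item(i);
            limits.put(p.getAttribute("name"), new Limits(
                    Double.parseDouble(childText(p, "lowerLimit")),
                    Double.parseDouble(childText(p, "upperLimit")),
                    Boolean.parseBoolean(childText(p, "enable"))));
        }
        return limits;
    }

    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}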
Also, you can achieve this in pure XML, using XML Schema, although this might be too much for simple needs.
What I would really like is a streaming API that works sort of like StAX, and sort of like DOM/JDom.
It would be streaming in the sense that it would be very lazy and not read things in until needed. It would also be streaming in the sense that it would read everything forwards (but not backwards).
Here's what code that used such an API would look like.
URL url = ...
XMLStream xml = XXXFactory(url.inputStream()) ;
// process each <book> element in this document.
// the <book> element may have subnodes.
// You get a DOM/JDOM like tree rooted at the next <book>.
while (xml.hasContent()) {
XMLElement book = xml.getNextElement("book");
processBook(book);
}
Does anything like this exist?
You could do the following:
Scan the XML file using SAX or StAX and immediately serialize everything back into a StringBuilder, i.e. create your own copy of the XML file.
If you encounter an endElement and you know you don't need the subtree you just parsed, clear the StringBuilder.
If you need it, you can build a DOM tree from the "copy" you created.
With this you can fall back on standard frameworks, one for conventional SAX parsing and one for conventional DOM building. Only the custom serialization might require some hacking.
It also helps if you know the tree boundaries in advance (book elements in your example); otherwise further processing would be required.
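A sketch of an alternative that avoids the hand-rolled serialization entirely: the JDK's identity Transformer accepts a StAXSource wrapped around a stream reader positioned on a start element and consumes exactly that subtree into a DOMResult. The element name "book" follows the question; treat this as an illustration of the idea rather than a tested recipe.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Node;

public class BookStream {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("books.xml"));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "book".equals(reader.getLocalName())) {
                DOMResult result = new DOMResult();
                t.transform(new StAXSource(reader), result); // consumes one <book> subtree
                processBook(result.getNode().getFirstChild()); // small DOM tree per book
            }
        }
    }

    static void processBook(Node book) {
        System.out.println("Read element: " + book.getNodeName());
    }
}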
The only practical way to parse part of the document without fully loading it into memory is to use a streaming parser such as SAX.
Here are some official Sun examples of how to use SAX: http://java.sun.com/developer/codesamples/xml.html#sax