XML Parsing: Parsing the entire XML for one field - Java

I have a very large XML document which I receive as input. From this XML I need just a single child element. Parsing the entire XML to retrieve one element seems like performance overkill. Are there any better approaches to this problem?
One approach would be to use the DocumentBuilder API to parse the XML and then use XPath to retrieve the desired field. But the parse method will still unnecessarily parse the entire XML. Is there an overloaded parse method in any parser implementation which takes the XPath and parses only the parts of the XML that the XPath needs?

What you need is a SAX parser or a similar streaming parser. A SAX parser does not have to process the entire XML: it can be stopped at the point where it finds the element it is looking for.
You can read about SAX parsers in the Wikipedia article. Also have a look at the Java docs for the SAX parser.

Although there is no way around parsing for the proper treatment of your XML data, there is definitely a way around building an in-memory representation of the entire document. Java offers SAX parsing, which is event-based. You can implement an event handler for XML events, ignoring everything on the way to the content that you need, and stopping after retrieving the part that you are looking for.
Here is a tutorial from Oracle showing how to use SAX APIs to retrieve counts of individual tags without building a document in memory.
Since most XPath processors can work on top of SAX as well, you could potentially feed the events to an XPath processor and look for the desired tag that way, too. However, this may be overkill when you only need to fetch a single element.
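A rough sketch of that early-exit idea (the element name "price", the file name "big.xml" and the trick of throwing a plain SAXException to abort are illustrative choices, not anything the question prescribes):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SingleFieldHandler extends DefaultHandler {

    private boolean recording;
    private final StringBuilder value = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("price".equals(qName)) {        // hypothetical element we are after
            recording = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (recording) {
            value.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("price".equals(qName)) {
            // Found what we came for; abort the rest of the parse.
            throw new SAXException("done");
        }
    }

    public static void main(String[] args) throws Exception {
        SingleFieldHandler handler = new SingleFieldHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(new File("big.xml"), handler);
        } catch (SAXException stopped) {
            // thrown on purpose once the element was read; a real implementation
            // would use a dedicated exception type to distinguish it from parse errors
        }
        System.out.println(handler.value);
    }
}

Throwing from the handler is the usual way to stop a SAX parse early, since the API itself offers no explicit stop call.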

XPath operates over the Document Object Model, so you have to have a DOM in order to evaluate an XPath expression. Otherwise, what would it evaluate against?
So XPath is out if you don't want to parse the document into memory. Your other option is fast SAX parsing, where you ignore all SAX parsing events until you get to the element that you want, extract the text you need, and then abandon the rest of the parsing process.
The other option is to go way simpler: use grep.

Related

How to find xml tags in string content?

I have String content that comes from a SOAP XML request.
I want to identify a specific tag <FieldName> whose value is MyFieldIndicator, and from there fetch the value of the next occurring <FieldValue>.
How could I do this? Is there any library for it? Or which mechanisms could be used to extract those tags from a plain string?
Performance matters, so it should be as quick as possible.
The library dom4j is probably the easiest to use and understand. You need to create a Document object from the string and then access the tags via iterators. The documentation and examples can be found here.
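As a minimal sketch of that approach, assuming dom4j 2.x and assuming that the matching <FieldValue> simply follows the <FieldName> element later in document order (the question does not show the exact SOAP structure), you could walk the tree with element iterators:

import java.util.Iterator;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public class FieldValueFinder {

    // Returns the text of the first <FieldValue> that appears after a <FieldName>
    // whose text equals the given indicator, or null if nothing matches.
    public static String findFieldValue(String xml, String indicator) throws Exception {
        Document doc = DocumentHelper.parseText(xml);   // build the Document from the string
        return scan(doc.getRootElement(), indicator, new boolean[1]);
    }

    // Depth-first walk in document order; the one-element array carries the "indicator seen" flag.
    private static String scan(Element current, String indicator, boolean[] seenIndicator) {
        if ("FieldName".equals(current.getName()) && indicator.equals(current.getTextTrim())) {
            seenIndicator[0] = true;
        } else if (seenIndicator[0] && "FieldValue".equals(current.getName())) {
            return current.getTextTrim();
        }
        for (Iterator<Element> it = current.elementIterator(); it.hasNext();) {
            String result = scan(it.next(), indicator, seenIndicator);
            if (result != null) {
                return result;
            }
        }
        return null;
    }
}

findFieldValue(soapXml, "MyFieldIndicator") would then return the text of the following <FieldValue>. Note that dom4j still builds the whole document in memory; if the messages are large and performance really is critical, a streaming parser (SAX or StAX) used in the same spirit would be lighter.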
Generally you can probably use any DOM or SAX parser.
Other DOM parsers: W3C DOM, JDOM
SAX parser: JAXP
If you by any chance want to store the content in an object, try learning JAXB.

Android: DOM vs SAX vs XMLPullParser parsing?

I am parsing an XML document using a SAX parser.
I want to know which is better and faster to work with: DOM, SAX, or XMLPullParser.
It depends on what you are doing. If you have very large files then you should use a SAX parser, since it fires events and releases them again; nothing is stored in memory. With a SAX parser you can't access elements in a random way, there is no going back! DOM, on the other hand, lets you access any part of the XML file, since it keeps the whole document in memory. Hope this answers your question.
If you want to know which parser is fastest, Xerces is going to be the fastest you'll find, and a SAX parser should give you more performance than DOM.
The SAX XML parser is already available in the Android SDK:
http://developer.android.com/reference/org/xml/sax/XMLReader.html
so it is easy to access.
One aspect by which different kinds of parsers can be classified is whether they need to load the entire XML document into memory up front. Parsers based on the Document Object Model (DOM) do that: they parse XML documents into a tree structure, which can then be traversed in-memory to read its contents. This allows you to traverse a document in arbitrary order, and gives rise to some useful APIs that can be slapped on top of DOM, such as XPath, a path query language that has been specifically designed for extracting information from trees. Using DOM alone isn’t much of a benefit because its API is clunky and it’s expensive to always read everything into memory even if you don’t need to. Hence, DOM parsers are, in most cases, not the optimal choice to parse XML on Android.
There is a class of parsers that don't need to load a document up front. These parsers are stream-based, which means they process an XML document while still reading it from the data source (the Web or a disk). This implies
that you do not have random access to the XML tree as with DOM, because no internal representation of the document is being maintained. Stream parsers can be further distinguished from each other. There are push parsers that, while streaming the document, will call back to your application when encountering a new element. SAX parsers fall into this class. Then there are pull parsers, which are more like iterators or cursors: here the client must explicitly ask for the next element to be retrieved.
Source: Android in Practice.
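To make the pull style concrete, here is a minimal XmlPullParser sketch. The <name> element is a placeholder invented for illustration, since the question does not include sample XML:

import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullParserDemo {

    // Prints the text of every <name> element in the given XML string.
    public static void printNames(String xml) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));

        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            // The application "pulls" the next event instead of receiving callbacks.
            if (event == XmlPullParser.START_TAG && "name".equals(parser.getName())) {
                System.out.println(parser.nextText());   // text content of <name>
            }
            event = parser.next();
        }
    }
}

On Android you can also obtain a parser via android.util.Xml.newPullParser(), which avoids the factory boilerplate.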

Reading Huge XML File using StAX and XPath

The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick each transaction XML based on user input and send it to a processing system.
Sample content of the file:
<transactions>
<txn id="1">
<name> product 1</name>
<price>29.99</price>
</txn>
<txn id="2">
<name> product 2</name>
<price>59.59</price>
</txn>
</transactions>
The (technical) user is expected to give an input tag name, like <txn>.
We would like to make this solution more generic. The file content might be different, and users can give an XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or FTP
Since the file size is huge, we can't load the entire file in JVM
Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.
Looking for suggestions. Thanks in advance.
If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.
SAX is the old streaming API, with StAX newer, and a great deal faster. Either using the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked for your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you found the result was easier to navigate. But it sounds like that might be overkill if the markup is simple.
This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
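A minimal sketch of that idea with the JDK's javax.xml.stream API, using the element and attribute names from the sample document above (the class and method names are made up for illustration):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TxnExtractor {

    // Streams the file and prints the <price> of the <txn> whose id matches wantedId.
    public static void printPriceForTxn(String file, String wantedId) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(file));
        boolean insideWantedTxn = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("txn".equals(name)) {
                        insideWantedTxn = wantedId.equals(reader.getAttributeValue(null, "id"));
                    } else if (insideWantedTxn && "price".equals(name)) {
                        System.out.println(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "txn".equals(reader.getLocalName())) {
                    insideWantedTxn = false;
                }
            }
        } finally {
            reader.close();
        }
    }
}

Only the events of the current transaction are held at any time, so the 10GB file never has to fit in memory; if Woodstox is on the classpath, XMLInputFactory.newInstance() will typically pick it up without code changes.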
StAX and XPath are very different things. StAX allows you to parse a streaming XML document in a forward direction only. XPath lets you navigate a document in any direction, which requires an in-memory model. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.
I blogged about it here. It uses a SAX rather than a StAX parser, but it may be worth a look.
It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
<p:iteration-source select="//transactions/txn"/>
<!-- your processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc:
<p:wrap-sequence wrapper="transactions"/>
Hope this helps
A fun solution for processing huge XML files (>10GB):
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM-based approach.
Use JAXB to read the parts from those byte positions.
Details can be found in the example of Wikipedia dumps (17GB) in this SO answer: https://stackoverflow.com/a/43367629/1485527
Streaming Transformations for XML (STX) might be what you need.
Do you need to process it fast, or do you need fast lookups in the data? These requirements call for different approaches.
For fast reading of the whole data set, StAX will be fine.
If you need fast lookups, you may need to load the data into some database, e.g. Berkeley DB XML.

XML Query in Java?

I am new to this validation process in Java...
--> XML file named "Validation Limits"
--> Structure of the XML:
<parameter> </parameter>
<lowerLimit> </lowerLimit>
<upperLimit> </upperLimit>
<enable> </enable>
--> Depending on the enable status, 'true' or 'false', I must perform the validation process for the respective parameter.
--> What could be the best possible method to perform this operation?
I have parsed the XML (DOM) [forgot to mention this earlier] and stored the values in arrays, but this gets complicated with all the referencing that takes place from one array to another. Any better method that could replace the array procedure would be helpful.
Thank you in advance.
Try using a DOM or SAX parser; they will do the parsing for you. You can find some good, free tutorials on the internet.
The difference between DOM and SAX is as follows: DOM loads the XML into a tree structure which you can browse through (i.e. the whole XML is loaded), whereas SAX parses the document and triggers events (calls methods) in the process. Both have advantages and disadvantages, but personally, for reasonably sized XML files, I would use DOM.
So, in your case: use DOM to get a tree of your XML document, locate the enable element, and read the other elements depending on its value.
Also, you can achieve this in pure XML, using XML Schema, although this might be too much for simple needs.
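For the DOM route, here is a minimal sketch. It assumes that <lowerLimit>, <upperLimit> and <enable> are children of each <parameter> element, which the structure listed above suggests but does not show explicitly; the file name is made up:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LimitReader {

    public static void readLimits() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("validation-limits.xml"));

        NodeList parameters = doc.getElementsByTagName("parameter");
        for (int i = 0; i < parameters.getLength(); i++) {
            Element parameter = (Element) parameters.item(i);
            // only validate when the enable flag is "true"
            if (Boolean.parseBoolean(text(parameter, "enable"))) {
                double lower = Double.parseDouble(text(parameter, "lowerLimit"));
                double upper = Double.parseDouble(text(parameter, "upperLimit"));
                // ... run the validation for this parameter using lower and upper ...
            }
        }
    }

    // Text content of the first descendant element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }
}

Working directly on the Element objects like this avoids copying everything into parallel arrays and the cross-referencing that comes with them.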

XML Parsing in Java [duplicate]

Possible Duplicate:
Best method to parse various custom XML documents in Java
Hi all,
I am a beginner in Java. I hope the question I am asking is an easy one. My question is: if I have an XML file, how can I parse it to get only the elements within a specific tag?
For example, if the XML file looks like:
<date>2005-10-31</date>
<number>12345</number>
<purchased-by>
<name>My name</name>
<address>My address</address>
</purchased-by>
<order-items>
<item>
<code>687</code>
<type>CD</type>
<label>Some music</label>
</item>
<item>
<code>129851</code>
<type>DVD</type>
<label>Some video</label>
</item>
</order-items>
And from this XML I want to parse only the elements within the tag named order-items.
Is there any generic way to do this? Please let me know.
Thanks
As said in the comments, a short Google search should bring you to the Sun examples on how to do this. Basically, you have two main XML parsing methods in Java:
SAX, where you use a handler to grab only what you want from your XML and ditch the rest
DOM, which parses your file in full and allows you to grab all elements in a more tree-like fashion.
Another very useful XML parsing method, albeit a little more recent than these and included in the JRE only since Java 6, is StAX. StAX was conceived as a middle ground between the tree-based approach of DOM and the event-based approach of SAX. It is quite similar to SAX in that parsing very large documents is easy, but in this case the application "pulls" info from the parser instead of the parser "pushing" events to the application. You can find more explanation on this subject here.
So, depending on what you want to achieve, you can use one of these approaches.
If you want to limit the parsing operation itself to the <order-items> element, then you'll have to use SAX. A SAX parser visits all elements of the input "file" (or stream), and you can tell your handler to ignore anything that is not <order-items> or one of its children. The result can then be a document (or your own data structure) containing these elements only.
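A minimal sketch of such a handler, using the element names from the sample above (the file name is made up):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class OrderItemsHandler extends DefaultHandler {

    private boolean insideOrderItems;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("order-items".equals(qName)) {
            insideOrderItems = true;
        }
        text.setLength(0);   // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideOrderItems) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("order-items".equals(qName)) {
            insideOrderItems = false;
        } else if (insideOrderItems
                && ("code".equals(qName) || "type".equals(qName) || "label".equals(qName))) {
            System.out.println(qName + " = " + text.toString().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("order.xml"), new OrderItemsHandler());
    }
}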
If the xml documents are rather small and performance is not a limiting factor, then simply parse the whole document (that's a 2-liner) and use XPath expressions to select the correct nodes.
Use XPath. It lets you select nodes based on their name and loads of other conditions. There is very little code involved in the setup.
IBM Example
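For instance, a minimal sketch with the JDK's built-in XPath API (the file name is made up) that selects every <item> under <order-items> and prints its <label>:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OrderItemXPath {

    public static void printItemLabels() throws Exception {
        // Build a DOM first; the JDK XPath implementation evaluates against it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("order.xml"));

        // Select every <item> element that sits under <order-items>.
        NodeList items = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//order-items/item", doc, XPathConstants.NODESET);

        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(item.getElementsByTagName("label").item(0).getTextContent());
        }
    }
}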
It is a classic case for SAX. Register a handler that receives the tags and ignores all tags other than order-items.
A probably better way is to use Apache Digester, but it is overkill for your specific task.
You can use a DOM Parser to build a Document and then extract whatever elements you want using the getElementsByTagName method.
Here is some sample code to help you get started:
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//parse file and build Document (the surrounding method must handle ParserConfigurationException, SAXException and IOException)
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("file.xml"));
//get list of elements called order-items
NodeList orderItemsNodes = doc.getElementsByTagName("order-items");
//iterate over the elements
for (int i = 0; i < orderItemsNodes.getLength(); i++) {
    Node orderItemNode = orderItemsNodes.item(i);
    // ... work with each order-items node here ...
}
It honestly depends on how you are planning to use the item data. If you want to parse it into objects and then work with them, I would use JAXB unmarshalling, but if you just want to strip the string values from the code, type, and label elements of each item element, you might just consider simple regex matching on the XML string: match the content of each item tag, then match each child element and extract its value.
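For the JAXB route, a minimal sketch could look like the following. It assumes the javax.xml.bind API (bundled with the JDK up to Java 8, a separate dependency on newer JDKs) and maps only the three child elements shown in the sample:

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "item")
public class Item {

    @XmlElement public String code;
    @XmlElement public String type;
    @XmlElement public String label;

    public static void main(String[] args) throws Exception {
        String xml = "<item><code>687</code><type>CD</type><label>Some music</label></item>";
        Item item = (Item) JAXBContext.newInstance(Item.class)
                .createUnmarshaller()
                .unmarshal(new StringReader(xml));
        System.out.println(item.type + ": " + item.label);
    }
}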

Categories

Resources