XML Parsing / Dom Manipulation in Java


I'm trying to figure out how best to translate this:
<Source><properties>
....
<name>wer</name>
<delay>
<type>Deterministic</type>
<parameters length="1">
<param value="78" type="Time"/>
</parameters>
</delay>
<batchSize>
<type>Cauchy</type>
<parameters length="2">
<param value="23" type="Alpha"/>
<param value="7878" type="Beta"/>
</parameters>
</batchSize>
...
</properties></Source>
Into:
<Source><properties>
....
<name>wer</name>
<delay>
<Deterministic Time="78"/>
</delay>
<batchSize>
<Cauchy Alpha="23" Beta="7878"/>
</batchSize>
........
</properties></Source>
I've tried using DocumentBuilderFactory, but while I can access the value of the name tag, I cannot access the values in the delay/batchSize sections. This is the code I used:
Element prop = (Element)propertyNode;
NodeList nodeIDProperties = prop.getElementsByTagName("name");
Element nameElement = (Element)nodeIDProperties.item(0);
NodeList textFNList = nameElement.getChildNodes();
String nodeNameValue = ((org.w3c.dom.Node)textFNList.item(0)).getNodeValue().trim();
//--------
NodeList delayNode = prop.getElementsByTagName("delay");
Calling getElementsByTagName("type") or getElementsByTagName("parameters") doesn't seem to return anything I can work with. Am I missing something, or is there a cleaner way to process the existing XML?
The documents need to be in the defined format to allow for marshalling and unmarshalling by Castor.
Any help would be much appreciated.

There are a variety of ways to convert the XML.
1) You can use XSLT (XSL Transformations) to transform the XML. It is an XML-based language used to transform XML documents into other XML documents, text, or HTML. The syntax is hard to learn, but it is a powerful tool for XML conversion. Here is a tutorial. For using XSLT with Java I would recommend Saxon, which also comes with good documentation. The big plus of XSLT is that the conversion can be externalized in a separate template, so your Java code is not cluttered with the translation logic. However, as mentioned, the learning curve is definitely steeper.
2) You can use XPath to select the nodes easily. XPath is a query language for selecting nodes in an XML document. XPath is also used in XSLT, by the way. E.g. the XPath query
//delay[type = 'Deterministic']/parameters/param/@value
selects the value attribute of every param element under parameters inside a delay element whose type child has the value "Deterministic". Here is a nice web application for testing XPath queries. Here is a tutorial on how to use XPath in Java, and here is one about XPath in general. You can use XPath expressions to select the right nodes in your Java code. IMHO this is far more readable and maintainable than using the DOM object model directly (which is also awkward from time to time, as you have already learned).
3) You can use Smooks for doing XML transformations. This is especially useful if the transformation gets rather complex. Smooks populates an object model from the input XML and outputs the result XML via a templating mechanism, using either Freemarker or XSL templates. Smooks has a very high throughput and is used in high-performance environments like ESBs (e.g. JBoss ESB, Apache ServiceMix). It might be overpowered for your scenario, though.
4) You could use Freemarker to do the transformation. I have no experience with this, but from what I have heard it is fairly simple to use. See the "Declarative XML processing" section of the documentation (also take a look at "Exposing XML documents" to learn how to read the source XML). If you try your luck with this approach, I would love to hear about it.
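To make option 2 concrete, here is a minimal sketch that evaluates the query with the JDK's built-in javax.xml.xpath API, with the attribute step written as @value. The class name XPathSelectSketch and the shortened sample document are assumptions made for illustration; they are not part of the original question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and the sample data are illustrative only.
final class XPathSelectSketch {

    // Shortened version of the <delay> fragment from the question.
    static final String SAMPLE =
          "<Source><properties><name>wer</name><delay>"
        + "<type>Deterministic</type><parameters length=\"1\">"
        + "<param value=\"78\" type=\"Time\"/></parameters>"
        + "</delay></properties></Source>";

    /** Returns the text of every node matched by the expression, in document order. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute nodes this is the attribute value.
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            select(SAMPLE, "//delay[type = 'Deterministic']/parameters/param/@value"));
    }
}
```

The same select helper can be pointed at the type and parameters children that were hard to reach through getElementsByTagName, for example with the expression //batchSize/type.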

This looks like a job for XPATH or some other XML transformation API.
Check out: http://www.ibm.com/developerworks/library/x-javaxpathapi.html

Although XSLT is probably the best way to do this, if you want to use a JVM programming language and learn a different approach, you can try Scala's XML transformation library.
Some blog posts:
http://scala.sygneca.com/code/xml-pattern-matching
http://debasishg.blogspot.com/2006/08/xml-integration-in-java-and-scala.html

XSLT was the way forward in the end. It's actually pretty easy to use, and the w3schools example is a good place to start.
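For readers who want to see roughly what such a stylesheet could look like, here is a sketch run through the JDK's javax.xml.transform API. It is not the stylesheet the asker ended up with (that was never posted); the class name XsltSketch, the embedded stylesheet, and the rule "any element with both a type and a parameters child gets collapsed" are assumptions chosen to reproduce the target format from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch; the stylesheet is an assumption, not the asker's actual solution.
final class XsltSketch {

    // Identity transform plus one rule: an element that has <type> and <parameters>
    // children is rewritten as <element><TypeName Param1="..." Param2="..."/></element>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='*[type and parameters]'>"
        + "<xsl:copy>"
        + "<xsl:element name='{normalize-space(type)}'>"
        + "<xsl:for-each select='parameters/param'>"
        + "<xsl:attribute name='{@type}'><xsl:value-of select='@value'/></xsl:attribute>"
        + "</xsl:for-each>"
        + "</xsl:element>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Shortened version of the source document from the question.
    static final String SAMPLE =
          "<Source><properties><name>wer</name>"
        + "<delay><type>Deterministic</type><parameters length=\"1\">"
        + "<param value=\"78\" type=\"Time\"/></parameters></delay>"
        + "<batchSize><type>Cauchy</type><parameters length=\"2\">"
        + "<param value=\"23\" type=\"Alpha\"/><param value=\"7878\" type=\"Beta\"/>"
        + "</parameters></batchSize></properties></Source>";

    /** Applies STYLESHEET to the given XML and returns the result as a string. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE));
    }
}
```

In a real project the stylesheet would normally live in its own .xsl file and be loaded with new StreamSource(file), which keeps the conversion rules out of the Java code.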

Related

Dom4j vs JAXB for reading and updating large and complex XML files

I have an XML file with a stable tree structure and more than 5000 elements.
A fraction of it is below:
<Companies>
<Offices>
<RevenueInfo>
<TransactionId>14042015014606877</TransactionId>
<Company>
<Identification>
<GlobalId>25142400905</GlobalId>
<BranchId>373287734</BranchId>
<GeoId>874</GeoId>
<LastUpdated>2015-04-14T01:46:06.940</LastUpdated>
<RecordType>7785</RecordType>
</Identification>
<Info>
<DataEntry>
<EntryId>12345</EntryId>
</DataEntry>
<DataEntry>
<EntryId>34567</EntryId>
</DataEntry>
<DataEntry>
<EntryId>89076</EntryId>
</DataEntry>
<DataEntry>
<EntryId>13211</EntryId>
</DataEntry>
</Info>
...more elements
</Company>
</RevenueInfo>
</Offices>
</Companies>
I need to be able to update any of the values in the document based on user input and create a new XML file with the updated information. The user will pass the BranchId, the name of the element to update, and its ordinal position if the element occurs more than once (for example, for EntryId 12345 the user will pass 373287734 EntryId=1 010101).
I've been looking at JAXB but it seems like a considerable effort to create the model classes for this kind of XML but it also seems like it would make printing to file and locating the element to update a lot easier.
Dom4j seems to have good performance results too, but not sure how parsing will be.
My question is, is JAXB the best approach in this case or can you suggest a better way to parse this type of XML?

In my experience JAXB only works well when the schema is simple and stable. In other cases you are better off using a generic tree model. The main generic models in the Java world are DOM, JDOM2, DOM4J, XOM, AXIOM. My own preferences are JDOM2 and XOM; DOM4J seems to me overcomplex, and somewhat old-fashioned. But it depends what you are looking for.
But then, the application you describe looks an ideal candidate for an "XML end-to-end" or XRX approach - XForms, XSLT, XQuery, XProc. You don't need Java at all.

Leaving performance and memory requirements aside, I would recommend trying XPath together with DOM4J (or JDOM, or even plain DOM). To select the company you could use an XPath expression like this:
"//Company[Identification/BranchId = '373287734']"
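Then, using the returned company element as the context node, you can get the element to be updated with a second, relative XPath expression. The leading dot keeps the search inside that company (a bare // would start again from the document root), and the parentheses make [1] pick the first match overall rather than the first EntryId within each parent:
"(.//EntryId)[1]"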
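A minimal sketch of this two-step lookup, using plain org.w3c.dom and javax.xml.xpath from the JDK rather than DOM4J. The class name BranchUpdateSketch, the method signature, and the trimmed sample document are assumptions for illustration. User input is bound through XPath variables instead of being concatenated into the expression, and the n-th occurrence is picked from the result list in Java.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; names and sample data are illustrative only.
final class BranchUpdateSketch {

    static final String SAMPLE =
          "<Companies><Company><Identification><BranchId>373287734</BranchId></Identification>"
        + "<Info><DataEntry><EntryId>12345</EntryId></DataEntry>"
        + "<DataEntry><EntryId>34567</EntryId></DataEntry></Info></Company></Companies>";

    /** Sets the text of the occurrence-th (1-based) elementName inside the matching company. */
    static String update(String xml, String branchId, String elementName,
                         int occurrence, String newValue) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Bind user input as XPath variables instead of building the expression by hand.
            Map<String, Object> vars = new HashMap<>();
            vars.put("branchId", branchId);
            vars.put("name", elementName);
            xpath.setXPathVariableResolver(qname -> vars.get(qname.getLocalPart()));

            Node company = (Node) xpath.evaluate(
                    "//Company[Identification/BranchId = $branchId]", doc, XPathConstants.NODE);
            if (company == null) {
                throw new IllegalArgumentException("No company with BranchId " + branchId);
            }
            // Relative expression: only descendants of the selected company are searched.
            NodeList matches = (NodeList) xpath.evaluate(
                    ".//*[local-name() = $name]", company, XPathConstants.NODESET);
            if (occurrence < 1 || occurrence > matches.getLength()) {
                throw new IllegalArgumentException("No occurrence " + occurrence + " of " + elementName);
            }
            matches.item(occurrence - 1).setTextContent(newValue);

            // Serialize the modified DOM with an identity transform.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Update failed", e);
        }
    }
}
```

With the input from the question, a call such as update(xml, "373287734", "EntryId", 1, "010101") would correspond to the user passing 373287734 EntryId=1 010101, assuming the last token is the new value.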

Reading Huge XML File using StAX and XPath

The input file contains thousands of transactions in XML format which is around 10GB of size. The requirement is to pick each transaction XML based on the user input and send it to processing system.
The sample content of the file
<transactions>
<txn id="1">
<name> product 1</name>
<price>29.99</price>
</txn>
<txn id="2">
<name> product 2</name>
<price>59.59</price>
</txn>
</transactions>
The (technical) user is expected to give the input tag name, like <txn>.
We would like to make this solution more generic. The file content might be different, and users can give an XPath expression like "//transactions/txn" to pick individual transactions.
There are a few technical things we have to consider here:
The file can be in a shared location or FTP
Since the file size is huge, we can't load the entire file in JVM
Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.
Looking for suggestions. Thanks in advance.

If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]
As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:
Document document = getParser().parse( source );
After this your 10GB of XML will be represented in memory (plus whatever overhead) — probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for an XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its DOM over StAX, so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM). This will be true of all XPath implementations, so if you don't truly need XPath sticking with just the events parser would be recommended.
SAX is the old streaming API, with StAX newer, and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked for your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you found the result was easier to navigate. But it sounds like that might be overkill if the markup is simple.
This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
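As a rough illustration of that filter idea, the sketch below uses the JDK's javax.xml.stream cursor API to hand each matching record to a callback without building a DOM for the whole file. The class name TxnStreamSketch, the Map-based record representation, and the assumption that each record contains only attributes and text-only child elements are mine, not the answer's.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; assumes records hold only attributes and text-only children.
final class TxnStreamSketch {

    static final String SAMPLE =
          "<transactions>\n<txn id=\"1\">\n<name> product 1</name>\n<price>29.99</price>\n</txn>\n"
        + "<txn id=\"2\">\n<name> product 2</name>\n<price>59.59</price>\n</txn>\n</transactions>";

    /** Streams the input and passes every element named recordName to the handler. */
    static void forEachRecord(Reader in, String recordName, Consumer<Map<String, String>> handler) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(recordName)) {
                        handler.accept(readRecord(reader));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Streaming parse failed", e);
        }
    }

    /** Reads one record; attributes are stored under "@name", children under their tag name. */
    private static Map<String, String> readRecord(XMLStreamReader reader) throws XMLStreamException {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            record.put("@" + reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        // nextTag() skips whitespace; the loop ends at the record's own end tag.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String childName = reader.getLocalName();
            record.put(childName, reader.getElementText().trim());
        }
        return record;
    }

    /** Convenience for small inputs and demos only; a 10GB file should use a handler instead. */
    static List<Map<String, String>> readAll(String xml, String recordName) {
        List<Map<String, String>> records = new ArrayList<>();
        forEachRecord(new StringReader(xml), recordName, records::add);
        return records;
    }

    public static void main(String[] args) {
        forEachRecord(new StringReader(SAMPLE), "txn", System.out::println);
    }
}
```

Only one record is held in memory at a time when a handler is used, so memory use stays bounded by the size of the largest record rather than the size of the file. Matching on a full path such as //transactions/txn would need extra bookkeeping (for example a stack of open element names), which this sketch leaves out.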

StAX and XPath are very different things. StAX lets you read a streaming XML document in the forward direction only, whereas XPath can navigate in both directions (for example, to parents and preceding siblings). StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.
Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?

We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: It extracts partial DOM trees that can be conveniently queried using XPATH.
I blogged about it here. It uses a SAX parser rather than StAX, but it may be worth a look.

It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)
In this situation, you will have to use
<p:for-each>
<p:iteration-source select="//transactions/txn"/>
<!-- your processing on a small file -->
</p:for-each>
You can even wrap each of the resulting transformations with a single line of XProc
<p:wrap-sequence wrapper="transactions"/>
Hope this helps

A fun solution for processing huge XML files >10GB.
Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM based approach.
Use Jaxb to read parts from byte position
Find details at the example of wikipedia dumps (17GB) in this SO answer https://stackoverflow.com/a/43367629/1485527

Streaming Transformations for XML (STX) might be what you need.

Do you need to process it fast, or do you need fast lookups in the data? These requirements need different approaches.
For fast reading of the whole data, StAX will be OK.
If you need fast lookups, then you may need to load it into a database, e.g. Berkeley DB XML.

A nice Java XML DOM utility

I find myself writing the same verbose DOM manipulation code again and again:
Element e1 = document.createElement("some-name");
e1.setAttribute("attr1", "val1");
e1.setAttribute("attr2", "val2");
document.appendChild(e1);
Element e2 = document.createElement("some-other-name");
e1.appendChild(e2);
// Etc, the same for attributes and finding the nodes again:
Element e3 = (Element) document.getElementsByTagName("some-other-name").item(0);
Now, I don't want to switch architecture altogether, i.e. I don't want to use JDOM, JAXB, or anything else. Just Java's org.w3c.dom. The reasons for this are:
It's about an old and big legacy system
The XML is used in many places and XSLT transformed several times to get XML, HTML, PDF output
I'm just looking for convenience, not a big change.
I'm just wondering if there is a nice wrapper library (e.g. with apache commons or google) that allows me to do things like this with a fluent style similar to jRTF:
// create a wrapper around my DOM document and manipulate it:
// like in jRTF, this code would make use of static imports
dom(document).add(
element("some-name")
.attr("attr1", "val1")
.attr("attr2", "val2")
.add(element("some-other-name")),
element("more-elements")
);
and then
Element e3 = dom(document).findOne("some-other-name");
The important requirement I have here is that I explicitly want to operate on a org.w3c.dom.Document that
already exists
is pretty big
needs quite a bit of manipulation
So transforming the org.w3c.dom.Document into JDOM, dom4j, etc seems like a bad idea. Wrapping it with adapters is what I'd prefer.
If it doesn't exist, I might roll my own, as this jRTF syntax looks really nice! And for XML, it seems quite easy to implement, as there are only few node types. This could become as powerful as jquery from the fluent API perspective!

To elaborate my comment, Dom4J gets you pretty close to what you wanted:
final Document dom = DocumentHelper.createDocument().addElement("some-name")
.addAttribute("attr1", "val1")
.addAttribute("attr2", "val2")
.addElement("some-other-name").getDocument();
System.out.println(dom.asXML());
Output:
<?xml version="1.0" encoding="UTF-8"?>
<some-name attr1="val1" attr2="val2"><some-other-name/></some-name>
I know it's not native DOM, but it's very similar and it has very nice features for Java developers (element iterators, live element lists etc.)

I found some tools that roughly do what I asked for in my question:
http://code.google.com/p/xmltool/
http://jsoup.org/
However, in the meantime, I am more inclined to roll my own. I'm really a big fan of jquery, and I think jquery can be mapped to a Java fluent API:
http://www.jooq.org/products/jOOX

Well, this is maybe silly, but why don't you implement that little API on your own? I'm sure you know the DOM API pretty well, and it won't take much time to implement what you want.
By the way, consider using XPath for working with the document (you can also build your mini-API on top of it).
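To give an idea of how small such a home-grown wrapper can be, here is a sketch of a fluent adapter over an existing org.w3c.dom.Element. Everything in it (the class name FluentElement and the methods element, wrap, attr, add, findOne, unwrap) is hypothetical and only loosely follows the syntax wished for in the question; unlike that wish, the owning Document is passed explicitly when creating elements.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical fluent adapter; it wraps native DOM nodes and never copies them.
final class FluentElement {

    private final Element element;

    private FluentElement(Element element) {
        this.element = element;
    }

    /** Creates a new element owned by the given document. */
    static FluentElement element(Document doc, String name) {
        return new FluentElement(doc.createElement(name));
    }

    /** Wraps an element that already exists in a (possibly large) document. */
    static FluentElement wrap(Element existing) {
        return new FluentElement(existing);
    }

    FluentElement attr(String name, String value) {
        element.setAttribute(name, value);
        return this;
    }

    FluentElement add(FluentElement... children) {
        for (FluentElement child : children) {
            element.appendChild(child.element);
        }
        return this;
    }

    /** Returns the first descendant with the given tag name, or null if there is none. */
    Element findOne(String tagName) {
        return (Element) element.getElementsByTagName(tagName).item(0);
    }

    Element unwrap() {
        return element;
    }

    /** Small demo that builds the structure from the question on a fresh document. */
    static FluentElement demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            FluentElement root = element(doc, "some-name")
                    .attr("attr1", "val1")
                    .attr("attr2", "val2")
                    .add(element(doc, "some-other-name"));
            doc.appendChild(root.unwrap());
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }
}
```

Because the wrapper only holds a reference to the native Element, it can be applied to a large existing document with wrap(document.getDocumentElement()) without converting anything to another object model.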

Interpret a rule applying multiple xpath queries on multiple XML documents

I need to build a component which would take a few XML documents in input and check the following kind of rules:
XML1:/bookstore/book[price>35.00] != null
and (XML2:/city/name = 'Montreal'
or XML3://customer[@language] contains 'en')
Basically my component should be able to:
substitute the XML tokens with the corresponding XML document(before colon)
apply xpath query on this XML document
check the xpath output against expected result ("=", "!=", "contains")
follow the basic syntax ("and", "or" and parentheses)
tell if the rule is true or false
Do you know any library which could help me? maybe JavaCC?
Thanks

For evaluating XPATHs I recommend JAXEN.
Jaxen is an open source XPath library
written in Java. It is adaptable to
many different object models,
including DOM, XOM, dom4j, and JDOM.
It is also possible to write adapters
that treat non-XML trees such as
compiled Java byte code or Java beans
as XML, thus enabling you to query
these trees with XPath too.
The Java XPath API (Java 5 / javax.xml.xpath) is also an option, but I haven't tried it yet.
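As a rough sketch of the javax.xml.xpath option, the class below registers each document under its token (XML1, XML2, ...) and exposes the three checks from the question as methods. It does not parse the rule syntax itself: the "and", "or" and parentheses are written directly in Java here, and a parser (JavaCC or otherwise) would have to produce the equivalent calls. The class name RuleSketch, its method names, and the sample documents are assumptions.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: evaluates individual rule clauses; rule parsing is out of scope.
final class RuleSketch {

    private final Map<String, Document> documents = new HashMap<>();
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    /** Parses the XML and makes it available under the given token, e.g. "XML1". */
    RuleSketch register(String token, String xml) {
        try {
            documents.put(token, DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))));
            return this;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse document " + token, e);
        }
    }

    /** Clause of the form "TOKEN:expr != null": true if the expression selects anything. */
    boolean exists(String token, String expression) {
        try {
            return (Boolean) xpath.evaluate(expression, document(token), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    /** Clause of the form "TOKEN:expr = 'value'", compared on the first match's string value. */
    boolean isEqual(String token, String expression, String expected) {
        return stringValue(token, expression).equals(expected);
    }

    /** Clause of the form "TOKEN:expr contains 'value'". */
    boolean contains(String token, String expression, String fragment) {
        return stringValue(token, expression).contains(fragment);
    }

    private String stringValue(String token, String expression) {
        try {
            return xpath.evaluate(expression, document(token));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    private Document document(String token) {
        Document doc = documents.get(token);
        if (doc == null) {
            throw new IllegalArgumentException("Unknown document token " + token);
        }
        return doc;
    }

    /** The rule from the question, hand-translated, against made-up sample documents. */
    static boolean demo() {
        RuleSketch rules = new RuleSketch()
                .register("XML1", "<bookstore><book><price>40.00</price></book></bookstore>")
                .register("XML2", "<city><name>Toronto</name></city>")
                .register("XML3", "<customers><customer language=\"en-CA\"/></customers>");
        return rules.exists("XML1", "/bookstore/book[price>35.00]")
                && (rules.isEqual("XML2", "/city/name", "Montreal")
                    || rules.contains("XML3", "//customer/@language", "en"));
    }
}
```

The contains check reads the language attribute with //customer/@language, which is one possible interpretation of the third clause in the question.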

Somebody on the JavaCC mailing list pointed me to the right direction, mentioning Schematron. It led me to Probatron which seems to be the best java implementation available.
The Schematron web site claims that the language supports "jump across links and between XML documents to check constraints", but it seems Probatron doesn't allow that. I may need to tweak it or find a trick for that (like building a temporary XML document containing all my source documents). Apart from that, it looks like Probatron is the right library for me.

Search XML files with xalan in Java

I need to write a Java application that does a keyword search within the tags and the actual data of many XML files. From my research online I get the feeling I have to use Xalan, but I can't figure out how to use it or what it does. Could somebody point me in the right direction? Thanks

The first thing you need to do is to decide what data you're actually going to search. You say "within the tags and actual data" -- does that mean that you'll do a keyword search for an element name? Or an element name and content within it?
Depending on how complex your search queries are, you'll probably want to turn to a real search engine, like Lucene. I will say, however, that before you take this step you need to give a lot of thought to how you plan to search, so that you build an appropriate index.
If your search requirements are simpler, you could load the documents into a DOM and use XPath. I'd suggest trying this out before moving to Lucene.
You don't need Xalan; the JDK comes with XML parsers and an XPath evaluator. I've written a couple of articles on using them: (parsing), (xpath).

Xalan is an XSLT processor: it enables you to write an XSL stylesheet that will transform your source XML document into something else.
Sure, you may write an XSL transform and then search the result of the transform.
Another option is to parse the document with an XML parser and then use Lucene: see Parsing, indexing, and searching XML documents with Digester and Lucene.
You may also want to use XPath. It all depends on what exactly you want to achieve.

It sounds like you are looking for an XPath implementation for Java. This allows you to construct a search expression and apply it to one or more XML documents (which generally have to have been parsed). Xalan is one option, but there are others. Versions of Java starting with Java 5 have included XML parsing and XPath capabilities. If you are using a recent version of Java, and want to simply parse and search through a set of XML documents, then you likely need nothing besides the Java SDK.
See this article for a good (but somewhat dated) overview of the XPath capabilities that come "out of the box": http://www.ibm.com/developerworks/library/x-javaxpathapi.html

See this SO post on how to do a search using the contains() XPath function.
As for an example of how to do an XPath query, I suggest looking at the Java XPath documentation. Here's the example code they provide (using NodeList, the type that XPathConstants.NODESET results are returned as):
// needs: import javax.xml.xpath.*; import org.w3c.dom.NodeList; import org.xml.sax.InputSource;
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/widgets/widget";
InputSource inputSource = new InputSource("widgets.xml");
NodeList nodes = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
This would load the file widgets.xml and return a NodeList of all nodes matching the expression.
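Combining this with contains(), a keyword search over both element names and element text could be sketched as follows. The class name KeywordSearchSketch, the sample document, and the choice to return matching element names are assumptions for illustration; the keyword is passed as an XPath variable so that quotes in the search term cannot break the expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; names and sample data are illustrative only.
final class KeywordSearchSketch {

    static final String SAMPLE =
          "<widgets><widget><name>blue gear</name></widget>"
        + "<gearbox>spare</gearbox></widgets>";

    /** Returns the names of elements whose tag name or own text contains the keyword. */
    static List<String> matchingElements(String xml, String keyword) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Resolve $kw to the search term instead of splicing it into the expression.
            xpath.setXPathVariableResolver(
                    qname -> "kw".equals(qname.getLocalPart()) ? keyword : null);
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[contains(local-name(), $kw) or text()[contains(., $kw)]]",
                    doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Search failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchingElements(SAMPLE, "gear"));
    }
}
```

XPath 1.0 contains() is case-sensitive, so a case-insensitive search would need translate() on both sides or a post-filter in Java. For many or very large files, the indexing approach mentioned in the other answers is likely to scale better than loading each document into a DOM.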
