A nice Java XML DOM utility

I find myself writing the same verbose DOM manipulation code again and again:
Element e1 = document.createElement("some-name");
e1.setAttribute("attr1", "val1");
e1.setAttribute("attr2", "val2");
document.appendChild(e1);
Element e2 = document.createElement("some-other-name");
e1.appendChild(e2);
// Etc, the same for attributes and finding the nodes again:
Element e3 = (Element) document.getElementsByTagName("some-other-name").item(0);
Now, I don't want to switch architecture altogether, i.e. I don't want to use JDOM, JAXB, or anything else. Just Java's org.w3c.dom. The reasons for this are:
- It's about an old and big legacy system
- The XML is used in many places and XSLT transformed several times to get XML, HTML, PDF output
- I'm just looking for convenience, not a big change.
I'm just wondering if there is a nice wrapper library (e.g. with apache commons or google) that allows me to do things like this with a fluent style similar to jRTF:
// create a wrapper around my DOM document and manipulate it:
// like in jRTF, this code would make use of static imports
dom(document).add(
element("some-name")
.attr("attr1", "val1")
.attr("attr2", "val2")
.add(element("some-other-name")),
element("more-elements")
);
and then
Element e3 = dom(document).findOne("some-other-name");
The important requirement I have here is that I explicitly want to operate on an org.w3c.dom.Document that
- already exists
- is pretty big
- needs quite a bit of manipulation
So transforming the org.w3c.dom.Document into JDOM, dom4j, etc seems like a bad idea. Wrapping it with adapters is what I'd prefer.
If it doesn't exist, I might roll my own, as this jRTF syntax looks really nice! And for XML, it seems quite easy to implement, as there are only a few node types. This could become as powerful as jQuery from the fluent API perspective!

To elaborate my comment, Dom4J gets you pretty close to what you wanted:
final Document dom = DocumentHelper.createDocument().addElement("some-name")
.addAttribute("attr1", "val1")
.addAttribute("attr2", "val2")
.addElement("some-other-name").getDocument();
System.out.println(dom.asXML());
Output:
<?xml version="1.0" encoding="UTF-8"?>
<some-name attr1="val1" attr2="val2"><some-other-name/></some-name>
I know it's not native DOM, but it's very similar and it has very nice features for Java developers (element iterators, live element lists etc.)

I found some tools that roughly do what I asked for in my question:
http://code.google.com/p/xmltool/
http://jsoup.org/
However, in the meantime, I am more inclined to roll my own. I'm really a big fan of jQuery, and I think jQuery can be mapped to a Java fluent API:
http://www.jooq.org/products/jOOX

Well, this is maybe silly but why don't you implement that little API on your own? I'm sure you know DOM API pretty well and it won't take much time to implement what you want.
Btw consider using XPath for manipulation with document (you can also implement your mini-api over this one).
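A home-grown wrapper of the kind suggested here might start out like the following minimal sketch. The class FluentDom and its methods dom, element, attr, up and findOne are hypothetical names chosen for illustration; they do not come from jRTF, jOOX or any other library. The wrapper works directly on the existing org.w3c.dom.Document and never copies it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical fluent helper around an existing org.w3c.dom.Document. */
final class FluentDom {
    private final Document document;
    private final Node current;

    private FluentDom(Document document, Node current) {
        this.document = document;
        this.current = current;
    }

    /** Wraps an existing document; nothing is copied. */
    static FluentDom dom(Document document) {
        return new FluentDom(document, document);
    }

    /** Appends a new child element and returns a wrapper positioned on it. */
    FluentDom element(String name) {
        Element child = document.createElement(name);
        current.appendChild(child);
        return new FluentDom(document, child);
    }

    /** Sets an attribute; only valid while positioned on an element. */
    FluentDom attr(String name, String value) {
        ((Element) current).setAttribute(name, value);
        return this;
    }

    /** Steps back to the parent node. */
    FluentDom up() {
        return new FluentDom(document, current.getParentNode());
    }

    /** Returns the first element with the given tag name, or null if none exists. */
    Element findOne(String tagName) {
        return (Element) document.getElementsByTagName(tagName).item(0);
    }

    public static void main(String[] args) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        dom(document)
            .element("some-name")
                .attr("attr1", "val1")
                .attr("attr2", "val2")
                .element("some-other-name");
        Element e3 = dom(document).findOne("some-other-name");
        System.out.println(e3.getParentNode().getNodeName()); // prints "some-name"
    }
}
```

Returning a new wrapper from element() keeps the chain positioned on the element just created, which is one of several possible designs; a varargs add(...) in the style of the question would need a separate builder type for detached elements.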

Related

Dom4j vs JAXB for reading and updating large and complex XML files

I have an XML file with a stable tree structure and more than 5000 elements.
A fraction of it is below:
<Companies>
<Offices>
<RevenueInfo>
<TransactionId>14042015014606877</TransactionId>
<Company>
<Identification>
<GlobalId>25142400905</GlobalId>
<BranchId>373287734</BranchId>
<GeoId>874</GeoId>
<LastUpdated>2015-04-14T01:46:06.940</LastUpdated>
<RecordType>7785</RecordType>
</Identification>
<Info>
<DataEntry>
<EntryId>12345</EntryId>
</DataEntry>
<DataEntry>
<EntryId>34567</EntryId>
</DataEntry>
<DataEntry>
<EntryId>89076</EntryId>
</DataEntry>
<DataEntry>
<EntryId>13211</EntryId>
</DataEntry>
</Info>
...more elements
</Company>
</RevenueInfo>
</Offices>
</Companies>
I need to be able to update any of the values in the document based on user input and create a new XML file with the updated information. The user will pass the BranchId, the name of the element to update, and its ordinal position if the element occurs multiple times (for example, for EntryId 12345 the user will pass 373287734 EntryId=1 010101).
I've been looking at JAXB but it seems like a considerable effort to create the model classes for this kind of XML but it also seems like it would make printing to file and locating the element to update a lot easier.
Dom4j seems to have good performance results too, but not sure how parsing will be.
My question is, is JAXB the best approach in this case or can you suggest a better way to parse this type of XML?

In my experience JAXB only works well when the schema is simple and stable. In other cases you are better off using a generic tree model. The main generic models in the Java world are DOM, JDOM2, DOM4J, XOM, AXIOM. My own preferences are JDOM2 and XOM; DOM4J seems to me overcomplex, and somewhat old-fashioned. But it depends what you are looking for.
But then, the application you describe looks an ideal candidate for an "XML end-to-end" or XRX approach - XForms, XSLT, XQuery, XProc. You don't need Java at all.

Leaving performance and memory requirements aside, I would recommend trying XPath together with DOM4J (or JDOM, or even plain DOM). To select the company you could use an XPath expression like this:
"//Company[Identification/BranchId = '373287734']"
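Then, using the returned company element as context, you can get the element to be updated with a relative XPath expression (the leading dot keeps the search inside the company, and the parentheses make the index apply to the whole result rather than to each parent):
"(.//EntryId)[1]"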

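As a rough illustration of the XPath approach with plain DOM, the sketch below locates the company by BranchId and then updates the n-th matching descendant. The class EntryUpdater and its update method are hypothetical names. The BranchId is bound as an XPath variable rather than concatenated into the expression, and the n-th element is picked with getElementsByTagName, which returns descendants in document order.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: update the n-th element of a given name inside one company. */
final class EntryUpdater {
    /** Returns true if an element was found and updated; index is 1-based. */
    static boolean update(Document doc, String branchId, String elementName,
                          int index, String newValue) throws Exception {
        if (index < 1) return false;
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Supply the user input as the XPath variable $branch.
        xpath.setXPathVariableResolver(
                name -> "branch".equals(name.getLocalPart()) ? branchId : null);
        Element company = (Element) xpath.evaluate(
                "//Company[Identification/BranchId = $branch]", doc, XPathConstants.NODE);
        if (company == null) return false;
        // Descendants come back in document order, so item(index - 1) is the n-th match.
        Node target = company.getElementsByTagName(elementName).item(index - 1);
        if (target == null) return false;
        target.setTextContent(newValue);
        return true;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Companies><Company><Identification><BranchId>373287734</BranchId>"
                + "</Identification><Info><DataEntry><EntryId>12345</EntryId></DataEntry>"
                + "<DataEntry><EntryId>34567</EntryId></DataEntry></Info></Company></Companies>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        update(doc, "373287734", "EntryId", 1, "010101");
        System.out.println(doc.getElementsByTagName("EntryId").item(0).getTextContent());
    }
}
```

Writing the updated document back to a file can then be done with a javax.xml.transform.Transformer and a StreamResult.
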
Java XML Reading an XML file from top to bottom

I want to read an XML starting from the top to the bottom using Java. However, I don't want to use recursive functions because I want to be able to jump to a different element and start reading from that position.
I've tried using getParent() and indexOf() methods (All three libraries below have these methods) to do this, but it's gotten very messy, mainly because the methods don't distinguish between attributes and elements.
I'm sure there must be a simple way to do this, but after trying dom4j, jdom, and xom, I still have not found a solution.
[Edit] More Information:
My friend wants to make a console text-based game in a question/answer type style. Instead of hard-coding it into java, I decided to try and make it read an XML file instead, because XML has a tree-like style that would be convenient. Here is an example of what my XML file might look like:
<disp>Text to be displayed</disp>
<disp>Text to be displayed afterward</disp>
<disp>What is your favorite color?</disp>
<question>
<answer name="orange">
<disp>Good choice.</disp>
<!-- More questions and stuff -->
</answer>
<default>
<disp>Wrong. The correct answer was orange.</disp>
</default>
</question>
I don't know if it is taboo to use XML like a pseudo programming language. If anyone has other suggestions, feel free to give them.

Your design is basically good and an example of Declarative Programming. You should read your XML files using an XML parser either into a DOM or using SAX. Since I think you will want to revisit nodes I suspect you will need a DOM (FWIW I use XOM, xom.nu). One of the best examples of XML-based declarative programming is XSLT where the data and commands are all XML.
I use this model a great deal. It has the advantage that the data structure can be external and can be edited.
(Note that your XML needs a root element)
"but it's gotten very messy, mainly because the methods don't distinguish between attributes and elements."
All DOM or SAX tools differentiate very clearly between attributes and elements, so if there is confusion it is somewhere else.
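To illustrate reading a DOM from top to bottom without recursion, here is a minimal sketch using only org.w3c.dom. The class DocumentOrderWalk and its methods are hypothetical names. The cursor is just a Node, so jumping elsewhere means assigning a different node to it and continuing from there. Attributes are not children in the DOM, so the walk never visits them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for a non-recursive walk in document order. */
final class DocumentOrderWalk {
    /** Returns the node that follows 'node' in document order, or null at the end. */
    static Node next(Node node) {
        if (node.getFirstChild() != null) return node.getFirstChild();
        // No children: climb until some ancestor (or the node itself) has a next sibling.
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNextSibling() != null) return n.getNextSibling();
        }
        return null;
    }

    /** Like next(), but skips text, comments and other non-element nodes. */
    static Element nextElement(Node node) {
        for (Node n = next(node); n != null; n = next(n)) {
            if (n.getNodeType() == Node.ELEMENT_NODE) return (Element) n;
        }
        return null;
    }

    /** Collects the names of all elements that come after 'start'. */
    static List<String> elementNames(Node start) {
        List<String> names = new ArrayList<>();
        for (Element e = nextElement(start); e != null; e = nextElement(e)) {
            names.add(e.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<game><disp>Hi</disp><question><answer name=\"orange\">"
                + "<disp>Good choice.</disp></answer></question></game>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(elementNames(doc)); // [game, disp, question, answer, disp]
        // Jump to another element and keep reading from that position.
        Node question = doc.getElementsByTagName("question").item(0);
        System.out.println(elementNames(question)); // [answer, disp]
    }
}
```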

It would be good if you would show what you want to do with the XML snippets, but normally if you want to read off of any kind of file, use java.util.Scanner.
// needs java.util.Scanner and java.io.File
Scanner scan = new Scanner(new File("file.xml"));
while (scan.hasNextLine()) { // hasNextLine() pairs with nextLine()
    String theData = scan.nextLine();
}
scan.close();
This should return the values that you need until it runs out of lines to scan.
Hope it works and Happy Coding!

Since you want to read an XML file, I recommend going for a SAX parser. It is an event-based parser, very fast and efficient for reading XML (top-down approach).
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/ will explain about usage of sax parser.
Thanks
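For reference, a minimal SAX sketch using the parser that ships with the JDK could look like this. The class DispReader is a hypothetical name, and the disp element is taken from the question's sample. Note that SAX is forward-only, so it reads top to bottom but cannot jump back to an earlier element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: collect the text of every <disp> element in reading order. */
final class DispReader {
    static List<String> readDisp(String xml) throws Exception {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <disp> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("disp".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("disp".equals(qName)) {
                    texts.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<game><disp>Text to be displayed</disp>"
                + "<disp>What is your favorite color?</disp></game>";
        readDisp(xml).forEach(System.out::println);
    }
}
```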

XML Parsing / Dom Manipulation in Java

I'm trying to figure out how best to translate this:
<Source><properties>
....
<name>wer</name>
<delay>
<type>Deterministic</type>
<parameters length="1">
<param value="78" type="Time"/>
</parameters>
</delay>
<batchSize>
<type>Cauchy</type>
<parameters length="2">
<param value="23" type="Alpha"/>
<param value="7878" type="Beta"/>
</parameters>
</batchSize>
...
</properties></Source>
Into:
<Source><properties>
....
<name>wer</name>
<delay>
<Deterministic Time="78"/>
</delay>
<batchSize>
<Cauchy Alpha="23" Beta="7878"/>
</batchSize>
........
</properties></Source>
I've tried using DocumentBuilderFactory, but while I can access the value of the name tag, I cannot access the values in the delay/batchSize section. This is the code I used:
Element prop = (Element)propertyNode;
NodeList nodeIDProperties = prop.getElementsByTagName("name");
Element nameElement = (Element)nodeIDProperties.item(0);
NodeList textFNList = nameElement.getChildNodes();
String nodeNameValue = ((org.w3c.dom.Node)textFNList.item(0)).getNodeValue().trim();
//--------
NodeList delayNode = prop.getElementsByTagName("delay");
Calling getElementsByTagName("type") or "parameters" doesn't seem to return anything I can work with. Am I missing something, or is there a cleaner way to process the existing XML?
The XML needs to be in the defined format to allow for marshalling and unmarshalling by Castor.
Any help would be much appreciated.

There are a variety of ways to convert the XML.
1) You can use XSLT (XSL Transformations) to transform the XML. It is an XML-based language used to transform XML documents into other XML documents, text, or HTML. The syntax is hard to learn. However, it is a powerful tool for XML conversion. Here is a tutorial. For using XSLT with Java I would recommend Saxon, which also comes with nice documentation. The big plus of using XSLT is that the conversion can be externalized in a separate template, so your Java code is not obfuscated by the translation logic. However, as mentioned, the learning curve is definitely steeper.
2) You can use XPath to select the nodes easily. XPath is a query language for selecting nodes in an XML document. XPath is also used in XSLT, by the way. E.g. the XPath query
//delay[type = 'Deterministic']/parameters/param/@value
selects the value attribute of every param node that sits under parameters inside a delay node whose type child has the value "Deterministic". Here is a nice web application for testing XPath queries. Here is a tutorial on how to use XPath in Java and here one about XPath in general. You can use XPath expressions to select the right nodes in your Java code. IMHO this is far more readable and maintainable than using the DOM object model directly (which is also awkward from time to time, as you have already learned).
3) You can use Smooks for doing XML transformations. This is especially useful if the transformation gets rather complex. Smooks populates an object model from the input XML and outputs the result XML via a templating mechanism using either Freemarker or XSL templates. Smooks has a very high throughput and is used in high-performance environments like ESBs (e.g. JBoss ESB, Apache ServiceMix). It might be overpowered for your scenario though.
4) You could use Freemarker to do the transformation. I have no experience with this, but from what I have heard it is fairly simple to use. See the "Declarative XML processing" section of the documentation (also take a look at "Exposing XML documents" to learn how to read the source XML). Seems fairly simple to me. If you try your luck with this approach, I would love to hear about it.
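As a small illustration of option 2, the XPath query from above can be evaluated with the javax.xml.xpath API that ships with the JDK. The class DelayParams and its method are hypothetical names used only for this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical example: read param values of "Deterministic" delays via XPath. */
final class DelayParams {
    static List<String> deterministicValues(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The result is a node set of attribute nodes.
        NodeList attrs = (NodeList) xpath.evaluate(
                "//delay[type = 'Deterministic']/parameters/param/@value",
                doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            values.add(attrs.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Source><properties><name>wer</name><delay><type>Deterministic</type>"
                + "<parameters length=\"1\"><param value=\"78\" type=\"Time\"/></parameters>"
                + "</delay></properties></Source>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(deterministicValues(doc)); // [78]
    }
}
```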

This looks like a job for XPATH or some other XML transformation API.
Check out: http://www.ibm.com/developerworks/library/x-javaxpathapi.html

Although XSLT is probably the best way to do this, if you want to use a JVM programming language and you want to learn a different approach, you can try Scala's XML transformation library.
Some blog posts:
http://scala.sygneca.com/code/xml-pattern-matching
http://debasishg.blogspot.com/2006/08/xml-integration-in-java-and-scala.html

XSLT was the way forward in the end. It's actually pretty easy to use and the w3schools example is a good place to start.
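The stylesheet that was actually used is not shown, so the following is only a sketch of how such a transformation might be run from Java with the JDK's built-in javax.xml.transform API. The stylesheet is an assumption: an identity template plus one template for any element that has type and parameters children. The class name XsltSketch is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example: apply an assumed XSLT 1.0 stylesheet to the question's XML. */
final class XsltSketch {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // Identity template: copy everything that has no more specific rule.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Elements such as delay or batchSize: fold type and params into one child element.
        + "<xsl:template match='*[type and parameters]'>"
        + "<xsl:copy>"
        + "<xsl:element name='{normalize-space(type)}'>"
        + "<xsl:for-each select='parameters/param'>"
        + "<xsl:attribute name='{@type}'><xsl:value-of select='@value'/></xsl:attribute>"
        + "</xsl:for-each>"
        + "</xsl:element>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Source><properties><name>wer</name><delay><type>Deterministic</type>"
                + "<parameters length=\"1\"><param value=\"78\" type=\"Time\"/></parameters>"
                + "</delay></properties></Source>";
        System.out.println(transform(xml));
    }
}
```

In a real project the stylesheet would normally live in its own .xsl file rather than in a string constant, which is the externalization benefit mentioned in the XSLT answer.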

Error-tolerant XML parsing in Scala

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML (somehow) and deal with it the same way, otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first and when you then see a <child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.

What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).

Try the parser on the XHtml object. It is much more lenient than the one on XML.

Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.

Try Tag Soup.
JTidy does something similar but only for HTML.

I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.

I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
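A plain text search for hrefs could be sketched as follows. It is written in Java (and is therefore callable from Scala); the class HrefScan and the pattern are illustrative assumptions, not a complete HTML link extractor. The pattern only handles quoted attribute values and, as noted above, it will also match links inside comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example: pull href values out of markup without parsing it. */
final class HrefScan {
    // href="..." or href='...', case-insensitive; unquoted values are not handled.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://a.example/\">x<div><A HREF='b.html'>y</p>";
        System.out.println(hrefs(html)); // [http://a.example/, b.html]
    }
}
```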

Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here

A related topic (with my solution) is listed below:
Scala and html parsing

Is it essential that I use libraries to manipulate XML?

I am using Java back end for creating an XML string which is passed to the browser. Currently I am using simple string manipulation to produce this XML. Is it essential that I use some XML library in Java to produce the XML string?
I find the libraries very difficult to use compared to what I need.

It's not essential, but advisable. However, if string manipulation works for you, then go for it! There are plenty of cases where small or simple XML text can be safely built by hand.
Just be aware that creating XML text is harder than it looks. Here are some criteria I would consider:
First: how much control do you have over the information that goes into the XML?
The less control you have over the source data, the more likely you will have trouble, and the more advantageous the library becomes. For example: (a) Can you guarantee that the element names will never have a character that is illegal in a name? (b) How about quotes in an attribute's content? Can they happen, and are you handling them? (c) Does the data ever contain anything that might need to be encoded as an entity (like the less-than sign, which often needs to be output as &lt;); are you doing it correctly?
Second, maintainability: is the code that builds the XML easy to understand by someone else?
You probably don't want to be stuck with the code for life. I've worked with second-hand C++ code that hand-builds XML and it can be surprisingly obscure. Of course, if this is a personal project of yours, then you don't need to worry about "others": substitute "in a year" for "others" above.
I wouldn't worry about performance. If your XML is simple enough that you can hand-write it, any overhead from the library is probably meaningless. Of course, your case might be different, but you should measure to prove it first.
Finally: yes, you can build XML text by hand if it's simple enough; but not knowing the libraries available is probably not the right reason.
A modern XML library is a quite powerful tool, but it can also be daunting. However, learning the essentials of your XML library is not that hard, and it can be quite handy; among other things, it's almost a requisite in today's job marketplace. Just don't get bogged down by namespaces, schemas and other fancier features until you get the essentials.
Good luck.
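To make points (b) and (c) concrete, here is a minimal sketch of the escaping that hand-built XML needs at the very least. The class XmlEscape is a hypothetical name. It covers only the five predefined entities; it does not validate element names or filter characters that are illegal in XML, which is part of why a library is usually the safer choice.

```java
/** Hypothetical helper: escape text for element content and quoted attribute values. */
final class XmlEscape {
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String user = "Tom & \"Jerry\"";
        // Every piece of data is escaped before it is concatenated into markup.
        String xml = "<user name=\"" + escape(user) + "\">" + escape("1 < 2") + "</user>";
        System.out.println(xml);
        // prints <user name="Tom &amp; &quot;Jerry&quot;">1 &lt; 2</user>
    }
}
```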

XML is hard. Parsing it yourself is a bad idea; generating content yourself is an even worse idea. Have a look at the XML 1.1 spec.
You have to deal with such things as proper encoding, attribute encoding (e.g., an unescaped quote inside an attribute value produces invalid XML), proper CDATA escaping, UTF encoding, custom DTD entities, and that's without throwing XML namespaces into the mix, with the default / empty namespace, namespace attributes, etc.
Learn a toolkit, there's plenty available.

I think that custom string manipulation is fine, but you have to keep two things in mind:
Your code isn't as mature as the library. Allocate time in your plan to handle the bugs that pop-up.
Your approach will probably not scale as well as a 3rd party library when the xml starts to grow (both in terms of performance and ease of use).
I know a code base that uses custom string manipulation for xml output (and a 3rd party library for input). It was fine to begin with but became a real hassle after a while.

Yes, use the library.
Somebody took the time and effort to create something that is usually better than what you could come up with. String manipulation is for sending back a single node, but once you start needing to manipulate the DOM, or use an XPath query, the library will save you.

By not using a library, you risk generating or parsing data that isn't well-formed, which sooner or later will happen. For the same reason document.write isn't allowed in XHTML, you shouldn't write your XML markup as a string.

Yes.
It makes no sense to skip an essential tool: even writing XML is non-trivial, what with having to escape those ampersands and less-than signs, not to mention namespace bindings (if needed).
And in the end libraries can generally read and write XML not only more reliably but also more efficiently (especially so for Java).
But you may have been looking at the wrong tools, if they seem overcomplicated. Data binding using JAXB or XStream is simple; but for simple, straightforward XML output, I go with StaxMate. It can actually simplify the task in many ways (automatically closes start tags, writes namespace declarations if needed, etc.).
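StaxMate itself is not shown here; as a rough idea of the same streaming style, the sketch below uses the plain StAX XMLStreamWriter that ships with the JDK (javax.xml.stream), on which StaxMate builds. The class StaxOutput and its method are hypothetical names. The writer takes care of escaping attribute values and text.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical example: produce a small XML string with the JDK's StAX writer. */
final class StaxOutput {
    static String message(String from, String body) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("message");
        writer.writeAttribute("from", from); // attribute value is escaped by the writer
        writer.writeCharacters(body);        // text content is escaped by the writer
        writer.writeEndElement();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(message("Tom & \"Jerry\"", "1 < 2"));
    }
}
```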

No - If you can parse it yourself (as you are doing), and it will scale for your needs, you do not need any library.
Just ensure that your future needs are going to be met - complex xml creation is better done using libraries - some of which come in very simple flavors too.

The only time I've done something like this in production code was when a colleague and I built a pre-processor so that we could embed XML fragments from other files into a larger XML. On load we would first parse these embeds (file references in XML comment strings) and replace them with the actual fragments they referenced. Then we would pass the combined result on to the XML parser.

You don't have to use a library to parse XML, but check out the question "What considerations should be made before reinventing the wheel?" before you start writing your own code for parsing/generating XML.

No - especially for generating (for parsing I would be less inclined to hand-roll it, as input text can always surprise you). I think it's fine - but be prepared to shift to a library should you find yourself spending more than a few minutes maintaining your own code.

I don't think that using the DOM XML API which comes with the JDK is difficult. It's easy to create Element nodes, attributes, etc., and it is just as easy to convert strings to a DOM document or DOM documents into a String.
From the first page Google finds from Spain (a Spanish XML example):
// coderror and msgerror are assumed to be fields of the enclosing class.
// Needs java.io.*, javax.xml.parsers.*, javax.xml.transform.*,
// javax.xml.transform.dom.DOMSource, javax.xml.transform.stream.StreamResult,
// org.w3c.dom.Document and org.xml.sax.InputSource.

// Serializes a DOM document to a String; returns null on failure.
public String DOM2String(Document doc)
{
    Transformer transformer;
    try {
        transformer = TransformerFactory.newInstance().newTransformer();
    } catch (javax.xml.transform.TransformerConfigurationException error) {
        coderror = 123;
        msgerror = error.getMessage();
        return null;
    }
    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    Result result = new StreamResult(writer);
    try {
        transformer.transform(source, result);
    } catch (javax.xml.transform.TransformerException error) {
        coderror = 123;
        msgerror = error.getMessage();
        return null;
    }
    return writer.toString();
}

// Parses a String into a DOM document; returns null on failure.
public Document string2DOM(String s)
{
    DocumentBuilder builder;
    try {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    } catch (javax.xml.parsers.ParserConfigurationException error) {
        coderror = 10;
        msgerror = "Error creating factory in string2DOM " + error.getMessage();
        return null;
    }
    try {
        // Reading characters avoids depending on the platform default byte encoding.
        return builder.parse(new InputSource(new StringReader(s)));
    } catch (org.xml.sax.SAXException error) {
        coderror = 10;
        msgerror = "SAX parse error in string2DOM " + error.getMessage();
        return null;
    } catch (IOException error) {
        coderror = 10;
        msgerror = "I/O error reading the string in string2DOM " + error.getMessage();
        return null;
    }
}
