TASK: I have an existing XML document (UTF-8) that uses XML namespaces and XML Schema. I need to locate a particular element, append content to it (content that also uses namespace prefixes), and then write the document out again.
Which XML parser library is best suited to this task?
I've seen a previous thread (Best XML parser for Java) but was not sure whether dom4j or JDOM handles namespaces and XML Schema well and has good support for UTF-8 characters.
Some parsers that seem suited to this task:
JDom
Dom4J
XOM
Woodstox
Any idea which one is the best? :-) I use JDK 6 and would prefer NOT to use the built-in SAX/DOM facilities for this job because they require too much code.
It would help to have some examples of doing such a task.
Using the JDK's built-in DOM API (JAXP), taking an InputStream and making it a Document:
InputStream inputStream = (InputStream)httpURLConnection.getContent();
DocumentBuilderFactory docbf = DocumentBuilderFactory.newInstance();
docbf.setNamespaceAware(true);
DocumentBuilder docbuilder = docbf.newDocumentBuilder();
Document document = docbuilder.parse(inputStream, baseUrl);
At that point, you have the XML in a Java object. Done. Easy.
You can either use the document object and the Java API to just walk through it, or also use XPath, which I find easier (once I learned it).
Build an XPath object, which takes a bit:
public static XPath buildXPath() {
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
xpath.setNamespaceContext(new AtomNamespaceContext());
return xpath;
}
public class AtomNamespaceContext implements NamespaceContext {
public String getNamespaceURI(String prefix) {
if (prefix == null)
throw new NullPointerException("Null prefix");
else if ("a".equals(prefix))
return "http://www.w3.org/2005/Atom";
else if ("app".equals(prefix))
return "http://www.w3.org/2007/app";
else if ("os".equals(prefix))
return "http://a9.com/-/spec/opensearch/1.1/";
else if ("x".equals(prefix))
return "http://www.w3.org/1999/xhtml";
else if ("xml".equals(prefix))
return XMLConstants.XML_NS_URI;
return XMLConstants.NULL_NS_URI;
}
// This method isn't necessary for XPath processing.
public String getPrefix(String uri) {
throw new UnsupportedOperationException();
}
// This method isn't necessary for XPath processing either.
public Iterator getPrefixes(String uri) {
throw new UnsupportedOperationException();
}
}
Then just use it, which (thankfully) doesn't take much time at all:
return Integer.parseInt(xpath.evaluate("/a:feed/os:totalResults/text()", document));
Use XSLT. Seriously. This is a perfect job for it. Just use a copy template to copy everything as is except for the place where you need to add more xml. You can even add the XML by actually writing XML instead of DOM manipulation.
This is the copy template:
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
I know a lot of people hate XSLT, but this is a task where it would really shine and take almost no code. Also, you could just use what's in the JDK.
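A minimal sketch of that approach using only the JDK's javax.xml.transform API. The sample document, the ex prefix, the namespace URI http://example.com/ns, and the element names ex:items and ex:item are assumptions made up for illustration; substitute the names from your own document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class AppendWithXslt {
    // Copy template plus one override: copy ex:items as-is, then append a new ex:item.
    // The ex prefix and its URI are hypothetical placeholders.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:ex='http://example.com/ns'>"
      + "<xsl:template match='node() | @*'>"
      + "<xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='ex:items'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@* | node()'/>"
      + "<ex:item>new</ex:item>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the result as a string.
    static String append(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ex:root xmlns:ex='http://example.com/ns'>"
                   + "<ex:items><ex:item>old</ex:item></ex:items></ex:root>";
        System.out.println(append(xml));
    }
}
```

The more specific ex:items template takes precedence over the copy template, so everything else passes through unchanged. For files, StreamSource and StreamResult also accept File or stream arguments.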
Since writing too much code is the main issue for you, you might want to consider jOOX:
http://code.google.com/p/joox/
I have created jOOX to be a port of jQuery to Java. The underlying technology is Java's standard DOM. Some sample code:
// Find the order at index 4 and add an element "paid"
$(document).find("orders").children().eq(4)
.append("<paid>true</paid>");
// Find those orders that are paid and flag them as "settled"
$(document).find("orders").children().find("paid")
.after("<settled>true</settled>");
// Add a complex element
$(document).find("orders").append(
$("order", $("date", "2011-08-14"),
$("amount", "155"),
$("paid", "false"),
$("settled", "false")).attr("id", "13"));
Note: Namespaces are not yet explicitly supported, but you can work around that
It sounds like you can write an xslt style sheet to do what you want.
After researching on Google I have not found a working solution for this.
The 'Maven by Example' ebook uses the Yahoo weather example. Unfortunately it looks like Yahoo changed their interface. I tried to adapt the Java code for this, but I get this annoying exception:
exec-maven-plugin:1.5.0:java
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.5.0:java
Caused by: org.dom4j.XPathException:
Exception occurred evaluting XPath: /query/results/channel/yweather:location/@city.
Exception: XPath expression uses unbound namespace prefix yweather
The xml line itself is:
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="1" yahoo:created="2017-02-13T10:57:34Z" yahoo:lang="en-US">
<results>
<channel>
...
<yweather:location xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0" city="Theale" country="United Kingdom" region=" England"/>
The entire XML can be generated from :
https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%3D91731537
My code (as per the 'MAVEN By Example' ebook, xpath and url modified for the changed Yahoo):
public Weather parse(InputStream inputStream) throws Exception {
Weather weather = new Weather();
SAXReader xmlReader = createXmlReader();
Document doc = xmlReader.read( inputStream );
weather.setCity(doc.valueOf("//yweather:location/@city"));
// and several more, such as setCountry, setTemp
return weather;
}
(I'm not an XPath expert, so I also tried
/query/results/channel/item/yweather:location/@city
just in case, with the same result.)
The retrieve method:
public InputStream retrieve(String woeid) throws Exception {
String url = "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%3D"+woeid; // eg 91731537
URLConnection conn = new URL(url).openConnection();
return conn.getInputStream();
}
and the Weather class is just a set of getters and setters.
When I try this in this XML tester, it works just fine, but that may be the effect of XPath 2.0 vs. Java's XPath 1.0.
When you evaluate your XPath //yweather:location/@city, the XPath processor has no knowledge of which namespace the yweather prefix is bound to. You'll need to provide that information. Now, you might think "the info is right there in the document!" and you'd be right. But prefixes are just a sort of stand-in (like a variable) for the actual namespace. A namespace can be bound to any prefix you like that follows the prefix naming rules, and can be bound to multiple prefixes as well. It is just like a variable name in Java: the name referring to an object is in itself of no importance, and multiple variables can refer to the same object.
For example, if you used the XPath //yw:location/@city with the prefix yw bound to the namespace http://xml.weather.yahoo.com/ns/rss/1.0, it would still work the same.
I suggest you use class org.dom4j.xpath.DefaultXPath instead of calling valueOf. Create an instance of it and initialize the namespace context. There's a method setNamespaceURIs that takes a Map from prefixes to namespaces and lets you make the bindings. Bind the above weather namespace (the actual URI) to some prefix of your choosing (may be yweather, but can be anything else you want to use in your actual XPath expression) and then use the instance to evaluate it over the document.
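To illustrate that the prefix is arbitrary, here is a small sketch of the same idea using the JDK's javax.xml.xpath API rather than dom4j (a swap made so the example runs without extra libraries). The class name, the trimmed-down sample document, and the yw prefix are assumptions for illustration; only the namespace URI comes from the question.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PrefixBindingDemo {
    static final String WEATHER_NS = "http://xml.weather.yahoo.com/ns/rss/1.0";

    // Hypothetical, trimmed-down version of the Yahoo response.
    static final String SAMPLE =
        "<query><results><channel>"
      + "<yweather:location xmlns:yweather='" + WEATHER_NS + "' city='Theale'/>"
      + "</channel></results></query>";

    // Evaluates the expression with the given prefix bound to the weather namespace URI.
    static String evaluate(String xml, final String prefix, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? WEATHER_NS : XMLConstants.NULL_NS_URI;
                }
                // The reverse lookups are left unimplemented in this sketch.
                public String getPrefix(String uri) {
                    throw new UnsupportedOperationException();
                }
                public Iterator<String> getPrefixes(String uri) {
                    throw new UnsupportedOperationException();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both calls print the same city, although the document only uses "yweather".
        System.out.println(evaluate(SAMPLE, "yw", "//yw:location/@city"));
        System.out.println(evaluate(SAMPLE, "yweather", "//yweather:location/@city"));
    }
}
```

With dom4j the binding step would go through setNamespaceURIs instead, but the principle is the same: the XPath side declares its own prefix-to-URI mapping, independent of the prefixes in the document.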
Here's an answer I gave to some question that goes more in-depth about what namespaces and their prefixes really are: https://stackoverflow.com/a/8231272/630136
EDIT: the online XPath tester you used probably does some behind-the-scenes magic to extract the namespaces and their prefixes from the given document and bind those in the XPath processor.
If you look at their sample XML and adjust it like this...
<root xmlns:foo="http://www.foo.org/" xmlns:bar="http://www.bar.org">
<actors>
<actor id="1">Christian Bale</actor>
<actor id="2">Liam Neeson</actor>
<actor id="3">Michael Caine</actor>
</actors>
<foo:singers xmlns:test="http://www.foo.org/">
<test:singer id="4">Tom Waits</test:singer>
<foo:singer id="5">B.B. King</foo:singer>
<foo:singer id="6">Ray Charles</foo:singer>
</foo:singers>
</root>
the XML is semantically equivalent, because the test prefix is bound to the same namespace as foo. The XPath //foo:singer/@id still returns all the right results, so the tool is smart about it. However, it doesn't know what to do with this XML...
<root xmlns:foo="http://www.foo.org/" xmlns:bar="http://www.bar.org">
<actors>
<foo:actor id="1">Christian Bale</foo:actor>
<actor id="2">Liam Neeson</actor>
<actor id="3">Michael Caine</actor>
</actors>
<foo:singers xmlns:test="http://www.foo.org/" xmlns:foo="http://www.bar.org">
<test:singer id="4">Tom Waits</test:singer>
<foo:singer id="5">B.B. King</foo:singer>
<foo:singer id="6">Ray Charles</foo:singer>
</foo:singers>
</root>
and XPath //foo:*/@id. The prefix foo is bound to a different namespace in the singers element scope, and now it only returns the ids 5 and 6. Contrast it with this XPath, which doesn't use a prefix but the namespace-uri() function: //*[namespace-uri()='http://www.foo.org/']/@id
That last one returns ids 1 and 4, as expected.
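The namespace-uri() query can be checked with the JDK's own XPath engine. The sketch below is an illustration only (the class and method names are made up); it runs the prefix-free expression against the second sample document, so no NamespaceContext is needed at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceUriQuery {
    // The second sample document from above, with foo re-bound inside foo:singers.
    static final String SAMPLE =
        "<root xmlns:foo='http://www.foo.org/' xmlns:bar='http://www.bar.org'>"
      + "<actors>"
      + "<foo:actor id='1'>Christian Bale</foo:actor>"
      + "<actor id='2'>Liam Neeson</actor>"
      + "<actor id='3'>Michael Caine</actor>"
      + "</actors>"
      + "<foo:singers xmlns:test='http://www.foo.org/' xmlns:foo='http://www.bar.org'>"
      + "<test:singer id='4'>Tom Waits</test:singer>"
      + "<foo:singer id='5'>B.B. King</foo:singer>"
      + "<foo:singer id='6'>Ray Charles</foo:singer>"
      + "</foo:singers>"
      + "</root>";

    // Returns the id attributes of all elements in the given namespace, in document order.
    static List<String> idsInNamespace(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
            String expression = "//*[namespace-uri()='" + namespaceUri + "']/@id";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idsInNamespace(SAMPLE, "http://www.foo.org/")); // [1, 4]
        System.out.println(idsInNamespace(SAMPLE, "http://www.bar.org"));  // [5, 6]
    }
}
```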
I found the error; it was my unfamiliarity with namespaces. The createXmlReader() used in my example above is a method that sets the correct namespace, except that I forgot to change it after Yahoo changed the XML. After carefully re-reading the Maven by Example documentation and the generated error, and comparing them with the detailed answer given here, it suddenly clicked. The updated code (for the benefit of anyone trying the same example):
private SAXReader createXmlReader() {
Map<String,String> uris = new HashMap<String,String>();
uris.put( "yweather", "http://xml.weather.yahoo.com/ns/rss/1.0" );
DocumentFactory factory = new DocumentFactory();
factory.setXPathNamespaceURIs( uris );
SAXReader xmlReader = new SAXReader();
xmlReader.setDocumentFactory( factory );
return xmlReader;
}
The only change is in the uris.put() line.
Originally the prefix was "y"; now it is "yweather".
I have to read from large XML files of roughly 500 MB each. A batch run typically processes 500 such files. I have to extract text nodes from them and, at the same time, extract XML nodes. I used XPath over DOM in Java for ease of use, but that doesn't work because of memory issues, as I have limited resources.
I intend to use SAX or StAX in Java now. The text nodes can be extracted easily, but I don't know how to extract XML nodes using SAX.
A sample:
<?xml version="1.0"?>
<Library>
<Book name = "ABC">
<Author>John</Author>
<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint> </PrintingCompanyDT>
</Book>
<Book name = "123">
<Author>Mason</Author>
<PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint> </PrintingCompanyDT>
</Book>
</Library>
The expected result:
1)Book: ABC:
Author:John
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>Sam</Printer>
<Printmachine>Laser</Printmachine>
<AssocPrint>Oreilly</AssocPrint>
</PrintingCompanyDT>
2) Book: 123
Author : Mason
PrintCompany Detail XML:
<PrintingCompanyDT>
<Printer>kelly</Printer>
<Printmachine>DOTPrint</Printmachine>
<AssocPrint>Oxford</AssocPrint>
</PrintingCompanyDT>
If I try the regular way of appending characters in the public void characters(char ch[], int start, int length) method,
I get the following:
1)Book: ABC:
Author:John
PrintCompany Detail XML :
Sam
Laser
Oreilly
that is, only the text content and whitespace, without the tags.
Can somebody suggest how to extract an XML node as-is from an XML file with a SAX or StAX parser in Java?
I'd be tempted to use XOM for this sort of task rather than SAX or StAX directly. XOM is a tree-based representation similar to DOM or JDOM but it has support for processing XML "twigs" in a kind of semi-streaming fashion, ideal for your kind of case where you have many similar elements that can be processed independently of one another. Also every Node has a toXML method that prints the node as XML.
import java.io.File;
import nu.xom.*;
public class LibraryProcessor extends NodeFactory {
private Nodes empty = new Nodes();
private int bookNum = 0;
/** Called for each closing tag in the XML */
public Nodes finishMakingElement(Element element) {
if("Book".equals(element.getLocalName())) {
bookNum++;
// process the complete Book element ...
processBook(element);
// ... and throw it away
return empty;
} else {
// process other elements (except Book) in the normal way
return super.finishMakingElement(element);
}
}
private void processBook(Element book) {
System.out.println(bookNum + ": " +
book.getAttributeValue("name"));
System.out.println("Author: " +
book.getFirstChildElement("Author").getValue());
System.out.println("PrintCompany Detail XML: " +
book.getFirstChildElement("PrintingCompanyDT").toXML());
}
public static void main(String[] args) throws Exception {
Builder builder = new Builder(new LibraryProcessor());
builder.build(new File(args[0]));
}
}
This will work its way through the XML document, calling processBook once for each Book element in turn. Within processBook you have access to the whole Book XML tree as XOM nodes, but without having to load the entire file into memory in one go - the best of both worlds. The "Factories, Filters, Subclassing, and Streaming" section of the XOM tutorial has more detail on this technique.
This example just shows the most basic bits of the XOM API, but it also provides powerful XPath support if you need to do more complex processing. For example, you can directly access the Printmachine element within processBook using
Element machine = (Element)book.query("PrintingCompanyDT/Printmachine").get(0);
or if the structure is not so regular, for example if PrintingCompanyDT is sometimes a direct child of Book and sometimes deeper (e.g. a grandchild) then you can use a query like
Element printingCompanyDT = (Element)book.query(".//PrintingCompanyDT").get(0);
(// being the XPath notation for finding descendants at any level, as opposed to / which looks only for direct children).
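If adding XOM as a dependency is not an option, a similar per-element idea can be sketched with the JDK's StAX event API: when the start tag of the wanted element appears, its events are copied to an XMLEventWriter until the matching end tag. The class name, the method name, and the single-line version of the sample are assumptions for illustration; the exact serialization (quoting, empty-element form) depends on the StAX implementation.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

class SubtreeExtractor {
    static final String SAMPLE =
        "<Library>"
      + "<Book name='ABC'><Author>John</Author>"
      + "<PrintingCompanyDT><Printer>Sam</Printer><Printmachine>Laser</Printmachine>"
      + "<AssocPrint>Oreilly</AssocPrint></PrintingCompanyDT></Book>"
      + "<Book name='123'><Author>Mason</Author>"
      + "<PrintingCompanyDT><Printer>kelly</Printer><Printmachine>DOTPrint</Printmachine>"
      + "<AssocPrint>Oxford</AssocPrint></PrintingCompanyDT></Book>"
      + "</Library>";

    // Streams through the input and returns each element with the given
    // local name serialized back to XML text.
    static List<String> extract(Reader input, String localName) {
        List<String> fragments = new ArrayList<String>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(input);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && localName.equals(event.asStartElement().getName().getLocalPart())) {
                    StringWriter buffer = new StringWriter();
                    XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
                    int depth = 0;
                    // Copy events until the end tag that closes the element we started on.
                    while (true) {
                        writer.add(event);
                        if (event.isStartElement()) {
                            depth++;
                        } else if (event.isEndElement() && --depth == 0) {
                            break;
                        }
                        event = reader.nextEvent();
                    }
                    writer.flush();
                    writer.close();
                    fragments.add(buffer.toString());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return fragments;
    }

    public static void main(String[] args) {
        for (String fragment : extract(new StringReader(SAMPLE), "PrintingCompanyDT")) {
            System.out.println(fragment);
        }
    }
}
```

Only one fragment is buffered at a time while reading. The list is there to keep the demonstration short; for 500 MB inputs you would more likely handle each fragment as soon as it is complete instead of collecting them all.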
Is there a way to set Java's XPath to have a default namespace prefix for expressions? For example, instead of /html:html/html:head/html:title/text(), the query could be /html/head/title/text().
While using the namespace prefix works, there has to be a more elegant way.
Sample code snippet of what I'm doing now:
Node node = ... // DOM of a HTML document
XPath xpath = XPathFactory.newInstance().newXPath();
// set to a NamespaceContext that simply maps the prefix "html"
// to the namespace URI "http://www.w3.org/1999/xhtml"
xpath.setNamespaceContext(new HTMLNameSpace());
String expression = "/html:html/html:head/html:title/text()";
String value = xpath.evaluate(expression, node);
Unfortunately, no. There was some talk about defining a default namespace for JXPath a few years ago, but a quick look at the latest docs doesn't indicate that anything happened. You might want to spend some more time looking through the docs, though.
One thing that you could do, if you really don't care about namespaces, is to parse the document without them. Simply omit the call that you're currently making to DocumentBuilderFactory.setNamespaceAware().
Also, note that your prefix can be anything you want; it doesn't have to match the prefix in the instance document. So you could use h rather than html, and minimize the visual clutter of the prefix.
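The effect of dropping namespace awareness can be sketched with the JDK alone. The class name and the tiny XHTML sample are assumptions for illustration, and the behaviour shown is that of the JDK's bundled parser and XPath engine; other implementations may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceAwarenessDemo {
    // Hypothetical minimal XHTML document with a default namespace.
    static final String XHTML =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
      + "<head><title>Hello</title></head><body/></html>";

    // Evaluates the unprefixed query against a document parsed with or
    // without namespace awareness.
    static String title(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(namespaceAware);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                               .evaluate("/html/head/title/text()", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without namespace awareness the unprefixed names match.
        System.out.println("not aware: [" + title(XHTML, false) + "]");
        // With namespace awareness the same query selects nothing.
        System.out.println("aware:     [" + title(XHTML, true) + "]");
    }
}
```

The trade-off is that a document parsed this way no longer distinguishes elements by namespace, so this only suits cases where the namespaces really do not matter.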
I haven't actually tried this, but according to the NamespaceContext documentation, the namespace bound to the prefix "" (empty string) is considered to be the default namespace.
I was a little too quick on that one. The XPath evaluator does not invoke the NamespaceContext to resolve the "" prefix if no prefix is used at all in the XPath expression "/html/head/title/text()". I'm now getting into XML details that I am not 100% sure about, but an expression like "/:html/:head/:title/text()" works with Sun JDK 1.6.0_16, and the NamespaceContext is asked to resolve an empty prefix (""). Is this really correct and expected behaviour, or a bug in Xalan?
I know this question is old, but I just spent 3 hours researching this problem and @kdgregory's answer helped me out a lot. I just wanted to show exactly what I did, using kdgregory's answer as a guide.
The problem is that XPath in Java doesn't even look for a namespace if you don't have a prefix in your query; therefore, to map a query to a specific namespace, you have to add a prefix to the query. I used an arbitrary prefix to map to the namespace URI. For this example I will use the OP's namespace and query and the prefix abc. Your new expression would look like this:
String expression = "/abc:html/abc:head/abc:title/text()";
Then do the following
1) Make sure your document is set to namespace aware.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
2) Implement a NamespaceContext that will resolve your prefix. This one I took from some other post on SO and modified a bit.
public class NamespaceResolver implements NamespaceContext {
private final Document document;
public NamespaceResolver(Document document) {
this.document = document;
}
public String getNamespaceURI(String prefix) {
if(prefix.equals("abc")) {
// here is where you set your namespace
return "http://www.w3.org/1999/xhtml";
} else if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return document.lookupNamespaceURI(null);
} else {
return document.lookupNamespaceURI(prefix);
}
}
public String getPrefix(String namespaceURI) {
return document.lookupPrefix(namespaceURI);
}
@SuppressWarnings("rawtypes")
public Iterator getPrefixes(String namespaceURI) {
// not implemented
return null;
}
}
3) When creating your XPath object set your NamespaceContext.
xPath.setNamespaceContext(new NamespaceResolver(document));
Now no matter what the actual schema prefix is you can use your own prefix that will map to the proper schema. So your full code using the class above would look something like this.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document document = factory.newDocumentBuilder().parse(sourceDocFile);
XPathFactory xPFactory = XPathFactory.newInstance();
XPath xPath = xPFactory.newXPath();
xPath.setNamespaceContext(new NamespaceResolver(document));
String expression = "/abc:html/abc:head/abc:title/text()";
String value = xPath.evaluate(expression, document);
I'll point out now that I'm new to using Saxon. I've tried following the docs and examples in the package, but I'm just not having any luck with this problem.
Basically, I'm trying to do some XML processing in Java using Saxon v8. In order to get something working, I took one of the sample files included in the package and modified it to my needs. It works as long as I'm not using namespaces, and that is my question: how can I get around the namespace problem? I don't really care to use the namespace, but it exists in my XML, so I either have to use it or ignore it. Either solution is fine.
Anyway, here is my starter code. It doesn't do anything but take an XPath query and try to run it against the hard-coded XML doc.
public static void main(String[] args) {
String query = args[0];
File XMLStream=null;
String xmlFileName="doc.xml";
OutputStream destStream=System.out;
XQueryExpression exp=null;
Configuration C=new Configuration();
C.setSchemaValidation(false);
C.setValidation(false);
StaticQueryContext SQC=new StaticQueryContext(C);
DynamicQueryContext DQC=new DynamicQueryContext(C);
QueryProcessor processor = new QueryProcessor(SQC);
Properties props=new Properties();
try{
exp=processor.compileQuery(query);
XMLStream=new File(xmlFileName);
InputSource XMLSource=new InputSource(XMLStream.toURI().toString());
SAXSource SAXs=new SAXSource(XMLSource);
DocumentInfo DI=SQC.buildDocument(SAXs);
DQC.setContextNode(DI);
SequenceIterator iter = exp.iterator(DQC);
while(true){
Item i = iter.next();
if(i != null){
System.out.println(i.getStringValue());
}
else break;
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
An example XML file is here...
<?xml version="1.0"?>
<ns1:animal xmlns:ns1="http://my.catservice.org/">
<cat>
<catId>8889</catId>
<fedStatus>true</fedStatus>
</cat>
</ns1:animal>
If I run this with a query including the namespace, I get an error. For example:
/ns1:animal/cat/ gives the error: "Prefix ns1 has not been declared".
If I remove the ns1: from the query, it gives me nothing. If I doctor the xml to remove the "ns1:" prepended to "animal" I can run the query /animal/cat/ with success.
Any help would be greatly appreciated. Thanks.
The error message correctly points out that your XPath expression does not indicate what the namespace prefix "ns1" means (binds to). Just because the document being queried happens to declare a binding for "ns1" does not mean that binding is what should be used: in XML, it is the namespace URI that matters, and prefixes are just convenient shortcuts for the real thing.
So: how do you define the binding? There are two general ways: either provide a context that can resolve the prefix, or embed the actual URI within the XPath expression.
Regarding the first approach, this email from the Saxon author mentions the JAXP method XPath.setNamespaceContext(); similarly, the Jaxen XPath processor FAQ has some sample code that could help.
That's not very convenient, as you have to implement NamespaceContext, but once you have an implementation you'll be set.
So the notation approach... let's see: Top Ten Tips to Using XPath and XPointer shows this example:
to match element declared with namespace like:
xmlns:book="http://my.example.org/namespaces/book"
you use XPath name like:
{http://my.example.org/namespaces/book}section
which hopefully is understood by Saxon (or Jaxen).
Finally, I would recommend upgrading to Saxon 9 if possible, if you have any trouble using one of the above solutions.
If you want to have something working out of the box, you can check out embedding-xquery-in-java. It is a GitHub project that uses Saxon to evaluate some sample XQuery expressions.
Regards
I want to use JDOM to read in an XML file, then use XPath to extract data from the JDOM Document. It creates the Document object fine, but when I use XPath to query the Document for a List of elements, I get nothing.
My XML document has a default namespace defined in the root element. The funny thing is, when I remove the default namespace, it successfully runs the XPath query and returns the elements I want. What else must I do to get my XPath query to return results?
XML:
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.foo.com">
<dvd id="A">
<title>Lord of the Rings: The Fellowship of the Ring</title>
<length>178</length>
<actor>Ian Holm</actor>
<actor>Elijah Wood</actor>
<actor>Ian McKellen</actor>
</dvd>
<dvd id="B">
<title>The Matrix</title>
<length>136</length>
<actor>Keanu Reeves</actor>
<actor>Laurence Fishburne</actor>
</dvd>
</collection>
Java:
public static void main(String args[]) throws Exception {
SAXBuilder builder = new SAXBuilder();
Document d = builder.build("xpath.xml");
XPath xpath = XPath.newInstance("collection/dvd");
xpath.addNamespace(d.getRootElement().getNamespace());
System.out.println(xpath.selectNodes(d));
}
XPath 1.0 doesn't support the concept of a default namespace (XPath 2.0 does).
Any unprefixed tag is always assumed to be part of the no-name namespace.
When using XPath 1.0 you need something like this:
public static void main(String args[]) throws Exception {
SAXBuilder builder = new SAXBuilder();
Document d = builder.build("xpath.xml");
XPath xpath = XPath.newInstance("x:collection/x:dvd");
xpath.addNamespace("x", d.getRootElement().getNamespaceURI());
System.out.println(xpath.selectNodes(d));
}
I had a similar problem, but mine was that I had a mixture of XML inputs, some of which had a namespace defined and others that didn't. To simplify my problem I ran the following JDOM snippet after loading the document.
for (Element el : doc.getRootElement().getDescendants(new ElementFilter())) {
if (el.getNamespace() != null) el.setNamespace(null);
}
After removing all the namespaces I was able to use simple getChild("elname") style navigation or simple XPath queries.
I wouldn't recommend this technique as a general solution, but in my case it was definitely useful.
You can also do the following
/*[local-name() = 'collection']/*[local-name() = 'dvd']
Here is a list of useful XPath queries.
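As a quick check of the local-name() approach, here is a sketch that runs it against the DVD collection from the question. It uses the JDK's javax.xml.xpath engine rather than JDOM's XPath class (a swap made so the example needs no extra libraries); the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameQuery {
    // Shortened version of the collection document with its default namespace.
    static final String SAMPLE =
        "<collection xmlns='http://www.foo.com'>"
      + "<dvd id='A'><title>Lord of the Rings: The Fellowship of the Ring</title></dvd>"
      + "<dvd id='B'><title>The Matrix</title></dvd>"
      + "</collection>";

    // Selects the dvd elements by local name only, ignoring their namespace,
    // and returns their id attributes in document order.
    static List<String> dvdIds(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
            String expression =
                "/*[local-name() = 'collection']/*[local-name() = 'dvd']/@id";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dvdIds(SAMPLE)); // [A, B]
    }
}
```

Because local-name() ignores the namespace entirely, it would also match same-named elements from any other namespace, which is worth keeping in mind for mixed-vocabulary documents.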