I'm trying to learn some XML parsing, and I'm getting the hang of it, but for some reason I seem to have to tack "text()" onto the end of each query; otherwise I get null values back. I don't really understand what the "text()" ending does, but I know it's not necessary, and I'm wondering why I can't omit it. Please help! Here is my code:
import org.w3c.dom.*;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import java.io.IOException;
import org.xml.sax.SAXException;
public class ParseClass
{
public static void main(String[] args)
throws ParserConfigurationException, SAXException,
IOException, XPathExpressionException
{
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("C:\\Users\\Brandon\\Job\\XPath\\XPath_Sample_Stuff\\catalog.xml");
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("/catalog/book[author='Thurman, Paula']/title/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++)
{
System.out.println(nodes.item(i).getNodeValue());
}
}
}
PS. In case you didn't notice, I'm using XPath and DOM for my parsing.
You're calling getNodeValue on your result, and as the docs show (see the table), it is null for a node of type Element. When you use text(), the returned set contains nodes of type Text instead, so you get the results you wanted (i.e. the contents of the title element instead of the element itself).
I'd also suggest seeing this for more info on the usage of text() in xpath.
And if you want to extract the text from your element, directly, you could use getTextContent instead of getNodeValue:
// Will work for both element and text nodes
System.out.println(nodes.item(i).getTextContent());
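To make the difference concrete, here is a minimal sketch that runs the same query with and without text(). The inline XML is a hypothetical stand-in for catalog.xml (the real file is not shown in the question), and the class and helper names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeValueDemo {
    // Hypothetical stand-in for catalog.xml; the title is invented for this sketch.
    static final String XML =
        "<catalog><book><author>Thurman, Paula</author>"
        + "<title>Splish Splash</title></book></catalog>";

    // Evaluates an XPath expression against the sample and returns the first match.
    static Node first(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node element = first("/catalog/book[author='Thurman, Paula']/title");
        Node text = first("/catalog/book[author='Thurman, Paula']/title/text()");
        System.out.println(element.getNodeValue());   // null: Element nodes have no node value
        System.out.println(text.getNodeValue());      // Splish Splash
        System.out.println(element.getTextContent()); // Splish Splash
    }
}
```

So the query without text() is fine as long as the result is read with getTextContent() rather than getNodeValue().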
First of all, your XPath expression is only correct if author is a child element of book. If author is an attribute, it has to be written with @, so the XPath would be /catalog/book[@author='Thurman, Paula']/title/text().
/catalog/book[@author='Thurman, Paula']/title will match the <title> node from your XML, whereas /catalog/book[@author='Thurman, Paula']/title/text() will match the text node of <title>, i.e. if the title node were something like <title>The Godfather</title>, the latter expression would match The Godfather.
A suggestion: don't use DOM. There are many tree representations of XML available in the Java world (JDOM, XOM, DOM4J) that are vastly more usable than DOM. DOM is full of gotchas like the one you just encountered, where getNodeValue() on an element returns null. The only reasons anyone uses DOM are that (a) it came originally from W3C, and (b) it found its way into the JDK. But that all happened an awfully long time ago, and people have learnt from its design mistakes.
Related
I'm parsing an XML string to generate nodes. Sometimes the tag comes with a namespace and sometimes without. How can I ignore this and
I tried it the following way, but it didn't work.
//NodeList idDetails = doc.getDocumentElement().getElementsByTagNameNS("*", "details");
NodeList idDetails = doc.getElementsByTagName("ns2:details");
Any ideas on how to do it?
The first one should work:
NodeList nodes = doc.getDocumentElement().getElementsByTagNameNS("*", str);
But you also have to call setNamespaceAware(true) on your DocumentBuilderFactory for this to work; otherwise namespaces will not be detected.
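As a rough illustration of why the factory setting matters, the sketch below parses the same input twice and counts the matches for the wildcard lookup. The XML, the namespace URI urn:example, and the class name are hypothetical; the zero result for the non-namespace-aware parse reflects that elements created without namespace processing have no local name to match against.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WildcardNamespaceDemo {
    // Hypothetical input: the same local name once with and once without a namespace.
    static final String XML =
        "<root><ns2:details xmlns:ns2=\"urn:example\">a</ns2:details>"
        + "<details>b</details></root>";

    // Counts elements whose local name is "details", in any namespace or none.
    static int countDetails(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement()
                .getElementsByTagNameNS("*", "details").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countDetails(true));  // 2: both elements match
        System.out.println(countDetails(false)); // 0: no local names without namespace processing
    }
}
```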
I am writing a Java program in which I parse an input XML file that looks like this:
...
<ems:DeterminationRequest>
<ems:MessageInformation>
<ns17:MessageID xmlns:ns17="http://www.calheers.ca.gov/EHITSAWSInterfaceCommonSchema">1000225404</ns17:MessageID>
<ns17:MessageTimeStamp xmlns:ns17="http://www.calheers.ca.gov/EHITSAWSInterfaceCommonSchema">2015-07-28T01:17:04</ns17:MessageTimeStamp>
<ns17:SendingSystem xmlns:ns17="http://www.calheers.ca.gov/EHITSAWSInterfaceCommonSchema">CH</ns17:SendingSystem>
<ns17:ReceivingSystem xmlns:ns17="http://www.calheers.ca.gov/EHITSAWSInterfaceCommonSchema">LD</ns17:ReceivingSystem>
<ns17:ServicingFipsCountyCode xmlns:ns17="http://www.calheers.ca.gov/EHITSAWSInterfaceCommonSchema">037</ns17:ServicingFipsCountyCode>
</ems:MessageInformation>
</ems:DeterminationRequest>
...
Now I am trying to get the node "ems:MessageInformation" without considering the namespace prefix "ems". So I tried the following lines of code:
Document doc = db.parse(new FileInputStream(new File("D:\\test.xml")));
Node element = doc.getDocumentElement().getElementsByTagNameNS("*","MessageInformation").item(0);
System.out.println(element.getNodeName());
But it throws a NullPointerException because the call does not find the required node. I went through this link for reference. Can someone tell me what I am doing wrong here?
This is odd/buggy behaviour in the NodeList implementation returned by
doc.getDocumentElement().getElementsByTagNameNS("*","MessageInformation")
It allows you to access item(0) but returns a null object.
(If you are using a current JDK the NodeList implementation is com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl which lazily loads its items and shows this buggy behaviour).
To prevent the NullPointerException you should first check if the returned NodeList has a length > 0:
NodeList result = doc.getDocumentElement().getElementsByTagNameNS("*","MessageInformation");
if (result.getLength() > 0) {
Node element = (Element)result.item(0);
...
}
Then you need to find out why getElementsByTagNameNS does not return the element.
One possible reason could be that you parsed the document without namespace support. The consequence is that the dom elements don't have namespace information and getElementsByTagNameNS fails.
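To turn on namespace support, call this on the factory before creating the DocumentBuilder:
factory.setNamespaceAware(true); // factory: the DocumentBuilderFactory instance that creates db
Alternatively, without namespace support, you could search for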
NodeList nl = doc.getDocumentElement().getElementsByTagName("ems:MessageInformation");
I want a small subtree out of an XML file (100 MB) and need to turn off DTD validation, but I cannot find any solution for that.
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//HEADER");
Node node = (Node) expr.evaluate(new InputSource(new FileReader(file)), XPathConstants.NODE);
I tried to use DocumentBuilder and turn off the DTD validation, but that's so slow.
Thanks,
Joo
The reason why it's so slow is because you are forcing a full scan of all the nodes because your XPath criterion is too vague: //HEADER means that the XPath engine will scan each and every node of your 100MB to select the ones where the node name is HEADER. If you can make the XPath expression more specific, you should see dramatic improvements.
Other than that, the code below is something I had to do to prevent DTD validation in the past.
It forces Xerces as the SAX parser and explicitly sets a number of Xerces-specific features. But again, this will probably not significantly affect the response time.
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.xerces.jaxp.SAXParserFactoryImpl;
import org.xml.sax.InputSource;
[...]
private static SAXParserFactory spf ;
private BillCooker() throws Exception {
System.setProperty("javax.xml.parsers.SAXParserFactory", "org.apache.xerces.jaxp.SAXParserFactoryImpl" ) ;
spf = SAXParserFactoryImpl.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I trimmed it to leave only the lines relevant to validation.
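For comparison, here is a small self-contained sketch of the same idea using DocumentBuilderFactory together with a specific XPath. It assumes the parser in use recognizes the Xerces load-external-dtd feature (the parser bundled with the JDK is Xerces-derived); the XML, the DTD name, and the class name are hypothetical. It only demonstrates skipping the DTD and is not a claim about performance on a 100 MB file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SkipDtdDemo {
    // Hypothetical input whose DOCTYPE points at a DTD that does not exist.
    static final String XML =
        "<!DOCTYPE root SYSTEM \"nonexistent.dtd\">"
        + "<root><HEADER>hello</HEADER></root>";

    static String readHeader() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            // Xerces-specific feature; assumed to be recognized by the parser in use.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            // An absolute path avoids the document-wide search that //HEADER implies.
            return XPathFactory.newInstance().newXPath().evaluate("/root/HEADER", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readHeader()); // hello
    }
}
```

Without the feature, the parser would try to fetch nonexistent.dtd and fail, even though validation is off.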
I have the following code:
DocumentBuilder dBuilder = dbFactory_.newDocumentBuilder();
StringReader reader = new StringReader(s);
InputSource inputSource = new InputSource(reader);
Document doc_ = dBuilder.parse(inputSource);
and then I would like to create a new element in that node right under the root node with this code:
Node node = doc_.createElement("New_Node");
node.setNodeValue("New_Node_value");
doc_.getDocumentElement().appendChild(node);
The problem is that the node gets created and appended, but the value isn't set. I don't know whether the value is just hidden in some way when I look at my XML, but I don't think that's the case, because I've tried to get the node value after the createElement call and it returns null.
I'm new to xml and dom and I don't know where the value of the new node is stored. Is it like an attribute?
<New_Node value="New_Node_value" />
or does it put value here:
<New_Node> New_Node_value </New_Node>
Any help would be greatly appreciated,
Thanks, Josh
The following code:
Element node = doc_.createElement("New_Node");
node.setTextContent("This is the content"); //adds content
node.setAttribute("attrib", "attrib_value"); //adds an attribute
produces:
<New_Node attrib="attrib_value">This is the content</New_Node>
Hope this clarifies.
For clarification, when you create nodes use:
Attr x = doc.createAttribute(...);
Comment x = doc.createComment(...);
Element x = doc.createElement(...); // as @dogbane pointed out
Text x = doc.createTextNode(...);
instead of using the generic Node for what you get back from each method. It will make your code easier to read/debug.
Secondly, the getNodeValue() / setNodeValue() methods work differently depending on what type of Node you have. See the summary of the Node class for reference. For an Element, you can't use these methods, although for a Text node you can.
As @dogbane pointed out, use setTextContent() for the text between this element's tags. Note that this will destroy any existing child elements.
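The points above can be sketched as follows. The class and method names are made up for illustration; the element name New_Node comes from the question.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

class NodeValueByTypeDemo {
    // Builds <root><New_Node>New_Node_value</New_Node></root> and returns New_Node.
    static Element build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);

            Element node = doc.createElement("New_Node");
            node.setNodeValue("ignored");        // no effect on an Element

            Text text = doc.createTextNode("draft");
            text.setNodeValue("New_Node_value"); // Text nodes do carry a value
            node.appendChild(text);
            root.appendChild(node);
            return node;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element node = build();
        System.out.println(node.getNodeValue());   // null
        System.out.println(node.getTextContent()); // New_Node_value

        // setTextContent replaces all children, including child elements.
        node.appendChild(node.getOwnerDocument().createElement("child"));
        node.setTextContent("replaced");
        System.out.println(node.getChildNodes().getLength()); // 1
    }
}
```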
Here is another solution. It works in my case, where setTextContent() does not exist: I am working with Google Web Toolkit (GWT), a Java development framework, and I imported its XMLParser library so that I can use a DOM parser.
import com.google.gwt.xml.client.XMLParser;
Document doc = XMLParser.createDocument();
Element node = doc.createElement("New_Node");
node.appendChild(doc.createTextNode("value"));
doc.appendChild(node);
The result is:
<New_Node> value </New_Node>
<New_Node value="New_Node_value" />
Here 'value' is an attribute of the New_Node element. For getting into DOM, I suggest http://www.w3schools.com/htmldom/default.asp
Is there a way to set Java's XPath to have a default namespace prefix for expressions? For example, instead of "/html:html/html:head/html:title/text()", the query could be "/html/head/title/text()".
While using the namespace prefix works, there has to be a more elegant way.
Sample code snippet of what I'm doing now:
Node node = ... // DOM of a HTML document
XPath xpath = XPathFactory.newInstance().newXPath();
// set to a NamespaceContext that simply returns the prefix "html"
// and namespace URI "http://www.w3.org/1999/xhtml"
xpath.setNamespaceContext(new HTMLNameSpace());
String expression = "/html:html/html:head/html:title/text()";
String value = xpath.evaluate(expression, node);
Unfortunately, no. There was some talk about defining a default namespace for JxPath a few years ago, but a quick look at the latest docs doesn't indicate that anything happened. You might want to spend some more time looking through the docs, though.
One thing that you could do, if you really don't care about namespaces, is to parse the document without them. Simply omit the call that you're currently making to DocumentBuilderFactory.setNamespaceAware().
Also, note that your prefix can be anything you want; it doesn't have to match the prefix in the instance document. So you could use h rather than html, and minimize the visual clutter of the prefix.
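Both suggestions can be tried side by side in a small sketch. The inline XHTML, the prefix h, and the class and method names are assumptions chosen for illustration; the NamespaceContext here only knows that one prefix.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ShortPrefixDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    // Hypothetical document using XHTML as its default namespace.
    static final String XML =
        "<html xmlns=\"" + XHTML_NS + "\"><head><title>Hi</title></head></html>";

    // Parses the sample with or without namespace support and evaluates the expression.
    static String title(boolean namespaceAware, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Maps the arbitrary prefix "h" to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Namespace-aware parse: a short prefix of our own choosing works.
        System.out.println(title(true, "/h:html/h:head/h:title/text()"));
        // Parse without namespace support: unprefixed names match.
        System.out.println(title(false, "/html/head/title/text()"));
    }
}
```

The trade-off is that the second variant discards namespace information for the whole document, so it is only suitable when nothing else depends on namespaces.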
I haven't actually tried this, but according to the NamespaceContext documentation, the namespace context with the prefix "" (empty string) is considered to be the default namespace.
I was a little bit too quick on that one. The XPath evaluator does not invoke the NamespaceContext to resolve the "" prefix, if no prefix is used at all in the XPath expression "/html/head/title/text()". I'm now going into XML details, which I am not 100% sure about, but using an expression like "/:html/:head/:title/text()" works with Sun JDK 1.6.0_16 and the NamespaceContext is asked to resolve an empty prefix (""). Is this really correct and expected behaviour or a bug in Xalan?
I know this question is old, but I just spent 3 hours researching this problem, and @kdgregory's answer helped me out a lot. I just wanted to show exactly what I did using kdgregory's answer as a guide.
The problem is that XPath in Java doesn't even look for a namespace if you don't have a prefix in your query; therefore, to map a query to a specific namespace, you have to add a prefix to the query. I used an arbitrary prefix to map to the schema name. For this example I will use the OP's namespace and query and the prefix abc. Your new expression would look like this:
String expression = "/abc:html/abc:head/abc:title/text()";
Then do the following
1) Make sure your document is set to namespace aware.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
2) Implement a NamespaceContext that will resolve your prefix. I took this one from another post on SO and modified it a bit.
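import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import org.w3c.dom.Document;
public class NamespaceResolver implements NamespaceContext {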
private final Document document;
public NamespaceResolver(Document document) {
this.document = document;
}
public String getNamespaceURI(String prefix) {
if(prefix.equals("abc")) {
// here is where you set your namespace
return "http://www.w3.org/1999/xhtml";
} else if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return document.lookupNamespaceURI(null);
} else {
return document.lookupNamespaceURI(prefix);
}
}
public String getPrefix(String namespaceURI) {
return document.lookupPrefix(namespaceURI);
}
@SuppressWarnings("rawtypes")
public Iterator getPrefixes(String namespaceURI) {
// not implemented
return null;
}
}
3) When creating your XPath object set your NamespaceContext.
xPath.setNamespaceContext(new NamespaceResolver(document));
Now, no matter what the actual schema prefix is, you can use your own prefix that will map to the proper schema. Your full code using the class above would look something like this:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document document = factory.newDocumentBuilder().parse(sourceDocFile);
XPathFactory xPFactory = XPathFactory.newInstance();
XPath xPath = xPFactory.newXPath();
xPath.setNamespaceContext(new NamespaceResolver(document));
String expression = "/abc:html/abc:head/abc:title/text()";
String value = xPath.evaluate(expression, document);