Reading all the namespaces in a DOM document - java

I want to read all the namespaces in a DOM document.
My input XML file is:
<a:Sample xmlns:a="http://a.org/"
          xmlns:b="http://b.org/">
    <a:Element b:Attribute="text"> </a:Element>
</a:Sample>
I want to get all the prefixes with their associated namespaces in the given input XML.
I have a method with the following definition.
public Document check(Document srcfile) {
    Document naReport = null;
    if (srcfile != null) {
        // Parse the document using builder.
        if (srcfile instanceof DocumentTraversal) {
            DocumentTraversal dt = (DocumentTraversal) srcfile;
            NodeIterator i = dt.createNodeIterator(srcfile, NodeFilter.SHOW_ELEMENT, null, false);
            System.out.println(srcfile.getPrefix());
            System.out.println(srcfile.getNamespaceURI());
            Element element = (Element) i.nextNode();
            while (element != null) {
                String prefix = element.getPrefix();
                if (prefix != null) {
                    String uri = element.getNamespaceURI();
                    System.out.println("Prefix: " + prefix);
                    System.out.println("URI: " + uri);
                    // bindings.put(prefix, uri);
                }
                element = (Element) i.nextNode();
            }
        }
    }
    return naReport;
}
But, when I run my program, I'm getting the following output:
Prefix: a
URI: http://a.org/
Prefix: a
URI: http://a.org/
Could someone help me?

You will need to loop over the attributes of each element inside your main element loop:
NamedNodeMap map = element.getAttributes();
for (int iattr = 0; iattr < map.getLength(); iattr++) {
    Attr attr = (Attr) map.item(iattr);
    if (attr.getNamespaceURI() != null) {
        System.out.println("Attr " + attr.getName() + " - " + attr.getNamespaceURI());
    }
}
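If what you actually want is every prefix-to-URI binding declared in the document (not just the ones used on elements and attributes), you can look for the xmlns declarations themselves, which a namespace-aware parser exposes as attributes in the http://www.w3.org/2000/xmlns/ namespace. A minimal sketch; the collectBindings helper name is mine, not part of any standard API:
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Collect every prefix -> namespace URI binding declared anywhere in the document.
// Requires the document to have been parsed with setNamespaceAware(true).
public static Map<String, String> collectBindings(Document doc) {
    Map<String, String> bindings = new HashMap<String, String>();
    NodeIterator it = ((DocumentTraversal) doc)
            .createNodeIterator(doc, NodeFilter.SHOW_ELEMENT, null, false);
    for (Element e = (Element) it.nextNode(); e != null; e = (Element) it.nextNode()) {
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            // xmlns:a="..." declarations live in the special "http://www.w3.org/2000/xmlns/" namespace
            if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(attr.getNamespaceURI())) {
                bindings.put(attr.getLocalName(), attr.getValue()); // local name "xmlns" = default namespace
            }
        }
    }
    return bindings;
}
For the input above this should yield a -> http://a.org/ and b -> http://b.org/.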

Related

XPath selector for nodes whose ancestors are not a specific node

I'm writing an XPath selector to select all o:OLEObject nodes provided that their ancestor is not w:del, but the nodes inside w:del are still included in the result. Can you help me fix it?
Here is my script:
public List<String> getAllOleObjectId(XmlObject wobj) {
    List<String> lstOfOleObjIds = new ArrayList<String>();
    XmlCursor cursorForOle = wobj.newCursor();
    if (cursorForOle != null) {
        cursorForOle.selectPath(
            "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' " +
            "declare namespace o='urn:schemas-microsoft-com:office:office' " +
            ".//*/o:OLEObject[ancestor::*[not(self::w:del)]]"
        );
        while (cursorForOle.hasNextSelection()) {
            cursorForOle.toNextSelection();
            XmlObject oleObj = cursorForOle.getObject();
            Node oleDomNode = oleObj.getDomNode();
            NamedNodeMap domAttrObj = oleDomNode.getAttributes();
            lstOfOleObjIds.add(domAttrObj.getNamedItem("r:id").getNodeValue());
        }
    }
    cursorForOle.dispose();
    return lstOfOleObjIds;
}
You should replace your XPath with:
//*/o:OLEObject[not(ancestor::w:del)]
This selects o:OLEObject elements that are children of any (*) element and that have no w:del ancestor element.
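Applied to the selectPath call from the question, that looks like:
cursorForOle.selectPath(
    "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' " +
    "declare namespace o='urn:schemas-microsoft-com:office:office' " +
    "//*/o:OLEObject[not(ancestor::w:del)]"   // only OLEObjects with no w:del ancestor
);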

Reading XML tags in java, code optimization

What I am actually doing is writing a recursive function which reads the tags in the XML. Below is the code:
private void readTag(org.w3c.dom.Node item, String histoTags, String fileName, Hashtable<String, String> tagsInfos) {
    try {
        if (item.getNodeType() == Node.ELEMENT_NODE) {
            NodeList itemChilds = item.getChildNodes();
            for (int i = 0; i < itemChilds.getLength(); i++) {
                org.w3c.dom.Node itemChild = itemChilds.item(i);
                readTag(itemChild, histoTags + "|" + item.getNodeName(), fileName, tagsInfos);
            }
        } else if (item.getNodeType() == Node.TEXT_NODE) {
            tagsInfos.put(histoTags, item.getNodeValue());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
This function takes some time to execute. The XML the function reads is in this format:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
    <Mouvement>
        <Com>
            <IdCom>32R01000000772669473</IdCom>
            <RefCde>32R</RefCde>
            <Edit>0</Edit>
        </Com>
    </Mouvement>
</Document>
Is there any way of optimizing this code in Java?
Two optimizations; I don't know how much they will help:
Don't use getChildNodes(). Use getFirstChild() and getNextSibling().
Reuse a single StringBuilder instead of creating a new one for every element (implicitly done by histoTags + "|" + item.getNodeName()).
But you should also be aware that the text content of an element node may be seen as a combination of multiple TEXT and CDATA nodes.
Your code will also work better if it works on elements, not nodes.
private static void readTag(Element elem, StringBuilder histoTags, String fileName, Hashtable<String, String> tagsInfos) {
    int histoLen = histoTags.length();
    CharSequence textContent = null;
    boolean hasChildElement = false;
    for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
        switch (child.getNodeType()) {
            case Node.ELEMENT_NODE:
                histoTags.append('|').append(child.getNodeName());
                readTag((Element) child, histoTags, fileName, tagsInfos);
                histoTags.setLength(histoLen);
                hasChildElement = true;
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                //uncomment to test: System.out.println(histoTags + ": \"" + child.getTextContent() + "\"");
                if (textContent == null)
                    // Optimization: Don't copy to a StringBuilder if only one text node will be found
                    textContent = child.getTextContent();
                else if (textContent instanceof StringBuilder)
                    // Ok, now we need a StringBuilder to collect text from multiple nodes
                    ((StringBuilder) textContent).append(child.getTextContent());
                else
                    // And we keep collecting text from multiple nodes
                    textContent = new StringBuilder(textContent).append(child.getTextContent());
                break;
            default:
                // ignore all others
        }
    }
    if (textContent != null) {
        String text = textContent.toString();
        // Suppress pure whitespace content on elements with child elements, i.e. structural whitespace
        if (!hasChildElement || !text.trim().isEmpty())
            tagsInfos.put(histoTags.toString(), text);
    }
}
Test
String xml = "<root>\n" +
        " <tag>hello <![CDATA[world]]> Foo <!-- comment --> Bar</tag>\n" +
        "</root>\n";
Element docElem = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
Hashtable<String, String> tagsInfos = new Hashtable<>();
readTag(docElem, new StringBuilder(docElem.getNodeName()), "fileName", tagsInfos);
System.out.println(tagsInfos);
Output (with print uncommented)
root: "
"
root|tag: "hello "
root|tag: "world"
root|tag: " Foo "
root|tag: " Bar"
root: "
"
{root|tag=hello world Foo Bar}
See how splitting the text inside the <tag> node using CDATA and comments caused the DOM node to contain multiple TEXT/CDATA child nodes.
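If you would rather have the parser do that merging for you, a JAXP DocumentBuilderFactory can be configured before parsing. A minimal sketch, assuming the same xml string as in the test above:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setCoalescing(true);        // merge CDATA sections into adjacent Text nodes
dbf.setIgnoringComments(true);  // drop comments so they no longer split the text
Element docElem = dbf.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
With both options set, the <tag> element above should come back with a single text child (at least with the default JDK parser), so the StringBuilder branch is rarely needed.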

Empty / Null Nodes returned from getChildNodes

I'm trying to parse the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<docusign-cfg>
    <tagConfig>
        <tags>
            <approve>approve</approve>
            <checkbox>checkbox</checkbox>
            <company>company</company>
            <date>date</date>
            <decline>decline</decline>
            <email>email</email>
            <emailAddress>emailAddress</emailAddress>
            <envelopeID>envelopeID</envelopeID>
            <firstName>firstName</firstName>
            <lastName>lastName</lastName>
            <number>number</number>
            <ssn>ssn</ssn>
            <zip>zip</zip>
            <signHere>signHere</signHere>
            <checkbox>checkbox</checkbox>
            <initialHere>initialHere</initialHere>
            <dateSigned>dateSigned</dateSigned>
            <fullName>fullName</fullName>
        </tags>
    </tagConfig>
</docusign-cfg>
I want to read either the name or content of each tag in the <tags> tag. I can do so with the following code:
public String[] getAvailableTags() throws Exception
{
    String path = "/docusign-cfg/tagConfig/tags";
    XPathFactory f = XPathFactory.newInstance();
    XPath x = f.newXPath();
    Object result = null;
    try
    {
        XPathExpression expr = x.compile(path);
        result = expr.evaluate(doc, XPathConstants.NODE);
    }
    catch (XPathExpressionException e)
    {
        throw new Exception("An error occurred while trying to retrieve the tags");
    }
    Node node = (Node) result;
    NodeList childNodes = node.getChildNodes();
    String[] tags = new String[childNodes.getLength()];
    System.out.println(tags.length);
    for (int i = 0; i < tags.length; i++)
    {
        String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
        if (childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
                childNodes.item(i).getNodeName() != null)
        {
            tags[i] = content;
        }
    }
    return tags;
}
After some searching I found that parsing it this way causes the whitespace between nodes/tags to be read as children. In this case the whitespace text nodes are considered children of <tags>.
My output:
37
null
approve
null
checkbox
null
company
null
date
null
decline
null
email
null
emailAddress
null
envelopeID
null
firstName
null
lastName
null
number
null
ssn
null
zip
null
signHere
null
checkbox
null
initialHere
null
dateSigned
null
fullName
null
37 is the number of nodes it found in <tags>
Everything below 37 is the content of the tag array.
How are these null elements being added to the tag array despite my checking for null?
I think that is because of the indexing of the tags array. When the if check fails, the loop still advances the index, so no value is inserted at that position and it stays null. Use a separate index for the tags array:
int j = 0;
for (int i = 0; i < tags.length; i++)
{
    String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
    if (childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
            childNodes.item(i).getNodeName() != null)
    {
        tags[j++] = content;
    }
}
Since you are omitting some of the child nodes, creating an array the full length of the child node list wastes memory. You can use a List instead; if you specifically need a String array, you can convert the list to an array afterwards.
public String[] getAvailableTags() throws Exception
{
    String path = "/docusign-cfg/tagConfig/tags";
    XPathFactory f = XPathFactory.newInstance();
    XPath x = f.newXPath();
    Object result = null;
    try
    {
        XPathExpression expr = x.compile(path);
        result = expr.evaluate(doc, XPathConstants.NODE);
    }
    catch (XPathExpressionException e)
    {
        throw new Exception("An error occurred while trying to retrieve the tags");
    }
    Node node = (Node) result;
    NodeList childNodes = node.getChildNodes();
    List<String> tags = new ArrayList<String>();
    for (int i = 0; i < childNodes.getLength(); i++)
    {
        String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
        if (childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
                childNodes.item(i).getNodeName() != null)
        {
            tags.add(content);
        }
    }
    String[] tagsArray = tags.toArray(new String[tags.size()]);
    return tagsArray;
}
The contents of the tags array default to null. So it is not a case of how an element becomes null; it is a case of it being left as null.
To prove this to yourself, add an else block like this:
if (childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
        childNodes.item(i).getNodeName() != null)
{
    tags[i] = content;
} else {
    tags[i] = "Foo Bar";
}
You should now see 'Foo Bar' instead of null.
A better solution here would be to use an ArrayList and append the tags to it instead of using an array. Then you do not need to track indexes, so there is less chance of this type of bug.
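Another option (a sketch, not taken from the answers above): since the code already runs an XPath query, you can have XPath return only the element children, so the whitespace text nodes never show up at all:
// Select only element children of <tags>; whitespace text nodes are never returned.
XPath x = XPathFactory.newInstance().newXPath();
NodeList tagNodes = (NodeList) x.evaluate("/docusign-cfg/tagConfig/tags/*",
        doc, XPathConstants.NODESET);
String[] tags = new String[tagNodes.getLength()];
for (int i = 0; i < tagNodes.getLength(); i++) {
    tags[i] = tagNodes.item(i).getNodeName();
}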

Convert Iterator to a for loop with index in order to skip objects

I am using Jericho HTML Parser to parse some malformed html. In particular I am trying to get all text nodes, process the text and then replace it.
I want to skip specific elements from processing. For example, I want to skip all <a> elements and any element that has the attribute class="noProcess". So, if a div has class="noProcess" then I want to skip this div and all its children from processing. However, I do want these skipped elements to remain in the output after processing.
Jericho provides an Iterator for all nodes but I am not sure how to skip complete elements from the Iterator. Here is my code:
private String doProcessHtml(String html) {
    Source source = new Source(html);
    OutputDocument outputDocument = new OutputDocument(source);
    for (Segment segment : source) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            System.out.println("FOUND TAG: " + tag.getName());
            // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        } else if (segment instanceof CharacterReference) {
            CharacterReference characterReference = (CharacterReference) segment;
            System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        } else {
            System.out.println("FOUND PLAIN TEXT: " + segment.toString());
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
    return outputDocument.toString();
}
It doesn't look like using the ignoreWhenParsing() method works for me as the parser just treats the "ignored" element as text.
I was thinking that if I could convert the Iterator loop to a for (int i = 0; ...) loop, I could probably skip the element and all its children by moving i to point to the EndTag and then continuing the loop... but I'm not sure.
I think you might want to consider a redesign of the way your segments are built. Is there a way to parse the html in such a way that each segment is a parent element that contains a nested list of child elements? That way you could do something like:
for (Segment segment : source) {
    if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        continue;
    } else if (segment instanceof CharacterReference) {
        CharacterReference characterReference = (CharacterReference) segment;
        System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        for (Segment child : segment.childNodes()) {
            // Use recursion to process child elements.
            // You will want to put your for loop in a separate method so it can be called recursively.
        }
    } else {
        System.out.println("FOUND PLAIN TEXT: " + segment.toString());
        outputDocument.replace(segment, doProcessText(segment.toString()));
    }
}
Without more code to inspect, it's hard to determine if restructuring the segment element is even possible or worth the effort.
I managed to get a working solution by using the getEnd() method of the Tag's Element. The idea is to skip any segment that begins before a position you set: you find the end position of the element you want to exclude, and you do not process anything else before that position:
final ArrayList<String> excludeTags = new ArrayList<String>(Arrays.asList(new String[] {"head", "script", "a"}));
final ArrayList<String> excludeClasses = new ArrayList<String>(Arrays.asList(new String[] {"noProcess"}));
Source.LegacyIteratorCompatabilityMode = true;
Source source = new Source(htmlToProcess);
OutputDocument outputDocument = new OutputDocument(source);
int skipToPos = 0;
for (Segment segment : source) {
    if (segment.getBegin() >= skipToPos) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            Element element = tag.getElement();
            // check excludeTags
            if (excludeTags.contains(tag.getName().toLowerCase())) {
                skipToPos = element.getEnd();
            }
            // check excludeClasses
            String classes = element.getAttributeValue("class");
            if (classes != null) {
                for (String theClass : classes.split(" ")) {
                    if (excludeClasses.contains(theClass.toLowerCase())) {
                        skipToPos = element.getEnd();
                    }
                }
            }
        } else if (segment instanceof CharacterReference) { // for future use. Source.LegacyIteratorCompatabilityMode = true;
            CharacterReference characterReference = (CharacterReference) segment;
        } else {
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
}
return outputDocument.toString();
This should work.
String skipTag = null;
for (Segment segment : source) {
    if (skipTag != null) {                                      // is skipping ON?
        if (segment instanceof EndTag &&                        // if EndTag found for the
                skipTag.equals(((EndTag) segment).getName())) { // tag we're skipping
            skipTag = null;                                     // set skipping OFF
        }
        continue;                                               // continue skipping (or skip the EndTag)
    } else if (segment instanceof Tag) {                        // is tag?
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        if (HTMLElementName.A.equals(tag.getName())) {          // if <a> ?
            skipTag = tag.getName();                            // set
            continue;                                           // skipping ON
        } else if (tag instanceof StartTag) {
            if ("noProcess".equals(                             // if <tag class="noProcess" ..> ?
                    ((StartTag) tag).getAttributeValue("class"))) {
                skipTag = tag.getName();                        // set
                continue;                                       // skipping ON
            }
        }
    } // ...
}

Get Xpath from the org.w3c.dom.Node

Can I get the full XPath from an org.w3c.dom.Node?
Say the node currently points somewhere in the middle of the XML document. I would like to extract the XPath for that element.
The output XPath I'm looking for is //parent/child1/child2/child3/node, i.e. a parent-to-node XPath. Just ignore XPaths that use expressions and point to the same node.
There's no generic method for getting the XPath, mainly because there's no one generic XPath that identifies a particular node in the document. In some schemas, nodes will be uniquely identified by an attribute (id and name are probably the most common attributes.) In others, the name of each element (that is, the tag) is enough to uniquely identify a node. In a few (unlikely, but possible) cases, there's no one unique name or attribute that takes you to a specific node, and so you'd need to use cardinality (get the n'th child of the m'th child of...).
EDIT:
In most cases, it's not hard to create a schema-dependent function to assemble an XPath for a given node. For example, suppose you have a document where every node is uniquely identified by an id attribute, and you're not using namespaces. Then (I think) the following pseudo-Java would work to return an XPath based on those attributes. (Warning: I have not tested this.)
String getXPath(Node node)
{
    Node parent = node.getParent();
    if (parent == null) {
        return "/" + node.getTagName();
    }
    return getXPath(parent) + "/" + "[@id='" + node.getAttribute("id") + "']";
}
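A minimal runnable sketch of the same idea; the name getXPathById and the assumption that every non-root element carries a unique id attribute are mine, not part of the original answer:
// Sketch: build an XPath of the form /root/child[@id='...']/... for a namespace-free document
// in which every non-root element has a unique "id" attribute.
static String getXPathById(org.w3c.dom.Element element) {
    org.w3c.dom.Node parent = element.getParentNode();
    if (parent == null || parent.getNodeType() == org.w3c.dom.Node.DOCUMENT_NODE) {
        return "/" + element.getTagName();          // root element: just its name
    }
    return getXPathById((org.w3c.dom.Element) parent)
            + "/" + element.getTagName() + "[@id='" + element.getAttribute("id") + "']";
}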
I am working for the company behind jOOX, a library that provides many useful extensions to the Java standard DOM API, mimicking the jquery API. With jOOX, you can obtain the XPath of any element like this:
String path = $(element).xpath();
The above path will then be something like this:
/document[1]/library[2]/books[3]/book[1]
I've taken this code from Mikkel Flindt's post and modified it so it can work for an attribute node.
public static String getFullXPath(Node n) {
    // abort early
    if (null == n)
        return null;
    // declarations
    Node parent = null;
    Stack<Node> hierarchy = new Stack<Node>();
    StringBuffer buffer = new StringBuffer();
    // push element on stack
    hierarchy.push(n);
    switch (n.getNodeType()) {
        case Node.ATTRIBUTE_NODE:
            parent = ((Attr) n).getOwnerElement();
            break;
        case Node.ELEMENT_NODE:
            parent = n.getParentNode();
            break;
        case Node.DOCUMENT_NODE:
            parent = n.getParentNode();
            break;
        default:
            throw new IllegalStateException("Unexpected Node type" + n.getNodeType());
    }
    while (null != parent && parent.getNodeType() != Node.DOCUMENT_NODE) {
        // push on stack
        hierarchy.push(parent);
        // get parent of parent
        parent = parent.getParentNode();
    }
    // construct xpath
    Object obj = null;
    while (!hierarchy.isEmpty() && null != (obj = hierarchy.pop())) {
        Node node = (Node) obj;
        boolean handled = false;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            // is this the root element?
            if (buffer.length() == 0) {
                // root element - simply append element name
                buffer.append(node.getNodeName());
            } else {
                // child element - append slash and element name
                buffer.append("/");
                buffer.append(node.getNodeName());
                if (node.hasAttributes()) {
                    // see if the element has a name or id attribute
                    if (e.hasAttribute("id")) {
                        // id attribute found - use that
                        buffer.append("[@id='" + e.getAttribute("id") + "']");
                        handled = true;
                    } else if (e.hasAttribute("name")) {
                        // name attribute found - use that
                        buffer.append("[@name='" + e.getAttribute("name") + "']");
                        handled = true;
                    }
                }
                if (!handled) {
                    // no known attribute we could use - get sibling index
                    int prev_siblings = 1;
                    Node prev_sibling = node.getPreviousSibling();
                    while (null != prev_sibling) {
                        if (prev_sibling.getNodeType() == node.getNodeType()) {
                            if (prev_sibling.getNodeName().equalsIgnoreCase(
                                    node.getNodeName())) {
                                prev_siblings++;
                            }
                        }
                        prev_sibling = prev_sibling.getPreviousSibling();
                    }
                    buffer.append("[" + prev_siblings + "]");
                }
            }
        } else if (node.getNodeType() == Node.ATTRIBUTE_NODE) {
            buffer.append("/@");
            buffer.append(node.getNodeName());
        }
    }
    // return buffer
    return buffer.toString();
}
For me this one worked best (using org.w3c.dom elements):
String getXPath(Node node)
{
    Node parent = node.getParentNode();
    if (parent == null)
    {
        return "";
    }
    return getXPath(parent) + "/" + node.getNodeName();
}
Some IDEs specialised in XML will do that for you.
Here are the most well-known:
oXygen
Stylus Studio
xmlSpy
For instance in oXygen, you can right-click on an element part of an XML document and the contextual menu will have an option 'Copy Xpath'.
There are also a number of Firefox add-ons (such as XPather) that will happily do the job for you. For XPather, you just click on a part of the web page, select 'Show in XPather' in the contextual menu, and you're done.
But, as Dan has pointed out in his answer, the XPath expression will be of limited use. It will not include predicates, for instance. Rather, it will look like this:
/root/nodeB[2]/subnodeX[2]
For a document like
<root>
    <nodeA>stuff</nodeA>
    <nodeB>more stuff</nodeB>
    <nodeB cond="thisOne">
        <subnodeX>useless stuff</subnodeX>
        <subnodeX id="MyCondition">THE STUFF YOU WANT</subnodeX>
        <subnodeX>more useless stuff</subnodeX>
    </nodeB>
</root>
The tools I listed will not generate
/root/nodeB[#cond='thisOne']/subnodeX[#id='MyCondition']
For instance, for an HTML page you'll end up with the pretty useless expression:
/html/body/div[6]/p[3]
And that's to be expected. If they had to generate predicates, how would they know which condition is relevant? There are zillions of possibilities.
Something like this will give you a simple XPath:
public String getXPath(Node node) {
    return getXPath(node, "");
}

public String getXPath(Node node, String xpath) {
    if (node == null) {
        return "";
    }
    String elementName = "";
    if (node instanceof Element) {
        elementName = ((Element) node).getLocalName();
    }
    Node parent = node.getParentNode();
    if (parent == null) {
        return xpath;
    }
    return getXPath(parent, "/" + elementName + xpath);
}
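A hypothetical usage of the method above, where sample.xml stands for the <root> document shown earlier in this question:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);                      // getLocalName() returns null without this
Document doc = dbf.newDocumentBuilder().parse(new File("sample.xml"));
Node wanted = doc.getElementsByTagName("subnodeX").item(1);
System.out.println(getXPath(wanted));             // prints /root/nodeB/subnodeX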
