Match specific html attribute values

Match specific html attribute values - java

I would like to match all attribute values for id, class, name and for! I created a simple function for that task.
private Collection<String> getAttributes(final String htmlContent) {
final Set<String> attributes = new HashSet<>();
final Pattern pattern = Pattern.compile("(class|id|for|name)=\\\"(.*?)\\\"");
final Matcher matcher = pattern.matcher(htmlContent);
while (matcher.find()) {
attributes.add(matcher.group(2));
}
return attributes;
}
Example html content:
<input id="test" name="testName" class="aClass bClass" type="input" />
How can I split html classes via regular expression, so that I get the following result set:
test
testName
aClass
bClass
And is there any way to improve my code? I really don't like the loop.

If you take a look at the JSoup library you can find useful tools for html parsing and manipulation.
For example:
Document doc = ...//create HTML document
Elements htmlElements = doc.children();
htmlElements.traverse(new MyHtmlElementVisitor());
The class MyHtmlElementVisitor simply has to implement NodeVisitor and can access the Node attributes.
Though you might find a good regex for the same job, it has several drawbacks. Just to name a few:
hard to find a failsafe regex for every possible html document
hard to read, therefore difficult to find bugs and implement changes
the regex usually isn't reusable

Don't use regular expressions for parsing HTML. Seriously, it's more complicated than you think.
If your document is actually XHTML, you can use XPath:
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(
"//#*["
+ "local-name()='class'"
+ " or local-name()='id'"
+ " or local-name()='for'"
+ " or local-name()='name'"
+ "]",
new InputSource(new StringReader(htmlContent)),
XPathConstants.NODESET);
int count = nodes.getLength();
for (int i = 0; i < count; i++) {
Collections.addAll(attributes,
nodes.item(i).getNodeValue().split("\\s+"));
}
If it's not XHTML, you can use Swing's HTML parsing:
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
private final Object[] attributesOfInterest = {
HTML.Attribute.CLASS,
HTML.Attribute.ID,
"for",
HTML.Attribute.NAME,
};
private void addAttributes(AttributeSet attr) {
for (Object a : attributesOfInterest) {
Object value = attr.getAttribute(a);
if (value != null) {
Collections.addAll(attributes,
value.toString().split("\\s+"));
}
}
}
#Override
public void handleStartTag(HTML.Tag tag,
MutableAttributeSet attr,
int pos) {
addAttributes(attr);
super.handleStartTag(tag, attr, pos);
}
#Override
public void handleSimpleTag(HTML.Tag tag,
MutableAttributeSet attr,
int pos) {
addAttributes(attr);
super.handleSimpleTag(tag, attr, pos);
}
};
HTMLDocument doc = (HTMLDocument)
new HTMLEditorKit().createDefaultDocument();
doc.getParser().parse(new StringReader(htmlContent), callback, true);
As for doing it without a loop, I don't think that's possible. But any implementation is going to use one or more loops internally anyway.

Related

Match set of simple xpaths with SAX

I have a set of simple xpaths involving only tags and attributes, no predicates. My XML input has a size of several MB so I want to use a streaming XML parser.
How can I match the streaming XML parser against the set of xapths to retrieve one value for each xpath?
The crux seems to build the right data structure from the set of xpaths so it can be evaluated based on the xml events.
This seems like a fairly common task but I couldn't find any readily available solutions.

To match a streaming XML parser against a set of simple xpaths, you can use the following steps:
Create a Map<String, String> to store the xpaths and their corresponding values. Initialize the values to null.
Create a Stack<String> to keep track of the current path of the XML elements.
Create a SAXParser and a DefaultHandler to parse the XML input.
In the startElement method of the handler, push the element name to the stack and append it to the current path. Then, check if the current path matches any of the xpaths in the map. If yes, set a flag to indicate that the value should be extracted.
In the endElement method of the handler, pop the element name from the stack and remove it from the current path. Then, reset the flag to indicate that the value should not be extracted.
In the characters method of the handler, check if the flag is set. If yes, append the character data to the value of the matching xpath in the map.
After parsing the XML input, return the map with the xpaths and their values.
Explanation
A streaming XML parser, such as SAXParser, reads the XML input sequentially and triggers events when it encounters different parts of the document, such as start tags, end tags, text, etc. It does not build a tree structure of the document in memory, which makes it more efficient for large XML inputs.
An xpath is a syntax for selecting nodes from an XML document. It consists of a series of steps, separated by slashes, that describe the location of the desired node. For example, /bookstore/book/title selects the title element of the book element of the bookstore element.
A simple xpath involves only tags and attributes, no predicates. For example, /bookstore/book[#lang='en']/title selects the title element of the book element that has an attribute lang with value en.
To match a streaming XML parser against a set of simple xpaths, we need to keep track of the current path of the XML elements as we parse the input, and compare it with the xpaths in the set. If we find a match, we need to extract the value of the node and store it in a map. We also need to handle the cases where the node value spans across multiple character events, or where the node has multiple occurrences in the document.
Example
Suppose we have the following XML input:
<bookstore>
<book lang="en">
<title>Harry Potter and the Philosopher's Stone</title>
<author>J. K. Rowling</author>
<price>10.99</price>
</book>
<book lang="fr">
<title>Le Petit Prince</title>
<author>Antoine de Saint-Exupéry</author>
<price>8.50</price>
</book>
</bookstore>
And the following set of simple xpaths:
/bookstore/book/title
/bookstore/book/author
/bookstore/book[#lang='fr']/price
We can use the following Java code to match the streaming XML parser against the set of xpaths:
import java.io.*;
import java.util.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
public class XPathMatcher {
public static Map<String, String> match(InputStream xmlInput, Set<String> xpaths) throws Exception {
// Create a map to store the xpaths and their values
Map<String, String> map = new HashMap<>();
for (String xpath : xpaths) {
map.put(xpath, null);
}
// Create a stack to keep track of the current path
Stack<String> stack = new Stack<>();
// Create a SAXParser and a DefaultHandler to parse the XML input
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
// A flag to indicate if the value should be extracted
boolean extract = false;
// A variable to store the current path
String currentPath = "";
// A variable to store the matching xpath
String matchingXPath = "";
#Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
// Push the element name to the stack and append it to the current path
stack.push(qName);
currentPath += "/" + qName;
// Check if the current path matches any of the xpaths in the map
for (String xpath : map.keySet()) {
// If the xpath has an attribute, extract the attribute name and value
String attrName = "";
String attrValue = "";
if (xpath.contains("[#")) {
int start = xpath.indexOf("[#") + 2;
int end = xpath.indexOf("=");
attrName = xpath.substring(start, end);
start = end + 2;
end = xpath.indexOf("]");
attrValue = xpath.substring(start, end - 1);
}
// If the xpath matches the current path, and either has no attribute or has a matching attribute, set the flag and the matching xpath
if (xpath.startsWith(currentPath) && (attrName.isEmpty() || attrValue.equals(attributes.getValue(attrName)))) {
extract = true;
matchingXPath = xpath;
break;
}
}
}
#Override
public void endElement(String uri, String localName, String qName) throws SAXException {
// Pop the element name from the stack and remove it from the current path
stack.pop();
currentPath = currentPath.substring(0, currentPath.length() - qName.length() - 1);
// Reset the flag and the matching xpath
extract = false;
matchingXPath = "";
}
#Override
public void characters(char[] ch, int start, int length) throws SAXException {
// Check if the flag is set
if (extract) {
// Append the character data to the value of the matching xpath in the map
String value = map.get(matchingXPath);
if (value == null) {
value = "";
}
value += new String(ch, start, length);
map.put(matchingXPath, value);
}
}
};
// Parse the XML input
parser.parse(xmlInput, handler);
// Return the map with the xpaths and their values
return map;
}
public static void main(String[] args) throws Exception {
// Create an input stream from the XML file
InputStream xmlInput = new FileInputStream("bookstore.xml");
// Create a set of simple xpaths
Set<String> xpaths = new HashSet<>();
xpaths.add("/bookstore/book/title");
xpaths.add("/bookstore/book/author");
xpaths.add("/bookstore/book[#lang='fr']/price");
// Match the streaming XML parser against the set of xpaths
Map<String, String> map = match(xmlInput, xpaths);
// Print the results
for (String xpath : map.keySet()) {
System.out.println(xpath + " = " + map.get(xpath));
}
}
}
The output of the code is:
/bookstore/book/title = Harry Potter and the Philosopher's StoneLe Petit Prince
/bookstore/book/author = J. K. RowlingAntoine de Saint-Exupéry
/bookstore/book[#lang='fr']/price = 8.50

Attempting to use Java & Xpath to updated the value of any selected node inside an XML doc

I have developed GUI tool the displays an XML document as an editable JTree, and the user can select a node in the JTree and attempt to change the actual nodes value in the XML document.
The problem that I'm having is with constructing the correct Xpath query that attempts the actual update.
Here is GUI of the JTree showing which element was selected & should be edited:
Its a very large XMl, so here the collapsed snippet of the XML:
UPDATE (IGNORE ATTEMPT 1 & 2, 1ST ISSUE WAS RESOLVED, GO TO ATTEMPTS 3 & 4)
Attempt 1 # (relevant Java method that attempts to create XPath query to update a nodes value):
public void updateXmlData(JTree jTree, org.w3c.dom.Document doc, TreeNode parentNode, String oldValue, String newValue) throws XPathExpressionException {
System.out.println("Selected path=" + jTree.getSelectionPath().toString());
String[] pathTockens = jTree.getSelectionPath().toString().split(",");
StringBuilder sb = new StringBuilder();
//for loop to construct xpath query
for (int i = 0; i < pathTockens.length - 1; i++) {
if (i == 0) {
sb.append("//");
} else {
sb.append(pathTockens[i].trim());
sb.append("/");
}
}//end for loop
sb.append("text()");
System.out.println("Constructed XPath Query:" + sb.toString());
//new xpath
XPath xpath = XPathFactory.newInstance().newXPath();
//compile query
NodeList nodes = (NodeList) xpath.compile(sb.toString()).evaluate(doc, XPathConstants.NODESET);
//Make the change on the selected nodes
for (int idx = 0; idx < nodes.getLength(); idx++) {
Node value = nodes.item(idx).getAttributes().getNamedItem("value");
String val = value.getNodeValue();
value.setNodeValue(val.replaceAll(oldValue, newValue));
}
//set the new updated xml doc
SingleTask.currentTask.setDoc(doc);
}
Console logs:
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest.xml, Ingest, Property_Maps, identifier, identifieXYZ]
Constructed XPath Query://Ingest/Property_Maps/identifier/text()
Jan 26, 2021 2:04:16 PM com.xyz.XmlToXsdValidator.Views.EditXmlTreeNodeDialogJFrame jButtonOkEditActionPerformed
SEVERE: null
javax.xml.transform.TransformerException: Unable to evaluate expression using this context
at com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:368)
As you can see in the logs:
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest.xml, Ingest, Property_Maps, identifier, identifieXYZ]
Constructed XPath Query://Ingest/Property_Maps/identifier/text()
The paths are correct, basically Ingest->Property_Maps->identifier->text()
But Im getting:
javax.xml.transform.TransformerException: Unable to evaluate expression using this context
Attempt 2 # (relevant Java method that attempts to create XPath query to update a nodes value):
public void updateXmlData(JTree jTree, org.w3c.dom.Document doc, TreeNode parentNode, String oldValue, String newValue) throws XPathExpressionException {
// Locate the node(s) with xpath
System.out.println("Selected path=" + jTree.getSelectionPath().toString());
String[] pathTockens = jTree.getSelectionPath().toString().split(",");
StringBuilder sb = new StringBuilder();
//loop to construct xpath query
for (int i = 0; i < pathTockens.length - 1; i++) {
if (i == 0) {
sb.append("//");
} else {
sb.append(pathTockens[i].trim());
sb.append("/");
}
}//end loop
sb.append("[text()=");
sb.append("'");
sb.append(oldValue);
sb.append("']");
int lastIndexOfPathChar = sb.lastIndexOf("/");
sb.replace(lastIndexOfPathChar, lastIndexOfPathChar + 1, "");
System.out.println("Constructed XPath Query:" + sb.toString());
//new xpath instance
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(sb.toString(), doc, XPathConstants.NODESET);
//Make the change on the selected nodes
for (int idx = 0; idx < nodes.getLength(); idx++) {
Node value = nodes.item(idx).getAttributes().getNamedItem("value");
String val = value.getNodeValue();
value.setNodeValue(val.replaceAll(oldValue, newValue));
}
SingleTask.currentTask.setDoc(doc);
}
I was able to resolve the exception based Andreas comment, and there are no more exceptions/errors, however the XPath query does not find selected nodes. Returns empty
New updated code:
Attempt # 3 Using custom namespace resolver. References: https://www.kdgregory.com/index.php?page=xml.xpath
public boolean updateXmlData(JTree jTree, org.w3c.dom.Document doc, TreeNode parentNode, String oldValue, String newValue) throws XPathExpressionException {
System.out.println("Selected path=" + jTree.getSelectionPath().toString());
boolean changed = false;
// Locate the node(s) with xpath
String[] pathTockens = jTree.getSelectionPath().toString().split(",");
StringBuilder sb = new StringBuilder();
//loop to construct xpath query
for (int i = 0; i < pathTockens.length - 1; i++) {
if (i == 0) {
//do nothing
} else if (i == 1) {
sb.append("/ns:" + pathTockens[i].trim());
} else if (i > 1 && i != pathTockens.length - 1) {
sb.append("/ns:" + pathTockens[i].trim());
} else {
//sb.append("/" + pathTockens[i].trim());
}
}//end loop
sb.append("[text()=");
sb.append("'");
sb.append(oldValue);
sb.append("']");
System.out.println("Constructed XPath Query:" + sb.toString());
//new xpath instance
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
xpath.setNamespaceContext(new UniversalNamespaceResolver(SingleTask.currentTask.getXsdFile().getXsdNameSpace()));
NodeList nodes = (NodeList) xpath.evaluate(sb.toString(), doc, XPathConstants.NODESET);
//start for
Node node;
String val = null;
for (int idx = 0; idx < nodes.getLength(); idx++) {
if (nodes.item(idx).getAttributes() != null) {
node = nodes.item(idx).getAttributes().getNamedItem("value");
if (node != null) {
val = node.getNodeValue();
node.setNodeValue(val.replaceAll(oldValue, newValue));
changed = true;
break;
}//end if node is found
}
}//end for
//set the new updated xml doc
SingleTask.currentTask.setDoc(doc);
return changed;
}
Class that implements custom namespace resolver:
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import org.w3c.dom.Document;
/**
*
* References:https://www.kdgregory.com/index.php?page=xml.xpath
*/
//custom NamespaceContext clss implementation
public class UniversalNamespaceResolver implements NamespaceContext
{
private String _prefix = "ns";
private String _namespaceUri=null;
private List<String> _prefixes = Arrays.asList(_prefix);
public UniversalNamespaceResolver(String namespaceResolver)
{
_namespaceUri = namespaceResolver;
}
#Override
#SuppressWarnings("rawtypes")
public Iterator getPrefixes(String uri)
{
if (uri == null)
throw new IllegalArgumentException("UniversalNamespaceResolver getPrefixes() URI may not be null");
else if (_namespaceUri.equals(uri))
return _prefixes.iterator();
else if (XMLConstants.XML_NS_URI.equals(uri))
return Arrays.asList(XMLConstants.XML_NS_PREFIX).iterator();
else if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(uri))
return Arrays.asList(XMLConstants.XMLNS_ATTRIBUTE).iterator();
else
return Collections.emptyList().iterator();
}
#Override
public String getPrefix(String uri)
{
if (uri == null)
throw new IllegalArgumentException("nsURI may not be null");
else if (_namespaceUri.equals(uri))
return _prefix;
else if (XMLConstants.XML_NS_URI.equals(uri))
return XMLConstants.XML_NS_PREFIX;
else if (XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(uri))
return XMLConstants.XMLNS_ATTRIBUTE;
else
return null;
}
#Override
public String getNamespaceURI(String prefix)
{
if (prefix == null)
throw new IllegalArgumentException("prefix may not be null");
else if (_prefix.equals(prefix))
return _namespaceUri;
else if (XMLConstants.XML_NS_PREFIX.equals(prefix))
return XMLConstants.XML_NS_URI;
else if (XMLConstants.XMLNS_ATTRIBUTE.equals(prefix))
return XMLConstants.XMLNS_ATTRIBUTE_NS_URI;
else
return null;
}
}
Console Output:
Selected path=[C:\Users\xyz\DocumentsIngest_LDD.xml, Ingest_LDD, Property_Maps, identifier, identifier1]
Constructed XPath: Query:/ns:Ingest_LDD/ns:Property_Maps/ns:identifier[text()='identifier1']
Attempt #4 (Without custom namespace resolver):
public boolean updateXmlData(JTree jTree, org.w3c.dom.Document doc, TreeNode parentNode, String oldValue, String newValue) throws XPathExpressionException {
System.out.println("Selected path=" + jTree.getSelectionPath().toString());
boolean changed = false;
// Locate the node(s) with xpath
String[] pathTockens = jTree.getSelectionPath().toString().split(",");
StringBuilder sb = new StringBuilder();
//loop to construct xpath query
for (int i = 0; i < pathTockens.length - 1; i++) {
if (i == 0) {
//do nothing
} else if (i == 1) {
sb.append("/" + pathTockens[i].trim());
} else if (i > 1 && i != pathTockens.length - 1) {
sb.append("/" + pathTockens[i].trim());
} else {
//sb.append("/" + pathTockens[i].trim());
}
}//end loop
sb.append("[text()=");
sb.append("'");
sb.append(oldValue);
sb.append("']");
System.out.println("Constructed XPath Query:" + sb.toString());
//new xpath instance
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
//WITHOUT CUSTOM NAMESPACE CONTEXT xpath.setNamespaceContext(new UniversalNamespaceResolver(SingleTask.currentTask.getXsdFile().getXsdNameSpace()));
NodeList nodes = (NodeList) xpath.evaluate(sb.toString(), doc, XPathConstants.NODESET);
//start for
Node node;
String val = null;
for (int idx = 0; idx < nodes.getLength(); idx++) {
if (nodes.item(idx).getAttributes() != null) {
node = nodes.item(idx).getAttributes().getNamedItem("value");
if (node != null) {
val = node.getNodeValue();
node.setNodeValue(val.replaceAll(oldValue, newValue));
changed = true;
break;
}//end if node is found
}
}//end for
//set the new updated xml doc
SingleTask.currentTask.setDoc(doc);
return changed;
}
Console Output:
Selected path=[C:\Users\anaim\Documents\XsdToXmlFiles\sampleIngest_LDD.xml, Ingest_LDD, Property_Maps, identifier, identifier1]
Constructed XPath Query:/Ingest_LDD/Property_Maps/identifier[text()='identifier1']
I actually manually wrote the XPath query online using (https://www.freeformatter.com/xpath-tester.html#ad-output)
Sorry, I cant provide the sample XMl, its way too large.
The manual XPath query was:
/Ingest_LDD/Property_Maps/identifier[text()='identifier1']
And the online tool successfully found the text & outputted:
Element='<identifier xmlns="http://pds.nasa.gov/pds4/pds/v1">identifier1</identifier>'
Therefore my code under attempt #4 & the query should work?
UPDATED ATTEMPTS AFTER USER INPUT:
Attempt #5 (based on response from user, namespace aware = TRUE ), relevant code is below
factory.setNamespaceAware(true);
doc = dBuilder.parse(xmlFile);
if (doc!=null)
{
//***NOTE program comes meaning doc is NOT null, however inspecting it shows [#document: null]
doc.getDocumentElement().normalize();
}
xpath.setNamespaceContext(new UniversalNamespaceResolver(SingleTask.currentTask.getXsdFile().getXsdNameSpace()));
Node node = (Node) xpath.evaluate(sb.toString(), doc, XPathConstants.NODE);
if (node!=null)
{
// See https://docs.oracle.com/javase/9/docs/api/org/w3c/dom/Node.html#setTextContent-java.lang.String-
node.setTextContent(newValue);
SingleTask.currentTask.setDoc(doc);
}
Output (again unable to find the node):
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest_LDD.xml, Ingest_LDD, name, name1]
Constructed XPath Query:/Ingest_LDD/name[text()='name1']
Error changing value!
Attempt #6 (based on response from user, namespace aware = FALSE )
factory.setNamespaceAware(false);
doc = dBuilder.parse(xmlFile);
if (doc!=null)
{
//***NOTE program comes meaning doc is NOT null, however inspecting it shows [#document: null]
doc.getDocumentElement().normalize();
}
//COMMENTED OUT , SINCE NAMESPACE AWARE FALSE xpath.setNamespaceContext(new UniversalNamespaceResolver(SingleTask.currentTask.getXsdFile().getXsdNameSpace()));
Node node = (Node) xpath.evaluate(sb.toString(), doc, XPathConstants.NODE);
if (node!=null)
{
// See https://docs.oracle.com/javase/9/docs/api/org/w3c/dom/Node.html#setTextContent-java.lang.String-
node.setTextContent(newValue);
SingleTask.currentTask.setDoc(doc);
}
Output (again unable to find the node):
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest_LDD.xml, Ingest_LDD, name, name1]
Constructed XPath Query:/Ingest_LDD/name[text()='name1']
Error changing value!
The document that is being returned as [#document: null] may not actually be the problem according to(DocumentBuilder.parse(InputStream) returns null)???
Attempt # 7 (namespace aware FALSE)
Also NamedNodeMap namedNodeMap = doc.getAttributes(); returns NULL.
However, Node firstChild = doc.getFirstChild() actually returns valid element!
I passed firstChild to xpath.evaluate(sb.toString(), firstChild , XPathConstants.NODE); but again the node desired node was not found.
Output (again unable to find the node):
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest_LDD.xml, Ingest_LDD, name, name1]
Constructed XPath Query:/Ingest_LDD/name[text()='name1']
Error changing value!
Attempt # 8 (namespace aware false)
I also attemped to pass in doc.getChildNodes() to xpath.evaluate() rather than doc object as final desperate atteempt, see snippet below.
if (doc != null) {
NodeList nodes = (NodeList) xpath.evaluate(sb.toString(), doc.getChildNodes(), XPathConstants.NODESET);
String val = null;
Node node;
for (int idx = 0; idx < nodes.getLength(); idx++) {
if (nodes.item(idx).getAttributes() != null) {
node = nodes.item(idx).getAttributes().getNamedItem("value");
if (node != null) {
val = node.getNodeValue();
node.setNodeValue(val.replaceAll(oldValue, newValue));
changed = true;
break;
}//end if node is found
}
}//end for
}
Output (again unable to find the node):
Selected path=[C:\Users\xyz\Documents\XsdToXmlFiles\sampleIngest_LDD.xml, Ingest_LDD, name, name1]
Constructed XPath Query:/Ingest_LDD/name[text()='name1']
Error changing value!

For the test you performed online it seems your XML file contains namespace information.
With that information in mind, probably both of your examples of XPath evaluation would work, or not, dependent on several things.
For example, you probably can use the attempt #4, and the XPath evaluation will be adequate, if you are using a non namespace aware (the default) DocumentBuilderFactory and you o not provide any namespace information in your XPath expression.
But the XPath evaluation in attempt #3 can also be adequate if the inverse conditions apply, i.e., you are using a namespace aware DocumentBuilderFactory:
DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
f.setNamespaceAware(true);
and you provide namespace information in your XPath expression and a convenient NamespaceContext implementation. Please, see this related SO question and this great IBM article.
Be aware that you do not need to provide the same namespace prefixes in both your XML file an XPath expression, the only requirement is namespace awareness in XML (XPath is always namespace aware).
Given that conditions, I think you can apply both approaches.
In any case, I think the problem may have to do with the way you are dealing with the actual text replacement: you are looking for a node with a value attribute, and reviewing the associated XML Schema this attribute does not exist.
Please, consider instead the following approach:
// You can get here following both attempts #3 an #4
Node node = (Node) xpath.evaluate(sb.toString(), doc, XPathConstants.NODE);
boolean changed = node != null;
if (changed) {
// See https://docs.oracle.com/javase/9/docs/api/org/w3c/dom/Node.html#setTextContent-java.lang.String-
node.setTextContent(newValue);
SingleTask.currentTask.setDoc(doc);
}
return changed;
This code assumes that the selected node will be unique to work properly.
Although probably unlike, please, be aware that the way in which you are constructing the XPath selector from the JTree model can provide duplicates if you define the same value for repeated elements in your XML. Consider the elements external_id_property_maps in your screenshot, for instance.
In order to avoid that, you can take a different approach when constructing the XPath selector.
It is unclear for your code snippet, but probably you are using DefaultMutableTreeNode as the base JTree node type. If that is the case, you can associate with every node the arbitrary information you need to.
Consider for example the creation of a simple POJO with two fields, the name of the Element that the node represents, and some kind of unique, generated, id, let's name it uid or uuid to avoid confusion with the id attribute, most likely included in the original XML document.
This uid should be associated with every node. Maybe you can take advantage of the JTree creation process and, while processing every node of your XML file, include this attribute as well, generated using the UUID class, for example.
Or you can apply a XSLT transform to the original XML document prior to representation:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:attribute name="uid">
<xsl:value-of select="generate-id(.)"/>
</xsl:attribute>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
With this changes, your XPath query should looks like:
/ns:Ingest_LDD[#uid='w1ab1']/ns:Property_Maps[#uid='w1ab1a']/ns:identifier[#uid='w1ab1aq']
Of course, it will be necessary to modify the code devoted to the construction of this expression from the selected path of the JTree to take the custom object into account.
You can take this approach to the limit and use a single selector based solely in this uid attribute, although I think that for performance reasons it will be not appropriate:
//*[#uid='w1ab1']
Putting it all together, you can try something like the following.
Please, consider this XML file:
<?xml version="1.0" encoding="utf-8" ?>
<Ingest_LDD xmlns="http://pds.nasa.gov/pds4/pds/v1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1700.xsd">
<!-- Please, forgive me, I am aware that the document is not XML Schema conformant,
only for exemplification of the default namespace -->
<Property_Maps>
<identifier>identifier1</identifier>
</Property_Maps>
</Ingest_LDD>
First, let's parse the document:
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
// As the XML contains namespace, let's configure the parser namespace aware
// This will be of relevance when evaluating XPath
builderFactory.setNamespaceAware(true);
DocumentBuilder builder = builderFactory.newDocumentBuilder();
// Parse the document from some source
Document document = builder.parse(...);
// See http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
document.getDocumentElement().normalize();
Now, create a JTree structure corresponding to the input XML file. First, let's create a convenient POJO to store the required tree node information:
public class NodeInformation {
// Node name
private String name;
// Node uid
private String did;
// Node value
private String value;
// Setters and getters
// Will be reused by DefaultMutableTreeNode
#Override
public String toString() {
return this.name;
}
}
Convert the XML file to its JTree counterpart:
// Get a reference to root element
Element rootElement = document.getDocumentElement();
// Create root tree node
DefaultMutableTreeNode rootTreeNode = getNodeInformation(rootElement);
// Traverse DOM
traverse(rootTreeNode, rootElement);
// Create tree and tree model based on the computed root tree node
DefaultTreeModel treeModel = new DefaultTreeModel(rootTreeNode);
JTree tree = new JTree(treeModel);
Where:
private NodeInformation getNodeInformation(Node childElement) {
NodeInformation nodeInformation = new NodeInformation();
String name = childElement.getNodeName();
nodeInformation.setName(name);
// Provide a new unique identifier for every node
String uid = UUID.randomUUID().toString();
nodeInformation.setUid(uid);
// Uhnn.... We need to associate the new uid with the DOM node as well.
// There is nothing wrong with it but mutating the DOM in this way in
// a method that should be "read-only" is not the best solution.
// It would be interesting to study the above-mentioned XSLT approach
chilElement.setAttribute("uid", uid);
// Compute node value
StringBuffer buffer = new StringBuffer();
NodeList childNodes = childElement.getChildNodes();
boolean found = false;
for (int i = 0; i < childNodes.getLength(); i++) {
Node node = childNodes.item(i);
if (node.getNodeType() == Node.TEXT_NODE) {
String value = node.getNodeValue();
buffer.append(value);
found = true;
}
}
if (found) {
nodeInformation.setValue(buffer.toString());
}
}
And:
// Finds all the child elements and adds them to the parent node recursively
private void traverse(DefaultMutableTreeNode parentTreeNode, Node parentXMLElement) {
NodeList childElements = parentXMLElement.getChildNodes();
for(int i=0; i<childElements.getLength(); i++) {
Node childElement = childElements.item(i);
if (childElement.getNodeType() == Node.ELEMENT_NODE) {
DefaultMutableTreeNode childTreeNode =
new DefaultMutableTreeNode
(getNodeInformation(childElement));
parentTreeNode.add(childTreeNode);
traverse(childTreeNode, childElement);
}
}
}
Although the NamespaceContext implementation you provided looks fine, please, at a first step, try something simpler, to minimize the possibility of error. See the provided implementation below.
Then, your updateXMLData method should looks like:
public boolean updateXmlData(JTree tree, org.w3c.dom.Document doc, TreeNode parentNode, String oldValue, String newValue) throws XPathExpressionException {
boolean changed = false;
TreePath selectedPath = tree.getSelectionPath();
int count = getPathCount();
StringBuilder sb = new StringBuilder();
NodeInformation lastNodeInformation;
if (count > 0) {
for (int i = 1; i < trp.getPathCount(); i++) {
DefaultMutableTreeNode treeNode = (DefaultMutableTreeNode) trp.getPathComponent(i);
NodeInformation nodeInformation = (NodeInformation) treeNode.getUserObject();
sb.append(String.format("/ns:%s[#uid='%s']", nodeInformation.getName(), nodeInformation.getUid());
lastNodeInformation = nodeInformation;
}
}
System.out.println("Constructed XPath Query:" + sb.toString());
// Although the `NamespaceContext` implementation you provided looks
// fine, please, at a first step, try something simpler, to minimize the
// possibility of error. For example:
NamespaceContext nsContext = new NamespaceContext() {
public String getNamespaceURI(String prefix) {
if (prefix == null) {
throw new IllegalArgumentException("No prefix provided!");
} else if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return "http://pds.nasa.gov/pds4/pds/v1";
} else if (prefix.equals("ns")) {
return "http://pds.nasa.gov/pds4/pds/v1";
} else {
return XMLConstants.NULL_NS_URI;
}
}
public String getPrefix(String namespaceURI) {
// Not needed in this context.
return null;
}
public Iterator getPrefixes(String namespaceURI) {
// Not needed in this context.
return null;
}
};
//new xpath instance
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
// As the parser is namespace aware, we can safely use XPath namespaces
xpath.setNamespaceContext(nsContext);
Node node = (Node) xpath.evaluate(sb.toString(), doc, XPathConstants.NODE);
boolean changed = node != null;
if (changed) {
// See https://docs.oracle.com/javase/9/docs/api/org/w3c/dom/Node.html#setTextContent-java.lang.String-
node.setTextContent(newValue);
SingleTask.currentTask.setDoc(doc);
// Probably the information has been updated in the node, but just in case:
lastNodeInformation.setValue(newValue);
}
return changed;
}
The generated XPath expression will look like:
/ns:Ingest_LDD[#uid='w1ab1']/ns:Property_Maps[#uid='w1ab1a']/ns:identifier[#uid='w1ab1aq']
If you want to use the default namespace, you can also try with:
/:Ingest_LDD[#uid='w1ab1']/:Property_Maps[#uid='w1ab1a']/:identifier[#uid='w1ab1aq']
Please, be aware that I haven't tested the code, but I hope you get the idea.
Just for clarification, in order to give you a proper answer, as mentioned before, if you now remove or comment this line of code:
builderFactory.setNamespaceAware(true);
Then, the XPath expression:
/ns:Ingest_LDD[#uid='w1ab1']/ns:Property_Maps[#uid='w1ab1a']/ns:identifier[#uid='w1ab1aq']
will no longer find the required node. Now, if you remove the namespace information from the XPath expression:
/Ingest_LDD[#uid='w1ab1']/Property_Maps[#uid='w1ab1a']/identifier[#uid='w1ab1aq']
It will find the right node again.

how to retrieve XML data using XPath which has a default namespace in Java?

I've come across and problem that I've looked up on stack overflow but none of the solutions seems to solve the problem for me.
I'm retrieving XML data from Yahoo and it comes back as below (truncated for brevity's sake).
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<fantasy_content xmlns="http://fantasysports.yahooapis.com/fantasy/v2/base.rng" xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" copyright="Data provided by Yahoo! and STATS, LLC" refresh_rate="31" time="55.814027786255ms" xml:lang="en-US" yahoo:uri="http://fantasysports.yahooapis.com/fantasy/v2/league/328.l.108462/settings">
<league>
<league_key>328.l.108462</league_key>
<league_id>108462</league_id>
<draft_status>postdraft</draft_status>
</league>
</fantasy_content>
I've been having a problem getting XPath to retrieve any elements so I've written a unit test to try to resolve it and it looks like:
final File file = new File("league-settings.xml");
javax.xml.parsers.DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setNamespaceAware(true);
javax.xml.parsers.DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
org.w3c.dom.Document doc = dBuilder.parse(file);
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new YahooNamespaceContext());
final String expression = "yfs:league";
final XPathExpression expr = xPath.compile(expression);
Object nodes = expr.evaluate(doc, XPathConstants.NODESET);
assert(nodes instanceof NodeList);
NodeList leagueNodes = (NodeList)nodes;
int leaguesLength = leagueNodes.getLength();
assertEquals(leaguesLength, 1);
The YahooNamespaceContext class I created to map the namespaces looks as follows:
public class YahooNamespaceContext implements NamespaceContext {
public static final String YAHOO_NS = "http://www.yahooapis.com/v1/base.rng";
public static final String DEFAULT_NS = "http://fantasysports.yahooapis.com/fantasy/v2/base.rng";
public static final String YAHOO_PREFIX = "yahoo";
public static final String DEFAULT_PREFIX = "yfs";
private final Map<String, String> namespaceMap = new HashMap<String, String>();
public YahooNamespaceContext() {
namespaceMap.put(DEFAULT_PREFIX, DEFAULT_NS);
namespaceMap.put(YAHOO_PREFIX, YAHOO_NS);
}
public String getNamespaceURI(String prefix) {
return namespaceMap.get(prefix);
}
public String getPrefix(String uri) {
throw new UnsupportedOperationException();
}
public Iterator<String> getPrefixes(String uri) {
throw new UnsupportedOperationException();
}
}
Any help with people with more experience with XML namespaces or debugging tips into Xpath compilation/evaluation would be appreciated.

If the problem is that you're getting zero as the length of the result nodelist, have you tried changing
final String expression = "yfs:league";
to
final String expression = "//yfs:league";
?
It appears that the context for evaluating your XPath expressions, doc, is the root node of the document. dBuilder.parse(file) returns the document root node, not the outermost element (a.k.a. document element). Remember, in XPath, a root node is not an element. So doc
is not the yfs:fantasy_content element node but is its (invisible) parent.
In that context, the XPath expression "yfs:league" will only select an element that is a direct child of that root node, of which there is no yfs:league -- only yfs:fantasy_content.

The XPath expression yfs:league is equivalent to child::yfs:league. It means: find direct children nodes (not descendants) of doc with the specified local name (league) and namespace URI (http://fantasysports.yahooapis.com/fantasy/v2/base.rng).
You must take into account the outermost element (fantasy_content) or search for descendant instead of child nodes.
Replacing
final String expression = "yfs:league";
with
final String expression = "yfs:fantasy_content/yfs:league";
or with
final String expression = "//yfs:league";
will solve the problem.

using xpath in java to go through this list?

i have an xml file that contains lots of different nodes. some in particularly are nested like this:
<emailAddresses>
<emailAddress>
<value>sambj1981#gmail.com</value>
<typeSource>WORK</typeSource>
<typeUser></typeUser>
<primary>false</primary>
</emailAddress>
<emailAddress>
<value>sambj#hotmail.co.uk</value>
<typeSource>HOME</typeSource>
<typeUser></typeUser>
<primary>true</primary>
</emailAddress>
</emailAddresses>
From the above node, what i want to do is go through each and get the values inside it(value, typeSource, typeUser etc) and put them in a POJO.
i tried to see if i can use this xpath expression "//emailAddress" but it doesnt return me the tags inside inside it. maybe i am doing it wrong. i am pretty new to using xpath.
i could do something like this:
//emailAddress/value | //emailAddress/typeSource | .. but doing that will list all elements values together if im not mistaken leaving me to work out when i have finished reading from a specific emailAddress tag and going to the next emailAddress tag.
well to sum up my needs i basically want this to be returned similar to how you would return results from a bog standard sql query that returns results in a row. i.e. if your sql query produces 10 emailAddress, it will return each emailAddress in a row and i can simply iterate over "each emailAddress" and get the appropriate value based on the colunm name or index.

No,
//emailAddress
doesn't return the tags inside, that is correct. What it does return is a NodeList/NodeSet. To actually get the values you can do something like this:
String emailpath = "//emailAddress";
String emailvalue = ".//value";
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xpath = xPathFactory.newXPath();
Document document;
public XpathStuff(String file) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = docFactory.newDocumentBuilder();
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
document = builder.parse(bis);
NodeList nodeList = getNodeList(document, emailpath);
for(int i = 0; i < nodeList.getLength(); i++){
System.out.println(getValue(nodeList.item(i), emailvalue));
}
bis.close();
}
public NodeList getNodeList(Document doc, String expr) {
try {
XPathExpression pathExpr = xpath.compile(expr);
return (NodeList) pathExpr.evaluate(doc, XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return null;
}
//extracts the String value for the given expression
private String getValue(Node n, String expr) {
try {
XPathExpression pathExpr = xpath.compile(expr);
return (String) pathExpr.evaluate(n,
XPathConstants.STRING);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return null;
}
Maybe I should point out that when iterating over the Nodelist, in .//values the first dot means the current context. Without the dot you would get the first node all the time.

//emailAddress/*
will get these nodes in the document order.
It depends on how you want to iterate through the nodes. We do all our XML using XOM (http://www.xom.nu/) which is an easy reliable Java package. It's possible to write your own strategy using XOM calls.

If you use XStream you can set it up quite easily. Like so:
#XStreamAlias( "EmailAddress" )
public class EmailAddress {
#XStreamAlias()
private String value;
#XStreamAlias()
private String typeSource;
#XStreamAlias()
private String typeUser;
#XStreamAlias()
private boolean primary;
// ... the rest omitted for brevity
}
You then marshal & unmarshal quite simply like so:
XStream xstream = new XStream();
xstream.processAnnotations( EmailAddress.class );
xstream.toXML( /* Object value here */ emailAddress );
xstream.fromXML( /* String xml value here */ "" );
IDK if you have to use XPath or not, but if not I'd consider an out of the box solution like this.

I am totally aware this is not what you were asking for, but may consider using jibx. This is a tool for human-readable XML to POJO mapping.
So I believe you could generate mapping for your email structure in a quick way and let the jibx do the work for you.

Java/DOM: Get the XML content of a node

I am parsing a XML file in Java using the W3C DOM.
I am stuck at a specific problem, I can't figure out how to get the whole inner XML of a node.
The node looks like that:
<td><b>this</b> is a <b>test</b></td>
What function do I have to use to get that:
"<b>this</b> is a <b>test</b>"

I know this was asked long ago but for the next person searching (was me today), this works with JDOM:
JDOMXPath xpath = new JDOMXPath("/td");
String innerXml = (new XMLOutputter()).outputString(xpath.selectNodes(document));
This passes a list of all child nodes into outputString, which will serialize them out in order.

You have to use the transform/xslt API using your <b> node as the node to be transformed and put the result into a new StreamResult(new StringWriter());
. See how-to-pretty-print-xml-from-java

What do you say about this ?
I had same problem today on android, but i managed to make simple "serializator"
private String innerXml(Node node){
String s = "";
NodeList childs = node.getChildNodes();
for( int i = 0;i<childs.getLength();i++ ){
s+= serializeNode(childs.item(i));
}
return s;
}
private String serializeNode(Node node){
String s = "";
if( node.getNodeName().equals("#text") ) return node.getTextContent();
s+= "<" + node.getNodeName()+" ";
NamedNodeMap attributes = node.getAttributes();
if( attributes!= null ){
for( int i = 0;i<attributes.getLength();i++ ){
s+=attributes.item(i).getNodeName()+"=\""+attributes.item(i).getNodeValue()+"\"";
}
}
NodeList childs = node.getChildNodes();
if( childs == null || childs.getLength() == 0 ){
s+= "/>";
return s;
}
s+=">";
for( int i = 0;i<childs.getLength();i++ )
s+=serializeNode(childs.item(i));
s+= "</"+node.getNodeName()+">";
return s;
}

er... you could also call toString() and just chop off the beginning and end tags, either manually or using regexps.
edit: toString() doesn't do what I expected. Pulling out the O'Reilly Java & XML book talks about the Load and Save module of Java DOM.
See in particular the LSSerializer which looks very promising. You could either call writeToString(node) and chop off the beginning and end tags, as I suggested, or try to use LSSerializerFilter to not print the top node tags (not sure if that would work; I admit I've never used LSSerializer before.)
Reading the O'Reilly book seems to indicate doing something like this:
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS lsImpl =
(DOMImplementationLS)registry.getDOMImplementation("LS");
LSSerializer serializer = lsImpl.createLSSerializer();
String nodeString = serializer.writeToString(node);

node.getTextContent();
You ought to be using JDom of Dom4J to handle nodes, if for no other reasons, to handle whitespace correctly.

To remove unneccesary tags probably such code can be used:
DOMConfiguration config = serializer.getDomConfig();
config.setParameter("canonical-form", true);
But it will not always work, because "canonical-form=true" is optional

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Match specific html attribute values - java

Related

Match set of simple xpaths with SAX

Attempting to use Java & Xpath to updated the value of any selected node inside an XML doc

how to retrieve XML data using XPath which has a default namespace in Java?

using xpath in java to go through this list?

Java/DOM: Get the XML content of a node

Categories

Resources