Java - XML Parsing using XPATH

I have XML:
<Table>
  <Row ss:Index="74" ss:AutoFitHeight="0" ss:Height="14">
    <Cell ss:Index="1" ss:MergeAcross="3" ss:StyleID="s29">
      <ss:Data ss:Type="Number" xmlns="http://www.w3.org/TR/REC-html40">
        0.00
      </ss:Data>
    </Cell>
    <Cell ss:Index="15" ss:MergeAcross="5" ss:StyleID="s29">
      <ss:Data ss:Type="Number" xmlns="http://www.w3.org/TR/REC-html40">
        4.57
      </ss:Data>
    </Cell>
  </Row>
</Table>
Here is the code I use to extract the content, e.g. "0.00", based on the row index and cell index:
public static String getCellValueNum(String filename, int rowIdx, int colIdx) {
    // search for Table element anywhere in the source
    String tableElementPattern = "//*[name()='Table']";
    // search for Row element with given number
    String rowPattern = String.format("/*[name()='Row' and @ss:Index='%d']", rowIdx);
    // search for Cell element with given column number
    String cellPattern = String.format("/*[name()='Cell' and @ss:Index='%d']", colIdx);
    // search for the element that has the ss:Type="Number" attribute, then for a child element with text under it, and get that text
    String cellStringContent = "/*[@ss:Type='Number']/*[text()]/text()";
    String completePattern = tableElementPattern + rowPattern + cellPattern + cellStringContent;
    try (FileReader reader = new FileReader(filename)) {
        XPath xPath = getXpathProcessor();
        Node n = (Node) xPath.compile(completePattern)
                .evaluate(new InputSource(reader), XPathConstants.NODE);
        if (n.getNodeType() == Node.TEXT_NODE) {
            return n.getNodeValue().trim();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}

private static XPath getXpathProcessor() {
    // this is where the custom implementation of NamespaceContext is used
    NamespaceContext context = new NamespaceContextMap(
            "html", "http://www.w3.org/TR/REC-html40",
            //"xsl", "http://www.w3.org/1999/XSL/Transform",
            "o", "urn:schemas-microsoft-com:office:office",
            "x", "urn:schemas-microsoft-com:office:excel",
            "ss", "urn:schemas-microsoft-com:office:spreadsheet");
    XPath xpath = XPathFactory.newInstance().newXPath();
    xpath.setNamespaceContext(context);
    return xpath;
}
It works perfectly fine when ss:Type='String', but when ss:Type='Number' it throws:
java.lang.NullPointerException
at XpathBill.getCellValueNum(XpathBill.java:55)
at XpathBill.main(XpathBill.java:100)
I think the problem is here:
if (n.getNodeType() == Node.TEXT_NODE)
I suspect it should be something other than TEXT_NODE. I tried the other Node type constants, but none of them worked.
Please help.
Thank you!
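A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the sample above, the Number cell's ss:Data element holds its text directly, with no child element. The trailing /*[text()]/text() step would then match nothing, evaluate would return null, and n.getNodeType() would throw the NullPointerException. The example below asks XPath for the string value of the ss:Data element instead, which is an empty string when nothing matches. The class name NumberCellSketch, the inline NamespaceContext (standing in for NamespaceContextMap), and the sample document with the ss prefix declared on Table are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NumberCellSketch {
    static final String SS_URI = "urn:schemas-microsoft-com:office:spreadsheet";

    // Hypothetical sample: the ss prefix is declared on Table so the fragment parses on its own.
    static final String SAMPLE =
        "<Table xmlns:ss=\"" + SS_URI + "\">"
        + "<Row ss:Index=\"74\">"
        + "<Cell ss:Index=\"1\"><ss:Data ss:Type=\"Number\""
        + " xmlns=\"http://www.w3.org/TR/REC-html40\">\n0.00\n</ss:Data></Cell>"
        + "<Cell ss:Index=\"15\"><ss:Data ss:Type=\"Number\""
        + " xmlns=\"http://www.w3.org/TR/REC-html40\">\n4.57\n</ss:Data></Cell>"
        + "</Row></Table>";

    // Returns the trimmed cell text, or null when no matching Number cell exists.
    static String cellNumber(String xml, int rowIdx, int colIdx) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed so the ss: prefix can be resolved
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "ss".equals(prefix) ? SS_URI : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });

            // Stop at the ss:Data element; its string value is the text it contains.
            String expr = String.format(
                "//*[local-name()='Table']/*[local-name()='Row' and @ss:Index='%d']"
                + "/*[local-name()='Cell' and @ss:Index='%d']/*[@ss:Type='Number']",
                rowIdx, colIdx);
            String value = ((String) xpath.evaluate(expr, doc, XPathConstants.STRING)).trim();
            return value.isEmpty() ? null : value;
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cellNumber(SAMPLE, 74, 1));
        System.out.println(cellNumber(SAMPLE, 74, 15));
        System.out.println(cellNumber(SAMPLE, 74, 2)); // no such cell
    }
}
```

If the existing NODE-based code is kept instead, checking n for null before calling n.getNodeType() avoids the exception even when the expression matches nothing.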

Related

Merging same elements in JSoup

I have an HTML string like
<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>
I want to collapse adjacent tags that are identical and belong together. In the sample above I want to get
<b>tester</b>
since both tags have the same name and no further attributes or styles. The span tags should stay as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup.
Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}
But I'm not clear on how to look ahead (I guess with something like nextSibling), and then how do I collapse the elements?
Or is there a simple regex-based merge?
I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.

My approach would be like this. Comments in the code
public class StackOverflow60704600 {
public static void main(final String[] args) throws IOException {
Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}
output:
<html>
<head></head>
<body>
<b>tester</b>
<span class="ab">continue</span>
<span> without</span>
</body>
</html>
One more note on why I used the loop while (nextSibling.childNodes().size() > 0). A for loop or an iterator can't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children shift. It may not be visible here, but the problem appears when you try to merge: <b>test</b><b>er<a>123</a></b>
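The same move-on-append behavior exists in the JDK's org.w3c.dom API, which makes it easy to try without Jsoup on the classpath. The sketch below uses that API rather than Jsoup, so it is an analogy and not the answer's code; the class and method names (MoveChildrenSketch, mergeInto, mergeFirstTwo) are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class MoveChildrenSketch {
    // Moves every child of 'source' to the end of 'target', then detaches 'source'.
    static void mergeInto(Element target, Element source) {
        // appendChild re-parents the node, so source's first child changes on every pass;
        // looping until no child is left avoids any index or iterator bookkeeping.
        while (source.getFirstChild() != null) {
            target.appendChild(source.getFirstChild());
        }
        source.getParentNode().removeChild(source);
    }

    // Parses the markup, merges the root's second child into its first, and returns the root.
    static Element mergeFirstTwo(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            Element first = (Element) root.getFirstChild();
            Element second = (Element) first.getNextSibling();
            mergeInto(first, second);
            return root;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Element root = mergeFirstTwo("<r><b>test</b><b>er<a>123</a></b></r>");
        System.out.println(root.getChildNodes().getLength());
        System.out.println(root.getFirstChild().getTextContent());
    }
}
```

With the sample input, the root ends up with a single b element whose text content is "tester123", and the nested a element is carried over along with the text.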

I tried to update the code from @Krystian G, but my edit was rejected, so I'm posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags, e.g.
<span> no class but further</span> (in)valid <span>spanning</span> would result in
<span> no class but furtherspanning</span> (in)valid
Therefore the corrected code looks like:
public class StackOverflow60704600 {
public static void main(final String[] args) throws IOException {
String test1="<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
String test2="<b>test</b><b>er<a>123</a></b>";
String test3="<span> no class but further</span> <span>spanning</span>";
String test4="<span> no class but further</span> (in)valid <span>spanning</span>";
Document doc = Jsoup.parse(test1);
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
Node nextElement = element.nextSibling();
// if the next Element is a TextNode but has only space ==> we need to preserve the
// spacing
boolean addSpace = false;
if (nextElement != null && nextElement instanceof TextNode) {
String content = nextElement.toString();
if (!content.isBlank()) {
// the next element has some content
continue;
} else {
addSpace = true;
}
}
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of
// attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
if (addSpace) {
// since we have had some space previously ==> preserve it and add it once
if (siblingChildNode instanceof TextNode) {
TextNode textNode = (TextNode) siblingChildNode;
textNode.text(" " + textNode.getWholeText());
} else {
element.appendChild(new TextNode(" "));
}
// only the first moved child needs the separating space
addSpace = false;
}
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}

Unable to parse element attribute with XOM

I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored as an attribute of the <img> element, as seen below.
<rss version="2.0">
<channel>
<item>
<title>Decision Paralysis</title>
<link>https://xkcd.com/1801/</link>
<description>
<img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
</description>
<pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
<guid>https://xkcd.com/1801/</guid>
</item>
</channel>
</rss>
Calling .getFirstChildElement("img") returns null, so my code crashes when I try to read the src attribute. Why is my program failing to read the <img> element, and how can I read it properly?
import nu.xom.*;
public class RSSParser {
public static void main(String[] args) {
try {
Builder parser = new Builder();
Document doc = parser.build ( "https://xkcd.com/rss.xml" );
Element rootElement = doc.getRootElement();
Element channelElement = rootElement.getFirstChildElement("channel");
Elements itemList = channelElement.getChildElements("item");
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
Element descElement = item.getFirstChildElement("description");
Element imgElement = descElement.getFirstChildElement("img");
// Crashes with NullPointerException
String imgSrc = imgElement.getAttributeValue("src");
}
}
catch (Exception error) {
error.printStackTrace();
System.exit(1);
}
}
}

There is no img element in the item. Try
if (imgElement != null) {
String imgSrc = imgElement.getAttributeValue("src");
}
What the item contains is this:
<description>&lt;img
src="http://imgs.xkcd.com/comics/us_state_names.png"
title="Technically DC isn't a state, but no one is too
pedantic about it because they don't want to disturb the snakes
."
alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." /&gt;
</description>
That's not an img element. It's plain text: the markup is escaped, so the XML parser sees only character data inside <description>.

I managed to come up with a somewhat hacky solution using regex and pattern matching.
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
String descString = item.getFirstChildElement("description").getValue();
// Parse image URL (hacky)
String imgSrc = "";
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
Matcher matcher = pattern.matcher(descString);
if (matcher.find()) {
imgSrc = descString.substring( matcher.start()+5, matcher.end()-1 );
}
}
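A capture group can replace the start()+5 and end()-1 offset arithmetic above. This is a minimal sketch of that variation; the class name ImgSrcSketch and the helper firstSrc are hypothetical, and like the original it assumes the attribute is written with double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcSketch {
    // Group 1 captures everything between the quotes of the first src attribute.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]*)\"");

    // Returns the first src value in the text, or an empty string if none is found.
    static String firstSrc(String description) {
        Matcher matcher = SRC.matcher(description);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        String desc = "<img src=\"https://imgs.xkcd.com/comics/decision_paralysis.png\"/>";
        System.out.println(firstSrc(desc));
    }
}
```

Compiling the pattern once as a constant also avoids recompiling it for every item in the loop.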

Reading XML tags in java, code optimization

I have a recursive function that reads the tags in an XML file. Below is the code:
private void readTag(org.w3c.dom.Node item, String histoTags, String fileName, Hashtable<String, String> tagsInfos) {
try {
if (item.getNodeType() == Node.ELEMENT_NODE) {
NodeList itemChilds = item.getChildNodes();
for (int i=0; i < itemChilds.getLength(); i++) {
org.w3c.dom.Node itemChild = itemChilds.item(i);
readTag(itemChild, histoTags + "|" + item.getNodeName(), fileName, tagsInfos);
}
}
else if (item.getNodeType() == Node.TEXT_NODE) {
tagsInfosSoft.put(histoTags, item.getNodeValue());
}
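} catch (Exception e) {
// the handler was not shown in the original post; printing the stack trace is a placeholder
e.printStackTrace();
}
}
This function takes some time to execute. The XML it reads has this format: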
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Mouvement>
<Com>
<IdCom>32R01000000772669473</IdCom>
<RefCde>32R</RefCde>
<Edit>0</Edit>
</Com>
</Mouvement>
</Document>
Is there any way of optimizing this code in Java?

Two optimizations; I don't know how much they will help:
Don't use getChildNodes(). Use getFirstChild() and getNextSibling().
Reuse a single StringBuilder instead of creating a new one for every element (which histoTags + "|" + item.getNodeName() does implicitly).
Also be aware that the text content of an element node may be split across multiple TEXT and CDATA nodes.
Your code will also work better if it operates on elements, not nodes.
private static void readTag(Element elem, StringBuilder histoTags, String fileName, Hashtable<String, String> tagsInfos) {
int histoLen = histoTags.length();
CharSequence textContent = null;
boolean hasChildElement = false;
for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
histoTags.append('|').append(child.getNodeName());
readTag((Element)child, histoTags, fileName, tagsInfos);
histoTags.setLength(histoLen);
hasChildElement = true;
break;
case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
//uncomment to test: System.out.println(histoTags + ": \"" + child.getTextContent() + "\"");
if (textContent == null)
// Optimization: Don't copy to a StringBuilder if only one text node will be found
textContent = child.getTextContent();
else if (textContent instanceof StringBuilder)
// We already switched to a StringBuilder, so keep collecting text from further nodes
((StringBuilder)textContent).append(child.getTextContent());
else
// Second text node found: now we need a StringBuilder to collect text from multiple nodes
textContent = new StringBuilder(textContent).append(child.getTextContent());
break;
default:
// ignore all others
}
}
if (textContent != null) {
String text = textContent.toString();
// Suppress pure whitespace content on elements with child elements, i.e. structural whitespace
if (! hasChildElement || ! text.trim().isEmpty())
tagsInfos.put(histoTags.toString(), text);
}
}
Test
String xml = "<root>\n" +
" <tag>hello <![CDATA[world]]> Foo <!-- comment --> Bar</tag>\n" +
"</root>\n";
Element docElem = DocumentBuilderFactory.newInstance()
.newDocumentBuilder()
.parse(new InputSource(new StringReader(xml)))
.getDocumentElement();
Hashtable<String, String> tagsInfos = new Hashtable<>();
readTag(docElem, new StringBuilder(docElem.getNodeName()), "fileName", tagsInfos);
System.out.println(tagsInfos);
Output (with print uncommented)
root: "
"
root|tag: "hello "
root|tag: "world"
root|tag: " Foo "
root|tag: " Bar"
root: "
"
{root|tag=hello world Foo Bar}
See how splitting the text inside the <tag> node using CDATA and comments caused the DOM node to contain multiple TEXT/CDATA child nodes.
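If only leaf elements carry text, Element.getTextContent() already concatenates the TEXT and CDATA children and skips comments, so the manual collection may not be needed. The sketch below relies on that assumption and is not the answer's code: it drops any text on elements that also have child elements, and it records an empty string for empty leaf elements. The class name LeafTextSketch and the choice of LinkedHashMap are illustrative.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LeafTextSketch {
    // Records path -> text for every element that has no child elements.
    static void collect(Element elem, StringBuilder path, Map<String, String> out) {
        int pathLen = path.length();
        boolean hasChildElement = false;
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                path.append('|').append(child.getNodeName());
                collect((Element) child, path, out);
                path.setLength(pathLen); // restore the shared builder for the next sibling
            }
        }
        if (!hasChildElement) {
            // getTextContent joins TEXT and CDATA children and ignores comments
            out.put(path.toString(), elem.getTextContent());
        }
    }

    // Parses the XML string and returns the collected path -> text entries.
    static Map<String, String> read(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Map<String, String> out = new LinkedHashMap<>();
            collect(root, new StringBuilder(root.getNodeName()), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read("<root>\n  <tag>a<![CDATA[b]]><!-- c -->d</tag>\n</root>"));
    }
}
```

For the sample input, the map holds the single entry root|tag with the value "abd".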

Empty / Null Nodes returned from getChildNodes

I'm trying to parse the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<docusign-cfg>
<tagConfig>
<tags>
<approve>approve</approve>
<checkbox>checkbox</checkbox>
<company>company</company>
<date>date</date>
<decline>decline</decline>
<email>email</email>
<emailAddress>emailAddress</emailAddress>
<envelopeID>envelopeID</envelopeID>
<firstName>firstName</firstName>
<lastName>lastName</lastName>
<number>number</number>
<ssn>ssn</ssn>
<zip>zip</zip>
<signHere>signHere</signHere>
<checkbox>checkbox</checkbox>
<initialHere>initialHere</initialHere>
<dateSigned>dateSigned</dateSigned>
<fullName>fullName</fullName>
</tags>
</tagConfig>
</docusign-cfg>
I want to read either the name or content of each tag in the <tags> tag. I can do so with the following code:
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error ocurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
String[] tags = new String[childNodes.getLength()];
System.out.println(tags.length);
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
}
}
return tags;
}
After some searching I found that parsing the document this way also reads the whitespace between tags as child nodes. In this case the whitespace runs are treated as children of <tags>.
My output:
37
null
approve
null
checkbox
null
company
null
date
null
decline
null
email
null
emailAddress
null
envelopeID
null
firstName
null
lastName
null
number
null
ssn
null
zip
null
signHere
null
checkbox
null
initialHere
null
dateSigned
null
fullName
null
37 is the number of nodes it found in <tags>
Everything below 37 is the content of the tag array.
How are these null elements being added to the tag array despite my checking for null?

I think this is caused by how the tags array is indexed. The if check skips the non-element nodes, but i still advances, so the skipped slots keep their default value of null. Use a separate index for the tags array:
int j = 0;
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[j++] = content;
}
}
Since you are omitting some of the child nodes, sizing the array by the total number of child nodes wastes memory and leaves trailing null entries. You can use a List instead. If you need a String array, you can convert the list to one afterwards.
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error ocurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
List<String> tags = new ArrayList<String>();
for(int i = 0; i < childNodes.getLength(); i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags.add(content);
}
}
String[] tagsArray = tags.toArray(new String[tags.size()]);
return tagsArray;
}

The elements of the tags array default to null.
So it is not that an element becomes null; it is simply left as null.
To prove this to yourself, add an else block like this:
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
} else {
tags[i] = "Foo Bar";
}
You should now see 'Foo Bar' instead of null.
A better solution would be to use an ArrayList and append the tags to it instead of using an array. Then you do not need to track indexes, so there is less chance of this kind of bug.
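Another option, not taken from the answers above, is to let XPath select only the element children by ending the path with /*, so whitespace text nodes never reach the loop. This is a minimal sketch under that approach; the class name TagNamesSketch and the shortened sample document are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagNamesSketch {
    // Shortened, hypothetical version of the configuration file from the question.
    static final String SAMPLE =
        "<docusign-cfg>\n"
        + "  <tagConfig>\n"
        + "    <tags>\n"
        + "      <approve>approve</approve>\n"
        + "      <checkbox>checkbox</checkbox>\n"
        + "      <company>company</company>\n"
        + "    </tags>\n"
        + "  </tagConfig>\n"
        + "</docusign-cfg>\n";

    // Returns the names of the element children of <tags>, in document order.
    static List<String> tagNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The trailing /* matches element nodes only, never whitespace text nodes.
            NodeList elements = (NodeList) xpath.evaluate(
                    "/docusign-cfg/tagConfig/tags/*", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                names.add(elements.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not read tags", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tagNames(SAMPLE));
    }
}
```

The result can be converted with names.toArray(new String[0]) if the caller needs a String array.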

how can i get data out of DIV using html parser in java

I am using the Java HTML Parser to try to parse this line.
<td class=t01 align=right><div id="OBJ123" name=""></div></td>
But I am looking for the value I see in my web browser, which is a number. Can you help me get the value?
Please let me know if you need more details.
Thanks

From the documentation, all you have to do is find all of the DIV elements that also have an id of OBJ123 and take the first result's value.
NodeList nl = parser.parse(null); // you can also filter here
NodeList divs = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("DIV"),
new HasAttributeFilter("id", "OBJ123")));
if( divs.size() > 0 ) {
Tag div = (Tag) divs.elementAt(0);
String text = div.getText(); // this is the text of the div
}
UPDATE: if you're reading from the AJAX URL, you can use similar code:
// make some sort of constants for all the positions
final int OPEN_PRICE = 0;
final int HIGH_PRICE = 1;
final int LOW_PRICE = 2;
// ....
NodeList nl = parser.parse(null); // you can also filter here
NodeList values = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("TD"),
new HasAttributeFilter("class", "t1")));
if( values.size() > 0 ) {
Tag openPrice = (Tag) values.elementAt(OPEN_PRICE);
String openPriceValue = openPrice.getText(); // this is the text of the table cell
}
