Unable to parse element attribute with XOM - java

I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored as an attribute of the <img> element, as seen below.
<rss version="2.0">
<channel>
<item>
<title>Decision Paralysis</title>
<link>https://xkcd.com/1801/</link>
<description>
<img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
</description>
<pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
<guid>https://xkcd.com/1801/</guid>
</item>
</channel>
</rss>
Calling .getFirstChildElement("img") on the description returns null, so my code crashes with a NullPointerException when I try to read the src attribute. Why is my program failing to read the <img> element, and how can I read it properly?
import nu.xom.*;
public class RSSParser {
public static void main(String[] args) {
try {
Builder parser = new Builder();
Document doc = parser.build ( "https://xkcd.com/rss.xml" );
Element rootElement = doc.getRootElement();
Element channelElement = rootElement.getFirstChildElement("channel");
Elements itemList = channelElement.getChildElements("item");
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
Element descElement = item.getFirstChildElement("description");
Element imgElement = descElement.getFirstChildElement("img");
// Crashes with NullPointerException
String imgSrc = imgElement.getAttributeValue("src");
}
}
catch (Exception error) {
error.printStackTrace();
System.exit(1);
}
}
}

There is no img element in the item. Try
if (imgElement != null) {
String imgSrc = imgElement.getAttributeValue("src");
}
What the item contains is this:
<description>&lt;img
src="http://imgs.xkcd.com/comics/us_state_names.png"
title="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes."
alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." /&gt;
</description>
That's not an img element. It's plain text, so getFirstChildElement("img") has nothing to return and gives you null.
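To make the "plain text" point concrete, here is a minimal sketch. It assumes the feed entity-escapes the markup inside description (as the answer implies), and it swaps in the JDK's built-in DOM parser instead of XOM so that it runs without extra libraries. The class name EscapedImgSketch and the ITEM sample are hypothetical. The second parsing pass only works when the embedded markup happens to be well-formed XML, which is not guaranteed for HTML in general.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EscapedImgSketch {
    // Hypothetical item: the img markup is escaped, so it is character data.
    static final String ITEM =
        "<item><description>"
      + "&lt;img src=\"https://imgs.xkcd.com/comics/decision_paralysis.png\" /&gt;"
      + "</description></item>";

    // Parses a string as XML, rethrowing failures unchecked to keep the sketch short.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    static String imgSrc(String itemXml) {
        Element description = (Element) parse(itemXml)
                .getElementsByTagName("description").item(0);
        // The description has no child elements; the parser returns the
        // unescaped markup as ordinary text.
        String markup = description.getTextContent().trim();
        // Second pass: parse that text as its own small XML document.
        Element img = parse(markup).getDocumentElement();
        return img.getAttribute("src");
    }
}
```

With XOM the equivalent first step would be reading the description's text value rather than looking for a child element.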

I managed to come up with a somewhat hacky solution using regex and pattern matching.
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
String descString = item.getFirstChildElement("description").getValue();
// Parse image URL (hacky)
String imgSrc = "";
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
Matcher matcher = pattern.matcher(descString);
if (matcher.find()) {
imgSrc = descString.substring( matcher.start()+5, matcher.end()-1 );
}
}
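A slightly less hacky variant of the same idea, as a sketch: a capture group removes the need for the +5 and -1 offset arithmetic, and the pattern is compiled once instead of on every loop iteration. The class name ImgSrcRegex and the method extractSrc are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRegex {
    // Compiled once; group 1 is the text between the quotes.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]*)\"");

    // Returns the first src value in the description text, or "" if there is none.
    static String extractSrc(String description) {
        Matcher matcher = SRC.matcher(description);
        return matcher.find() ? matcher.group(1) : "";
    }
}
```

Inside the loop this would be called as extractSrc(descString).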

Related

remove element by position

I have an xml which has a simple set of data.
This data is displayed in a simple table, and each row is assigned an ID based on its position in the XML ( <xsl:value-of select="position()" /> ). I can't add an id attribute to the data because it's not my data, but I need to locate elements by this position and remove them.
public class Delete extends HttpServlet {
private final String XML_FILE = "data.xml";
public void init() throws ServletException {
}
public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
// Disable browser caching
response.setHeader("Cache-Control", "private, no-store, no-cache, must-revalidate");
response.setHeader("Pragma", "no-cache");
response.setDateHeader("Expires", 0);
String index = request.getParameter("delete");
try {
// Load the current data.xml
SAXBuilder builder = new SAXBuilder();
Document xml_document = builder.build(new File(getServletContext().getRealPath("/") + XML_FILE));
Element root = xml_document.getRootElement();
root.removeChild(index);
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat());
outputter.output(xml_document, new FileWriter(getServletContext().getRealPath("/") + XML_FILE));
}
catch(Exception ex) {}
// Once we have processed the input we were given
// redirect the web browser to the main page.
response.sendRedirect("/");
}
public void destroy() {
}
}
This code does not remove the correct data. Does anyone know how to find a child of the root element by its position?
@rolfl
int index = Integer.parseInt(delete);
Element root = xml_document.getRootElement();
root.getChildren().remove(index);
This does not remove any elements.
Your problem is that the servlet reads the index to remove as a String, so it ends up calling the removeChild(String) method, which looks for the first child whose element name matches that string value.
What you want to do instead is convert the index to an int and then treat the children of the root as a List, something like:
int index = Integer.parseInt(request.getParameter("delete"));
root.getChildren().remove(index);
See the documentation for getChildren().
This is how I got it to work. I'm not sure it's a great solution, but it works.
SAXBuilder builder = new SAXBuilder();
Document xml_document = builder.build(new File(getServletContext().getRealPath("/") + XML_FILE));
// Get root element
Element root = xml_document.getRootElement();
// Create a list of the children of the root element
List<Element> kids = root.getChildren();
// Iterate through the list of elements and delete (detach) the element at position index.
int i = 1;
for (Element element : kids)
{
if(i == index)
{
element.detach();
break;
}
else
{
i = i + 1;
}
}
I got the root element with
Element root = xml_document.getRootElement();
Made a list of its child elements with
List<Element> kids = root.getChildren();
Then iterated through this list until I reached the index of the element to delete, and called .detach() on that element:
int i = 1;
for (Element element : kids)
{
if(i == index)
{
element.detach();
break;
}
else
{
i = i + 1;
}
}
If anyone can update this to show an easier way to remove the element please do so. It feels like there must be an easier way to detach an element without the iteration. Anyway, as I said it works.
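One possible way to drop the loop, sketched under assumptions: XSLT's position() counts from 1 while a Java List counts from 0, so the position has to be shifted by one before calling remove. That mismatch may also be why remove(index) appeared to do nothing in the earlier attempt, since an out-of-range index throws and the empty catch block hides it. The sketch uses a plain List as a stand-in for the live list that getChildren() returns; RemoveByPosition and removeAtPosition are hypothetical names.

```java
import java.util.List;

public class RemoveByPosition {
    // Removes the item at a 1-based position (as produced by position() in XSLT).
    // Returns false instead of throwing when the position is out of range.
    static <T> boolean removeAtPosition(List<T> children, int position) {
        int index = position - 1; // convert 1-based position to 0-based index
        if (index < 0 || index >= children.size()) {
            return false;
        }
        children.remove(index);
        return true;
    }
}
```

In the servlet this would presumably be called as removeAtPosition(root.getChildren(), Integer.parseInt(request.getParameter("delete"))), with the result checked before writing the file back.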

Java - XML Parsing using XPATH

I have XML:
<Table>
<Row ss:Index="74" ss:AutoFitHeight="0" ss:Height="14">
<Cell ss:Index="1" ss:MergeAcross="3" ss:StyleID="s29">
<ss:Data ss:Type="Number" xmlns="http://www.w3.org/TR/REC-html40">
0.00
</ss:Data>
</Cell>
<Cell ss:Index="15" ss:MergeAcross="5" ss:StyleID="s29">
<ss:Data ss:Type="Number" xmlns="http://www.w3.org/TR/REC-html40">
4.57
</ss:Data>
</Cell>
</Row>
Here is code used to extract the content, eg. "0.00", based on row index & cell index:
public static String getCellValueNum(String filename, int rowIdx, int colIdx) {
// search for Table element anywhere in the source
String tableElementPattern = "//*[name()='Table']";
// search for Row element with given number
String rowPattern = String.format("/*[name()='Row' and @ss:Index='%d']", rowIdx) ;
// search for Cell element with given column number
String cellPattern = String.format("/*[name()='Cell' and @ss:Index='%d']", colIdx) ;
// search for element that has ss:Type="String" attribute, search for element with text under it and get text name
String cellStringContent = "/*[@ss:Type='Number']/*[text()]/text()";
String completePattern = tableElementPattern + rowPattern + cellPattern + cellStringContent;
try (FileReader reader = new FileReader(filename)) {
XPath xPath = getXpathProcessor();
Node n = (Node)xPath.compile(completePattern)
.evaluate(new InputSource(reader), XPathConstants.NODE);
if (n.getNodeType() == Node.TEXT_NODE) {
return n.getNodeValue().trim();
}
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
private static XPath getXpathProcessor() {
// this is where the custom implementation of NamespaceContext is used
NamespaceContext context = new NamespaceContextMap(
"html", "http://www.w3.org/TR/REC-html40",
//"xsl", "http://www.w3.org/1999/XSL/Transform",
"o", "urn:schemas-microsoft-com:office:office",
"x", "urn:schemas-microsoft-com:office:excel",
"ss", "urn:schemas-microsoft-com:office:spreadsheet");
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(context);
return xpath;
}
It works perfectly fine when ss:Type='String', but when ss:Type='Number' it gives this error:
java.lang.NullPointerException
at XpathBill.getCellValueNum(XpathBill.java:55)
at XpathBill.main(XpathBill.java:100)
I think here:
if (n.getNodeType() == Node.TEXT_NODE)
It should be something else instead of TEXT_NODE. I tried the other node type constants, but that didn't work.
Please Help.
Thank you!
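A likely cause, offered as a guess since the String-typed cells are not shown: evaluate(..., XPathConstants.NODE) returns null when nothing matches, and the trailing /*[text()]/text() step asks for a text node inside a child element of Data. In the Number cells above, the text sits directly under ss:Data, so nothing matches and n.getNodeType() throws. Below is a minimal sketch that asks for the string value of the Data element instead, which comes back as an empty string rather than null when there is no match. The class CellValueSketch, the inline SAMPLE document and the anonymous NamespaceContext are assumptions for illustration, not the original NamespaceContextMap.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CellValueSketch {
    static final String SS = "urn:schemas-microsoft-com:office:spreadsheet";

    // Hypothetical sample mirroring the structure shown in the question.
    static final String SAMPLE =
        "<Table xmlns:ss='" + SS + "'>"
      + "<Row ss:Index='74'>"
      + "<Cell ss:Index='1'><ss:Data ss:Type='Number'> 0.00 </ss:Data></Cell>"
      + "<Cell ss:Index='15'><ss:Data ss:Type='Number'> 4.57 </ss:Data></Cell>"
      + "</Row></Table>";

    // Returns the trimmed cell text, or "" when no matching cell exists.
    static String cellValue(String xml, int rowIdx, int colIdx) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the ss: prefix to resolve
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "ss".equals(prefix) ? SS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            // Stop at the Data element; its string value is the text it contains.
            String expr = String.format(
                "//*[name()='Table']/*[name()='Row' and @ss:Index='%d']"
              + "/*[name()='Cell' and @ss:Index='%d']/*[@ss:Type='Number']",
                rowIdx, colIdx);
            String value = (String) xpath.evaluate(expr, doc, XPathConstants.STRING);
            return value.trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }
}
```

If the NODE return type is kept instead, a null check on n before calling n.getNodeType() avoids the exception either way.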

Java XML DOM error when adding elements

I am trying to replicate this XML:
<?xml version="1.0"?>
<AccessRequest xml:lang="en-US">
<AccessLicenseNumber>YourLicenseNumber</AccessLicenseNumber>
<UserId>YourUserID</UserId>
<Password>YourPassword</Password>
</AccessRequest>
<?xml version="1.0"?>
<AddressValidationRequest xml:lang="en-US">
<Request>
<TransactionReference>
<CustomerContext>Your Test Case Summary Description</CustomerContext>
<XpciVersion>1.0</XpciVersion>
</TransactionReference>
<RequestAction>XAV</RequestAction>
<RequestOption>3</RequestOption>
</Request>
<AddressKeyFormat>
<AddressLine>AIRWAY ROAD SUITE 7</AddressLine>
<PoliticalDivision2>SAN DIEGO</PoliticalDivision2>
<PoliticalDivision1>CA</PoliticalDivision1>
<PostcodePrimaryLow>92154</PostcodePrimaryLow>
<CountryCode>US</CountryCode>
</AddressKeyFormat>
</AddressValidationRequest>
I am using one class to build the request:
public UpsRequestBuilder()
{
try
{
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
doc = docBuilder.newDocument();
}
catch(Exception e)
{
System.out.println(e.getMessage());
}
}
public void accessRequestBuilder(String accessKey, String username, String password)
{
Element accessRequest = doc.createElement("AccessRequest");
doc.appendChild(accessRequest);
Element license = doc.createElement("AccessLicenseNumber");
accessRequest.appendChild(license);
license.setTextContent(accessKey);
Element userId = doc.createElement("UserId");
accessRequest.appendChild(userId);
userId.setTextContent(username);
Element pass = doc.createElement("Password");
accessRequest.appendChild(pass);
pass.setTextContent(password);
System.out.println("completed Requestbuilder");
}
public void addAddress(Address address)
{
Element addressKeyFormat = doc.createElement("AddressKeyFormat");
doc.appendChild(addressKeyFormat);
Element addressLine = doc.createElement("AddressLine");
addressKeyFormat.appendChild(addressLine);
addressLine.setTextContent(address.getState() + ' ' + address.getStreet2());
Element city = doc.createElement("PoliticalDivision2");
addressKeyFormat.appendChild(city);
city.setTextContent(address.getCity());
Element state = doc.createElement("PoliticalDivision1");
addressKeyFormat.appendChild(state);
state.setTextContent(address.getState());
Element zip = doc.createElement("PostcodePrimaryLow");
addressKeyFormat.appendChild(zip);
zip.setTextContent(address.getZip());
Element country = doc.createElement("CountryCode");
addressKeyFormat.appendChild(country);
country.setTextContent(address.getCountry());
System.out.println("completed addAddress");
}
public void validateAddressRequest(String customerContextString, String action)
{
Element addressValidation = doc.createElement("AddressValidationRequest");
doc.appendChild(addressValidation);
Element transactionReference = doc.createElement("TransactionReference");
addressValidation.appendChild(transactionReference);
Element customerContext = doc.createElement("CustomerContext");
Element version = doc.createElement("XpciVersion");
transactionReference.appendChild(customerContext);
customerContext.setTextContent(customerContextString); //TODO figure out a way to optionally pass context text
transactionReference.appendChild(version);
version.setTextContent("1.0");//change this if the api version changes
Element requestAction = doc.createElement("RequestAction");
addressValidation.appendChild(requestAction);
requestAction.setTextContent(action);
System.out.println("completed validateAddressRequest");
}
And this is the function that uses it:
public void validateAddress(Address address)
{
UpsRequestBuilder request = new UpsRequestBuilder();
request.accessRequestBuilder(accessKey, username, password);
request.validateAddressRequest("", "3");
request.addAddress(address);
System.out.println(request.toString());
}
When I try and print out the XML from this, I get the error "HIERARCHY_REQUEST_ERR: An attempt was made to insert a node where it is not permitted." It happens in the validateAddressRequest function when I try and add the addressValidation element to the document (doc). Here is the exact line:
doc.appendChild(addressValidation);
What is the problem with adding this element to the document?
what is the problem with adding this element to the document?
You're trying to add it at the top level of the document. You can't do that, as the document already has a root element. Any XML document can only have a single root element.
The XML you've shown at the top of your question isn't a single XML document - it's two.
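A minimal sketch of what that means in code, assuming the two requests are built as two separate Document objects. TwoDocumentsSketch and its method names are hypothetical; only standard JAXP and DOM calls are used.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class TwoDocumentsSketch {
    // Creates a fresh document that has exactly one root element.
    static Document newDocumentWithRoot(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Shows the failure from the question: a second root on the same document.
    static boolean secondRootRejected(Document doc) {
        try {
            doc.appendChild(doc.createElement("AddressValidationRequest"));
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }
}
```

In the builder class this would translate to one Document for AccessRequest and another for AddressValidationRequest, with AddressKeyFormat appended to the AddressValidationRequest element rather than to the document itself.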

Parsing with htmlcleaner

I developed a method that extracts items from a specific class using htmlcleaner. Now I was wondering:
How would you extract the body and all the elements inside it from an HTML document using htmlcleaner?
public String htmlParser(String html){
TagNode rootNode;
HtmlCleaner html_cleaner = new HtmlCleaner();
rootNode = html_cleaner.clean(html);
TagNode[] items = rootNode.getElementsByName("body", true);
ParseBody(items[0]);
html = item_found;
return html;
}
String item_found = ""; // start empty so the concatenation below does not begin with "null"
public void ParseBody(TagNode root){
if(root.getAllElements(true).length > 0){
for(TagNode node: root.getAllElements(true)){
ParseBody(node);
}
}else{
item_found = item_found + root.toString(); // root.toString() only returns the tag name of the TagNode
// Here I wanted just the text of all items in the body, but it would still be beneficial for everyone if the question is complete.
//if(root.getText().toString() != null || !(root.getText().toString().equals("null"))){
//item_found = item_found + root.getText().toString();
//}
}
}

how can i get data out of DIV using html parser in java

I am using the Java HTML Parser to try to parse this line.
<td class=t01 align=right><div id="OBJ123" name=""></div></td>
But I am looking for the value like I see on my web browser, which is a number. Can you help me get the value?
Please let me know if you need more details.
Thanks
From the documentation, all you have to do is find all of the DIV elements that also have an id of OBJ123 and take the first result's value.
NodeList nl = parser.parse(null); // you can also filter here
NodeList divs = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("DIV"),
new HasAttributeFilter("id", "OBJ123")));
if( divs.size() > 0 ) {
Tag div = (Tag) divs.elementAt(0); // elementAt returns a Node, so cast it
String text = div.getText(); // this is the text of the div
}
UPDATE: if you're looking at the ajax url, you can use similar code like:
// make some sort of constants for all the positions
final int OPEN_PRICE = 0;
final int HIGH_PRICE = 1;
final int LOW_PRICE = 2;
// ....
NodeList nl = parser.parse(null); // you can also filter here
NodeList values = nl.extractAllNodesThatMatch(
new AndFilter(new TagNameFilter("TD"),
new HasAttributeFilter("class", "t1")));
if( values.size() > 0 ) {
Tag openPrice = (Tag) values.elementAt(OPEN_PRICE); // elementAt returns a Node, so cast it
String openPriceValue = openPrice.getText(); // this is the text of the cell
}
