xpaths not working in java - java

I am trying to access a URL, get the HTML from it, and use XPath to extract certain values. I am getting the HTML just fine, and JTidy seems to be cleaning it appropriately. However, when I try to get the desired values using XPath, I get an empty NodeList back. I know my XPath expression is correct; I have tested it in other ways. What's wrong with this code? Thanks for the help.
String url_string = base_url + countries[c];
URL url = new URL(url_string);
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document doc = tidy.parseDOM(url.openStream(), null);
//tidy.pprint(doc, System.out);
String xpath_string = "id('catlisting')//a";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpath_string);
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
System.out.println("size="+nodes.getLength());
for (int r=0; r<nodes.getLength(); r++) {
System.out.println(nodes.item(r).getNodeValue());
}

Try "//div[@id='catlisting']//a"
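A likely reason the original expression comes back empty: XPath's id() function only matches attributes that a DTD declares as type ID, and a DOM built without that information has none, whereas @id is an ordinary attribute test. The sketch below illustrates the attribute form. It uses the JDK's own parser in place of JTidy, and the markup and the class name IdAttributeDemo are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class IdAttributeDemo {
    // Made-up markup standing in for the tidied page (assumption, not the real site).
    static final String PAGE =
        "<html><body><div id='catlisting'>"
        + "<ul><li><a href='/a'>Alpha</a></li><li><a href='/b'>Beta</a></li></ul>"
        + "</div><a href='/c'>Outside</a></body></html>";

    // Parses PAGE and returns the nodes selected by the given XPath expression.
    static NodeList select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NodeList links = select("//div[@id='catlisting']//a");
        for (int i = 0; i < links.getLength(); i++) {
            // getNodeValue() is null for element nodes; getTextContent() gives the link text.
            System.out.println(links.item(i).getTextContent());
        }
        // For comparison: the id() form on the same document.
        System.out.println("id() matches: " + select("id('catlisting')//a").getLength());
    }
}
```

Note that the loop prints getTextContent() rather than getNodeValue(), since the latter is always null for element nodes.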

Related

The method createDOM not return document

I use HtmlCleaner 2.6.1 and XPath to parse HTML pages in an Android application.
Here are the HTML pages:
http://www.kino-govno.com/comments/42571-postery-kapitan-fillips-i-poslednij-rubezh
http://www.kino-govno.com/comments/42592-fantasticheskie-idei-i-mesta-ih-obitanija
The first link returns a document and everything is fine. With the second link, this call:
document = domSerializer.createDOM(tagNode);
returns nothing.
If I create a plain Java project without Android, it all works fine.
Here is the code:
String queries = "//div[starts-with(@class, 'news_text op')]/p";
URL url = new URL(link2);
TagNode tagNode = new HtmlCleaner().clean(url);
CleanerProperties cleanerProperties = new CleanerProperties();
DomSerializer domSerializer = new DomSerializer(cleanerProperties);
document = domSerializer.createDOM(tagNode);
xPath = XPathFactory.newInstance().newXPath();
pageNode = (NodeList)xPath.evaluate(queries,document, XPathConstants.NODESET);
String val = pageNode.item(0).getFirstChild().getNodeValue();
That's because HtmlCleaner wraps the paragraphs of the second HTML page in another <div/>, so they are no longer direct children. Use the descendant-or-self axis // instead of the child axis /:
//div[starts-with(@class, 'news_text op')]//p
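The difference between the two axes can be sketched with the JDK's XPath API alone. The markup below is invented to mimic the situation described (one paragraph as a direct child, one wrapped in an extra div), and the class name AxisDemo is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AxisDemo {
    // Invented markup: the second paragraph sits inside an extra wrapper <div>.
    static final String PAGE =
        "<html><body>"
        + "<div class='news_text op1'><p>direct</p></div>"
        + "<div class='news_text op2'><div><p>wrapped</p></div></div>"
        + "</body></html>";

    // Returns how many nodes the expression selects in PAGE.
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Child axis: only <p> elements that are direct children of the matching <div>.
        System.out.println(count("//div[starts-with(@class, 'news_text op')]/p"));
        // Descendant-or-self axis: <p> elements at any depth below the matching <div>.
        System.out.println(count("//div[starts-with(@class, 'news_text op')]//p"));
    }
}
```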

How to parse html data using xpath in java

I am writing Java code that extracts specific data from a particular URL using an XPath expression.
After executing my code, I am not getting the desired results.
Here is my code:
try{
URL oracle = new URL();
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
XPathExpression xPathExpression = xPath.compile("");
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());
}catch(Exception e){
System.out.println("error");
}
Output:
com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList@7e97d1ff
I want the product price from this URL: http://www.flipkart.com/d-link-8-port-10-100m-unmanaged-standalone-switch-network/p/itmdffym2nhwyzvz
The XPath I am using in my code is: /html/body/div/div[2]/div/div/div[3]/div/div/div[3]/div[2]/div/div/div/div/span
Can anyone tell me what I am doing wrong?
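One thing the output itself shows: the printed line is the default toString() of the NodeList object (class name plus hash code), not the content of the selected nodes. A minimal sketch of iterating the list and reading each node's text follows. It uses the JDK parser instead of JTidy, and the markup, the short XPath, and the class name NodeListPrinting are placeholders, not the real page or expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeListPrinting {
    // Placeholder markup; the real page structure is not known here.
    static final String PAGE =
        "<html><body><div><span> price text </span></div></body></html>";

    // Evaluates the expression against PAGE and collects the text of every selected node.
    static List<String> texts(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() returns the text inside the element.
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String text : texts("/html/body/div/span")) {
            System.out.println(text);
        }
    }
}
```

If the list turns out to be empty for the real page, the expression itself is the next thing to check, since a long absolute path depends on the exact structure the cleaner produces.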

Read sitemap with XPath

I want to read a sitemap with XPath, but it doesn't work.
Here is my code:
private void evaluate2(String src){
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
try{
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new ByteArrayInputStream(src.getBytes()));
System.out.println(src);
XPathFactory xp_factory = XPathFactory.newInstance();
XPath xpath = xp_factory.newXPath();
XPathExpression expr = xpath.compile("//url/loc");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
items.add(nodes.item(i).getNodeValue());
System.out.println(nodes.item(i).toString());
}
}catch(Exception e){
System.out.println(e.getMessage());
}
}
Before this, I retrieve the remote source of the sitemap and pass it to evaluate2 through the variable src.
The call System.out.println(nodes.getLength()); displays 0.
My XPath query itself works, because the same query works in PHP.
Do you see any errors in my code?
Thanks
You parse the sitemap with a namespace-aware parser (that's what factory.setNamespaceAware(true) does), but then attempt to access it using an XPath that does not use a namespace resolver (or reference any namespaces).
The simplest solution is to configure the parser as not namespace aware. As long as you're just parsing a self-contained sitemap, that shouldn't be a problem.
One more problem in your code is that you pass the sitemap contents as a String, then convert that String using the platform default encoding. This will work as long as your platform-default encoding matches that of the actual bytes that you retrieved from the server (assuming that you also created the string using the platform-default encoding). If it doesn't, you're likely to get a conversion error.
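A minimal sketch of both suggestions, assuming the sitemap is available as raw bytes: the parser is left non-namespace-aware so that //url/loc matches, and the bytes go straight to the parser so it can honor the encoding declared in the XML. The sample sitemap and the class name SitemapPlain are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class SitemapPlain {
    // Invented sample; a real sitemap would be downloaded as bytes.
    static final String SITEMAP =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>http://example.com/</loc></url>"
        + "<url><loc>http://example.com/about</loc></url>"
        + "</urlset>";

    // Parses the raw bytes without namespace awareness and returns every <loc> text.
    static List<String> locations(byte[] rawBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false); // the default, stated explicitly here
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(rawBytes));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//url/loc", doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Element nodes have a null node value; the text is in getTextContent().
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        locations(SITEMAP.getBytes(StandardCharsets.UTF_8)).forEach(System.out::println);
    }
}
```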
I think the input has a namespace. So you would have to initialize the NamespaceContext for the xpath object and change your XPath to use prefixes, i.e. //url/loc should be //ns:url/ns:loc,
and then add the namespace prefix binding to the namespace context object.
You can find a NamespaceContext implementation in Apache Commons: http://ws.apache.org/commons/util/apidocs/index.html
ws-commons-utils
NamespaceContextImpl namespaceContextObj = new NamespaceContextImpl();
namespaceContextObj.startPrefixMapping("ns", "http://sitename/xx");
xpath.setNamespaceContext(namespaceContextObj);
XPathExpression expr = xpath.compile("//ns:url/ns:loc");
In case you don't know which namespaces are coming, you can get them from the document itself, but I doubt it'll be of much use. There are a few how-tos here:
http://www.ibm.com/developerworks/xml/library/x-nmspccontext/index.html
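The same idea can be sketched without an extra library by implementing javax.xml.namespace.NamespaceContext directly. The sketch assumes the sitemap uses the standard sitemap namespace http://www.sitemaps.org/schemas/sitemap/0.9. The prefix "ns", the sample document, and the class name SitemapNamespaced are arbitrary choices for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SitemapNamespaced {
    // Assumption: the document uses the standard sitemap namespace.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    static final String SITEMAP =
        "<urlset xmlns=\"" + SITEMAP_NS + "\">"
        + "<url><loc>http://example.com/</loc></url>"
        + "<url><loc>http://example.com/about</loc></url>"
        + "</urlset>";

    // Binds the prefix "ns" (an arbitrary choice) to the sitemap namespace.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "ns".equals(prefix) ? SITEMAP_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return SITEMAP_NS.equals(uri) ? "ns" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    // Parses namespace-aware and selects <loc> elements through the bound prefix.
    static List<String> locations(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(
                "//ns:url/ns:loc", doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        locations(SITEMAP).forEach(System.out::println);
    }
}
```

The prefix used in the XPath only has to match the one bound in the NamespaceContext; it does not have to match any prefix used in the document.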
I can't see any errors in your code, so I guess the problem is the source.
Are you sure that the source contains this element?
Maybe you could try this code to parse the String into a Document:
builder.parse(new InputSource(new StringReader(xml)));

Parsing using HTMLParser

Parser parser = new Parser();
parser.setInputHTML("d:/index.html");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
/*
SimpleNodeIterator sNI=list.elements();
while(sNI.hasMoreNodes()){
System.out.println(sNI.nextNode().getText());}
*/
NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"),true);
for(int i=0;i<trs.size();i++) {
NodeList nodes = trs.elementAt(i).getChildren();
NodeList tds = nodes.extractAllNodesThatMatch(new TagNameFilter("td"),true);
System.out.println(tds.toString());
I am not getting any output; Eclipse shows javaw.exe terminated.
Pass the path to the resource into the constructor.
Parser parser = new Parser("index.html");
Parse and print all the divs on this page:
Parser parser = new Parser("http://stackoverflow.com/questions/7293729/parsing-using-htmlparser/");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
NodeList div = nl.extractAllNodesThatMatch(new TagNameFilter("div"),true);
System.out.println(div.toString());
parser.setInputHTML(String inputHTML) doesn't do what you think it does. It treats inputHTML as the HTML input to the parser itself, not as a path. Use the constructor to point the parser at an HTML resource (file or URL).
Example:
Parser parser = new Parser();
parser.setInputHTML("<div>Foo</div><div>Bar</div>");

XML Searching and Parsing

I have an XML file that I am trying to search using Java. I just need to find an element by its Tag name and then find that Tag's value. So for example:
I have this XML file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://company.com/test/xslt/processing_report.xslt"?>
<Certificate xmlns="urn:us:net:exchangenetwork:Company">
<Value1>Veggie</Value1>
<Value2>Fruits</Value2>
<type1>Apple</type1>
<FindME>Red</FindME>
<Value3>Bread</Value3>
</Certificate>
I want to find the value inside of the FindME Tag. I can't use XPath because different files can have different structures, but they always have a FindME tag. Lastly I am looking for the simplest piece of code, I do not care much about performance. Thank you
Here is the code:
XPathFactory f = XPathFactory.newInstance();
XPathExpression expr = f.newXPath().compile(
"//*[local-name() = 'FindME']/text()");
DocumentBuilderFactory domFactory = DocumentBuilderFactory
.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("src/test.xml"); //your XML file
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
Explained:
//* - matches any element node, no matter where it is in the document
local-name() = 'FindME' - where the local name (the element name without any namespace prefix or URI) is 'FindME'
text() - selects the element's text node, whose node value is the text.
I think you need to read up on XPath because it can very easily solve this problem. So can using getElementsByTagName in the DOM API.
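A minimal sketch of the DOM route, using getElementsByTagNameNS with "*" as the namespace so the element is found whatever namespace it is in. The XML is a trimmed copy of the sample from the question, and the class name FindByTagName is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FindByTagName {
    // Trimmed copy of the sample document from the question.
    static final String XML =
        "<Certificate xmlns=\"urn:us:net:exchangenetwork:Company\">"
        + "<Value1>Veggie</Value1>"
        + "<FindME>Red</FindME>"
        + "<Value3>Bread</Value3>"
        + "</Certificate>";

    // Returns the text of the first element with the given local name, or null if absent.
    static String firstText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, so the default namespace on the root is not an obstacle.
            NodeList matches = doc.getElementsByTagNameNS("*", localName);
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(XML, "FindME"));
    }
}
```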
You can still use XPath. All you need is the //FindME expression (read up on // usage). It finds the FindME elements anywhere in the XML, irrespective of their parent or path from the root.
If you are using namespaces, make sure the parser is made aware of that.
String findMeVal = null;
InputStream is = //...
XmlPullParser parser = //...
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, true);
parser.setInput(is, null);
int event;
while (XmlPullParser.END_DOCUMENT != (event = parser.next())) {
if (event == XmlPullParser.START_TAG) {
if ("FindME".equals(parser.getName())) {
findMeVal = parser.nextText();
break;
}
}
}
