Unable to retrieve web data in java using tidy and Xpath


I am trying to scrape the inner text of a single element from an XHTML file.
I have narrowed my search down to the element node, but I fail to retrieve its text.
Please note: the element node has no child node, and I get a NullPointerException when I try to read one.
Here is the HTML snippet:
<div id="dvTitle" class="titlebtmbrdr01" style="line-height: 22px;">BAJAJ AUTO LTD. </div>
Please also note that this file uses the namespace
http://www.w3.org/1999/xhtml
This is the div element from which I want to extract BAJAJ AUTO LTD.
Here is the code that I am using:
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Vector;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import jxl.read.biff.BiffException;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import com.sun.org.apache.xml.internal.serialize.Serializer;
public class BSEQuotesExtractor implements valueExtractor {
    @Override
    public Vector<String> getName(Document d) throws XPathExpressionException, RowsExceededException, BiffException, WriteException, IOException {
        // TODO Auto-generated method stub
        XPathFactory factory = XPathFactory.newInstance();
        XPath xpath = factory.newXPath();
        xpath.setNamespaceContext(new MynamespaceContext());
        Object result = xpath.evaluate("//*[@id='dvTitle']", d, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        System.out.println(nodes.getLength());
        System.out.println(nodes.item(0).getNodeName());
        System.out.println(nodes.item(0).getAttributes().item(1).getNodeName());
        System.out.println(nodes.item(0).getAttributes().item(1).getNodeValue());
        System.out.println(nodes.item(0).getTextContent());
        return null;
    }

    public static void main(String[] args) throws MalformedURLException, IOException, XPathExpressionException, RowsExceededException, BiffException, WriteException {
        BSEQuotesExtractor q = new BSEQuotesExtractor();
        DOMParser parser = new DOMParser(new URL("http://www.bseindia.com/bseplus/StockReach/StockQuote/Equity/BAJAJ%20AUTO%20LTD/BAJAJAUT/532977/Scrips").openStream());
        Document d = parser.getDocument();
        q.getName(d);
    }
}
And this is the output I get
1
div
dvTitle
null
Now why do I get that null? I should get BAJAJ AUTO LTD.

When I open the page your code references, that div actually is empty for me:
<div class="titlebtmbrdr01" id="dvTitle" style="line-height: 22px;"></div>
So perhaps you should save the page content to some file to examine if it is the same for you. If it is, but your browser displays things differently, then figure out what combination of cookies and other headers makes a difference there.
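One way to follow this advice is to write the bytes that the Java program actually receives to a file and open that file in an editor. The following is only a minimal sketch, assuming Java 7 or later; the class name PageDump and the temporary file are hypothetical, and an in-memory string stands in for the stream returned by new URL(...).openStream() so that the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PageDump {
    /** Copies the raw response bytes to a file and returns the number of bytes written. */
    public static long save(InputStream in, Path target) throws IOException {
        // try-with-resources closes the stream even if the copy fails.
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real response; replace with new URL("...").openStream().
        String page = "<div class=\"titlebtmbrdr01\" id=\"dvTitle\" style=\"line-height: 22px;\"></div>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("page-dump", ".html");
        long bytes = save(in, target);
        System.out.println("Wrote " + bytes + " bytes to " + target);
    }
}
```

If the saved file shows an empty div while the browser shows the company name, the difference most likely comes from request headers, cookies, or scripts that run only in the browser.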

Related

XPath concat same tagname

I'm trying to produce output based on the given XML by joining the values of elements that share the same tag name.
This is what I've done so far:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class TestXPath {
    public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, XPathExpressionException {
        String xml =
            "<ROOT>" +
            " <coolnessId>9</coolnessId>" +
            " <cars id=\"3\">0</cars>" +
            " <cars id=\"2\">1</cars>" +
            " <cars id=\"1\">2</cars>" +
            "</ROOT>";
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // XPathExpression expr = xpath.compile("concat(//ROOT/cars,'-',//ROOT/coolnessId)");
        XPathExpression expr = xpath.compile("concat(//ROOT/cars,'-')");
        // XPathExpression expr = xpath.compile("concat(//*[contains(name(), 'cars')],'')");
        System.out.println(expr.evaluate(doc, XPathConstants.STRING));
    }
}
This code produces:
0-
But this is what it should produce:
2-1-0
As you can see, the values follow the "id" attribute of each "cars" tag.
I've rearranged the expression many times but can't achieve this result.
Please keep in mind that I'm on a very old environment, a Java 1.4 runtime.

I think it's going to be simplest to retrieve the nodes using XPath, and then concatenate the string values in Java code.
Any other solution involves upgrading your technology: XSLT, XPath 2.0+, etc, and that isn't going to be easy on a JDK 1.4 platform.
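The reason the original expression prints only "0-" is that XPath 1.0 converts a node-set argument of concat() to the string value of its first node. A rough sketch of the Java-side approach follows. It assumes that "follow the id attribute" means ascending numeric id order, and the class name JoinCars is hypothetical. The code avoids generics and the enhanced for loop with older compilers in mind, but whether calls such as getTextContent() are available depends on which JAXP and DOM jars the old runtime provides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JoinCars {
    /** Joins the text of every /ROOT/cars element with '-', ordered by ascending numeric id. */
    public static String joinCars(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList found = (NodeList) xpath.evaluate("/ROOT/cars", doc, XPathConstants.NODESET);

        // Copy the nodes into an array so they can be reordered.
        Element[] cars = new Element[found.getLength()];
        for (int i = 0; i < cars.length; i++) {
            cars[i] = (Element) found.item(i);
        }

        // Insertion sort by the numeric id attribute (assumed ordering rule).
        for (int i = 1; i < cars.length; i++) {
            Element current = cars[i];
            int currentId = Integer.parseInt(current.getAttribute("id"));
            int j = i - 1;
            while (j >= 0 && Integer.parseInt(cars[j].getAttribute("id")) > currentId) {
                cars[j + 1] = cars[j];
                j--;
            }
            cars[j + 1] = current;
        }

        // Concatenate the string values in Java instead of in XPath.
        StringBuffer joined = new StringBuffer();
        for (int i = 0; i < cars.length; i++) {
            if (i > 0) {
                joined.append('-');
            }
            joined.append(cars[i].getTextContent());
        }
        return joined.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROOT><coolnessId>9</coolnessId>"
                + "<cars id=\"3\">0</cars><cars id=\"2\">1</cars><cars id=\"1\">2</cars></ROOT>";
        System.out.println(joinCars(xml)); // prints 2-1-0
    }
}
```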

Xpath 2.0 functions not working in Java using Saxon

I'm writing some simple code to scrape data from a web page using Selenium and XPath 2.0 functions.
Since Selenium supports only XPath 1.0 functions, I am trying to use Saxon.
I have downloaded Saxon and extracted the Saxon9he.jar files into the path "C:\Program Files\Java\jre1.8.0_111\lib\ext".
I have created a file "jaxp.properties" with the following lines:
javax.xml.transform.TransformerFactory = net.sf.saxon.TransformerFactoryImpl
javax.xml.xpath.XPathFactory","net.sf.saxon.xpath.XPathFactoryImpl
I have also included the jar files in the Eclipse library.
However, I am not able to fetch values with the XPath 2.0 functions.
In my code, if I use
XPathFactory factory = XPathFactory.newInstance();
instead of
XPathFactory factory = XPathFactory.newInstance(NamespaceConstant.OBJECT_MODEL_SAXON);
then I am able to use the XPath 1.0 functions. But I need XPath 2.0 functions. Please guide me on this.
My code is:
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFactoryConfigurationException;
import javax.xml.xpath.XPathFunctionResolver;
import javax.xml.xpath.XPathVariableResolver;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import net.sf.saxon.lib.NamespaceConstant;
import net.sf.saxon.xpath.XPathFactoryImpl;
public class XpathCheckClass {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, XPathFactoryConfigurationException, XPathExpressionException {
        WebDriver dr = new FirefoxDriver();
        dr.get("http://s15.a2zinc.net/clients/hartenergy/midstream17/Public/eBooth.aspx?Nav=False&BoothID=137384");
        try {
            Thread.sleep(3000);
        } catch (Exception e) {
        }
        String source = dr.getPageSource();
        Document doc = null;
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = db.parse(new InputSource(new StringReader(source)));
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.setProperty("javax.xml.xpath.XPathFactory:" + NamespaceConstant.OBJECT_MODEL_SAXON, "net.sf.saxon.xpath.XPathFactoryImpl");
        XPathFactory factory = XPathFactory.newInstance(NamespaceConstant.OBJECT_MODEL_SAXON);
        // XPathFactory factory = XPathFactory.newInstance(); ---> default xpath factory
        XPath xpath = factory.newXPath();
        XPathExpression expr = xpath.compile("if(//h2) then //h2 else //h1");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        System.out.println(nodes.getLength());
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
        dr.close();
    }
}

Recent releases of Saxon no longer advertise themselves as JAXP XPath services, so you need to instantiate the XPath factory explicitly, using the same class that your code already imports:
XPathFactory xf = new net.sf.saxon.xpath.XPathFactoryImpl();

xml file is successfully parsed from netbeans but not from runnable jar

I developed code that parses an XML file and shows the result in a frame in NetBeans. The code runs successfully from NetBeans, but after exporting the project to a jar file it shows nothing. If you have any ideas, I would be thankful for your help. Here is the code I used for parsing:
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
public class Parsing {
    String a = null;

    public static void main(String a) throws
            ParserConfigurationException, SAXException, IOException,
            XPathExpressionException {
        a = getElem(a);
    }

    public static String getElem(String a) throws ParserConfigurationException, SAXException, XPathExpressionException, IOException {
        String file = "src/xml/read.xml";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(file);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node CustomerId = (Node) xpath.evaluate("//Operation[@name='Read' and @modifier='Customer']/ParameterList/StringParameter[@name='CustomerId']/text()",
                doc.getDocumentElement(), XPathConstants.NODE);
        a = CustomerId.getNodeValue();
        return a;
    }
}
When I call the method getElem(a) from another frame, it shows the value of a in a text box, but after exporting the project to a jar file it doesn't show anything!

The first thing that strikes me is the use of a relative path to your XML resource. It seems you rely on the existence of a file in a directory relative to your working directory.

The src directory will not exist at run time. This suggests that the XML file is an embedded resource and, as such, can't be accessed the way a file on the file system would be.
Instead, you want to use something like...
getClass().getResourceAsStream("/xml/read.xml")
and pass the resulting InputStream to DocumentBuilder. Don't forget to close it.
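As a rough illustration of this advice, the sketch below loads the XML from the classpath and closes the stream with try-with-resources (Java 7 or later). The class name ResourceXml and the in-memory fallback document are assumptions made so that the example runs even when /xml/read.xml is not on the classpath; the XPath expression is the one from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ResourceXml {
    private static final String CUSTOMER_ID_XPATH =
            "//Operation[@name='Read' and @modifier='Customer']"
            + "/ParameterList/StringParameter[@name='CustomerId']/text()";

    /** Parses the stream and returns the CustomerId text, or "" if nothing matches. */
    public static String readCustomerId(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(CUSTOMER_ID_XPATH, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        // The leading slash resolves the name against the classpath root,
        // which also works when the classes are packaged in a jar.
        InputStream resource = ResourceXml.class.getResourceAsStream("/xml/read.xml");
        if (resource == null) {
            // Hypothetical fallback so the sketch runs without the real file.
            String sample = "<Operation name=\"Read\" modifier=\"Customer\"><ParameterList>"
                    + "<StringParameter name=\"CustomerId\">42</StringParameter>"
                    + "</ParameterList></Operation>";
            resource = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        }
        // try-with-resources closes the stream once parsing is done.
        try (InputStream in = resource) {
            System.out.println(readCustomerId(in));
        }
    }
}
```

Checking the result of getResourceAsStream for null is worthwhile, because a missing resource is reported that way rather than with an exception.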

Here is the modified code that worked with the jar file :)
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class Parsing {
    String a = null;
    private static final String TEST_XML = "/xml/read.xml";

    public static void main(String a) throws
            ParserConfigurationException, SAXException, IOException,
            XPathExpressionException {
        a = getElem(a);
    }

    protected static InputSource getTestXMLInputSource() {
        // Load the XML from the classpath so it is also found inside the jar.
        InputStream is = Parsing.class.getResourceAsStream(TEST_XML);
        return new InputSource(is);
    }

    public static String getElem(String a) throws ParserConfigurationException, SAXException, XPathExpressionException, IOException {
        final InputSource source = getTestXMLInputSource();
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(source);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node CustomerId = (Node) xpath.evaluate("//Operation[@name='Read' and @modifier='Customer']/ParameterList/StringParameter[@name='CustomerId']/text()",
                doc.getDocumentElement(), XPathConstants.NODE);
        a = CustomerId.getNodeValue();
        return a;
    }
}

Issues with xpath that contained namespaces in Java (Dom parser)

I'm trying to parse an RDFS XML file in order to find all the classes in it.
The XPath "/rdf:RDF/rdfs:Class"
works in my XML editor.
When I use the XPath in my Java program (I have implemented a DOM parser), I get 0 classes.
The following example runs, but it outputs 0 classes!
I do:
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import org.xml.sax.SAXException;
public class Main {
    public static void main(String args[]) throws XPathExpressionException, ParserConfigurationException, SAXException, IOException {
        FindClasses FSB = new FindClasses();
        FSB.FindAllClasses("C:\\Workspace\\file.xml"); // rdfs file
    }
}
The class FindClasses is as follows:
import java.io.IOException;
import java.util.Collection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class FindClasses {
    public void FindAllClasses(String fileName) throws XPathExpressionException, ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse(fileName);
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression classes_expr = xpath.compile("/rdf:RDF/rdfs:Class");
        Object result = classes_expr.evaluate(doc, XPathConstants.NODESET);
        NodeList classes = (NodeList) result;
        System.out.println("I found : " + classes.getLength() + " classes ");
    }
}
The rdfs file is:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xml:lang="en" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdfs:Class rdf:about="Class1">
</rdfs:Class>
<rdfs:Class rdf:about="Class2">
</rdfs:Class>
</rdf:RDF>
I don't really understand why the XPath returns 0 nodes in this example.
It's weird, because I have implemented other DOM parsers as well and they worked fine.
Can somebody help me?
Thanks

I visited the following link and solved my problem:
Issues with xpath in java
The problem was that the XPath contained two namespace prefixes (rdf, rdfs), as in "/rdf:RDF/rdfs:Class", and the XPath object had not been told which namespace URIs they stand for.
If the XPath did not use any namespace prefixes (e.g. /RDF/Class on a document without namespaces), this issue would not arise.
So after the line:
xpath = XPathFactory.newInstance().newXPath();
and before the line:
XPathExpression classes_expr = xpath.compile("/rdf:RDF/rdfs:Class");
I added the following:
xpath.setNamespaceContext(new NamespaceContext() {
    @Override
    public String getNamespaceURI(String prefix) {
        switch (prefix) {
            case "rdf": return "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
            case "rdfs": return "http://www.w3.org/2000/01/rdf-schema#";
        }
        // Unknown prefix: report "no namespace" instead of echoing the prefix.
        return javax.xml.XMLConstants.NULL_NS_URI;
    }
    @Override
    public String getPrefix(String namespace) {
        // The argument is a namespace URI, so compare it against the URIs.
        if (namespace.equals("http://www.w3.org/1999/02/22-rdf-syntax-ns#")) return "rdf";
        else if (namespace.equals("http://www.w3.org/2000/01/rdf-schema#")) return "rdfs";
        else return null;
    }
    @Override
    public Iterator getPrefixes(String arg0) {
        // TODO Auto-generated method stub
        return null;
    }
});
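For reference, here is a self-contained sketch of the same fix that can be run without the RDFS file. It assumes Java 7 or later; the class name RdfClassCounter and the inline document (a shortened copy of the file from the question) are illustrative only.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RdfClassCounter {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    /** Maps the two prefixes used in the XPath expression to their namespace URIs. */
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            if ("rdf".equals(prefix)) return RDF;
            if ("rdfs".equals(prefix)) return RDFS;
            return XMLConstants.NULL_NS_URI; // unbound prefix
        }

        @Override
        public String getPrefix(String uri) {
            if (RDF.equals(uri)) return "rdf";
            if (RDFS.equals(uri)) return "rdfs";
            return null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    /** Counts the rdfs:Class children of the rdf:RDF root element. */
    public static int countClasses(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so prefixed steps can match by namespace URI
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(CONTEXT);
        NodeList classes = (NodeList) xpath.evaluate("/rdf:RDF/rdfs:Class", doc, XPathConstants.NODESET);
        return classes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:rdfs=\"" + RDFS + "\">"
                + "<rdfs:Class rdf:about=\"Class1\"/><rdfs:Class rdf:about=\"Class2\"/></rdf:RDF>";
        System.out.println("I found : " + countClasses(xml) + " classes"); // I found : 2 classes
    }
}
```

The prefixes in the XPath expression only have to match the ones registered in the NamespaceContext; they do not have to be the same prefixes the document uses, because matching is done on the namespace URI.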

XPath to select all nodes in document with specified name in java

Sample XML
<Games>
<Indoor>
<TT></TT>
<Chess></chess>
<cricket>asd</cricket>
<ComputerGame>
<cricket>asd</cricket>
</ComputerGame>
</Indore>
<Outdoor>
<Football></Football>
<cricket>asd</cricket>
</outdoor>
</Games>
I want to select all the nodes with the node name cricket.
For this I am using:
NodeList nodeList= (NodeList)xpath.compile("//cricket").evaluate(xmlDocument,XPathConstants.NODESET);
But this code doesn't select any cricket node. Please suggest a fix.

Based on your "corrected" example XML...
<Games>
<Indoor>
<TT></TT>
<Chess>
</Chess>
<cricket>asd</cricket>
<ComputerGame>
<cricket>asd</cricket>
</ComputerGame>
</Indoor>
<Outdoors>
<Football></Football>
<cricket>asd</cricket>
</Outdoors>
</Games>
Using the following code...
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class TestXPath101 {
    public static void main(String[] args) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("Test.xml"));
            XPath xPath = XPathFactory.newInstance().newXPath();
            XPathExpression exp = xPath.compile("//cricket");
            NodeList nl = (NodeList) exp.evaluate(doc, XPathConstants.NODESET);
            System.out.println("Found " + nl.getLength() + " results");
        } catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException ex) {
            ex.printStackTrace();
        }
    }
}
I was able to get it to output...
Found 3 results
I would "suspect" that your XML is ill-formed and that you are ignoring any exceptions that are being thrown because of it...
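To check that suspicion, the parse step can be run on its own and the parser's complaint printed. This is a minimal sketch, assuming the standard JAXP parser bundled with the JDK; the class name WellFormedCheck and the shortened documents are illustrative, with the mismatched tags copied from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /** Returns null if the XML parses, otherwise a short description of the first fatal error. */
    public static String firstError(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors, so they surface as exceptions here.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Mismatched tags as in the question: <Chess></chess> and <Indoor>...</Indore>.
        String original = "<Games><Indoor><Chess></chess></Indore></Games>";
        String corrected = "<Games><Indoor><Chess></Chess></Indoor></Games>";
        System.out.println("original:  " + firstError(original));
        System.out.println("corrected: " + firstError(corrected));
    }
}
```

XML element names are case-sensitive, so the original sample fails at the first mismatched end tag and no Document is ever produced for the XPath expression to search.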
