How to parse HTML data using XPath in Java


I am writing Java code that extracts specific data from a particular URL using an XPath expression.
After executing my code, I am not getting the desired result.
Here is my code:
try {
    URL oracle = new URL();
    URLConnection yc = oracle.openConnection();
    InputStream is = yc.getInputStream();
    is = oracle.openStream();
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document tidyDOM = tidy.parseDOM(is, null);
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    XPathExpression xPathExpression = xPath.compile("");
    Object result = xPathExpression.evaluate(tidyDOM, XPathConstants.NODESET);
    System.out.println(result.toString());
} catch (Exception e) {
    System.out.println("error");
}
Output:
com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList@7e97d1ff
I want the product price from this URL: http://www.flipkart.com/d-link-8-port-10-100m-unmanaged-standalone-switch-network/p/itmdffym2nhwyzvz
The XPath I am using in my code is: /html/body/div/div[2]/div/div/div[3]/div/div/div[3]/div[2]/div/div/div/div/span
Can anyone tell me what I am doing wrong?
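The line printed above is only the default toString() of the returned NodeList object, not the text of the matched nodes. A minimal sketch of reading the text of each match follows. It assumes the page has already been turned into a DOM Document; the class name NodeListTextDemo, the helper names, the sample markup, and the //span[@class='price'] expression are all made up for illustration and are not taken from the Flipkart page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeListTextDemo {

    // Hypothetical stand-in for the Document that tidy.parseDOM(...) would return.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Evaluates the expression and collects the text of every matched node,
    // instead of printing the NodeList object itself.
    static List<String> textOfMatches(Document doc, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup, not the real product page.
        String sample = "<html><body><div><span class=\"price\">Rs. 100</span></div></body></html>";
        for (String text : textOfMatches(parse(sample), "//span[@class='price']")) {
            System.out.println(text);
        }
    }
}
```

Some JTidy versions may not support getTextContent() on their DOM nodes; in that case reading node.getFirstChild().getNodeValue() is a possible fallback. A short expression keyed on an attribute is also likely to survive page layout changes better than a long absolute path.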

Related

Storing XML data in a Java object using JAXB

<?xml version="1.0" encoding="UTF-8"?>
<filepaths>
<application_information_ticker>
<desc>Ticker1</desc>
<folder_path>../atlas/info/</folder_path>
</application_information_ticker>
<document_management_system>
<desc></desc>
<folder_path>../atlas/dms/</folder_path>
</document_management_system>
</filepaths>
I have an XML file like this. I need to convert it into a Java object using JAXB, but because of the nested tags I could not get it to work. Please suggest a solution.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlString));
Document doc = builder.parse(is);
// The XPath factory needs its own variable name; "factory" is already taken above
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
xpath.setNamespaceContext(new PersonalNamespaceContext());
XPathExpression expr = xpath.compile("//src_small/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
List<String> list = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
    list.add(nodes.item(i).getNodeValue());
    System.out.println(nodes.item(i).getNodeValue());
}

Apache Tika : How to use XPath queries

I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now I can extract the entire content of the XML, but the tags are lost.
// Detecting the file type
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try {
    inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
} catch (URISyntaxException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
ParseContext pcontext = new ParseContext();
// XML parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}
This shows me the entire content of the XML.
Now I want to extract certain parts of the XML, and since Tika allows XPath queries, I tried this:
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
ContentHandler xhandler = new MatchingContentHandler(
        new ToXMLContentHandler(), divContentMatcher);
AutoDetectParser parser = new AutoDetectParser();
Metadata xmetadata = new Metadata();
try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
    parser.parse(stream, xhandler, xmetadata);
    System.out.println(xhandler.toString());
} catch (URISyntaxException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
But it does not show any output! I was hoping it would give me only the nodes specified in the XPath query.
Any idea what's going on?
By the way, here is the corresponding XML:
<Product productID="xvc22" shortProductID="x" language="en">
<ProductStatus statusType="Published" />
<Source>
<Publisher sequence="1" primaryIndicator="Yes">
<PublisherID idType="Shortname">jjkjkj</PublisherID>
<PublisherID idType="BM">6666</PublisherID>
<PublisherName nameType="Legal">ABT</PublisherName>
<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>
</Publisher>
</Source>
</Product>
Also, when I test the query on
http://www.freeformatter.com/xpath-tester.html
I see the correct result, i.e.
Element='<PublisherName nameType="Person">
<LastName>pppp</LastName>
<FirstName>lkkk</FirstName>
</PublisherName>'
Is this some syntax issue with Java or Tika?
EDIT
Note that if I parse without Tika, it works:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");
System.out.println(expr.evaluate(doc, XPathConstants.STRING));
This prints out
pppp
lkkk
which is perfect. So why can't Tika parse the XPath query?
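As far as its Javadoc describes, Tika's XPathParser handles only a very small XPath subset (plain steps such as /name, //name, /@name and /text()) and no predicates like [@nameType='Person'], which would likely explain the empty output. A minimal sketch of the plain JAXP route for the goal stated above, collecting the child tags of the matched element into a map, might look like this. The class name PublisherNameReader and the method childElements are hypothetical, and the XML is a shortened copy of the sample from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PublisherNameReader {

    // Returns tag name -> text for each child element of the first node matching the expression.
    static Map<String, String> childElements(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node match = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);

            Map<String, String> result = new LinkedHashMap<>();
            if (match == null) {
                return result; // nothing matched
            }
            NodeList children = match.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes between the elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    result.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
                + "<Source><Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
                + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
                + "<PublisherName nameType=\"Person\">"
                + "<LastName>pppp</LastName><FirstName>lkkk</FirstName>"
                + "</PublisherName>"
                + "</Publisher></Source></Product>";
        System.out.println(childElements(xml,
                "/Product/Source/Publisher/PublisherName[@nameType='Person']"));
    }
}
```

A LinkedHashMap is used here only so the entries keep document order; a plain HashMap would work as well if order does not matter.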

The method createDOM does not return a document

I use HtmlCleaner 2.6.1 and XPath to parse an HTML page in an Android application.
Here are the HTML pages:
http://www.kino-govno.com/comments/42571-postery-kapitan-fillips-i-poslednij-rubezh
http://www.kino-govno.com/comments/42592-fantasticheskie-idei-i-mesta-ih-obitanija
The first link returns a document and everything is fine. For the second link, this call:
document = domSerializer.createDOM(tagNode);
returns nothing.
In a simple Java project without Android, it all works fine.
Here is the code:
String queries = "//div[starts-with(@class, 'news_text op')]/p";
URL url = new URL(link2);
TagNode tagNode = new HtmlCleaner().clean(url);
CleanerProperties cleanerProperties = new CleanerProperties();
DomSerializer domSerializer = new DomSerializer(cleanerProperties);
document = domSerializer.createDOM(tagNode);
xPath = XPathFactory.newInstance().newXPath();
pageNode = (NodeList)xPath.evaluate(queries,document, XPathConstants.NODESET);
String val = pageNode.item(0).getFirstChild().getNodeValue();

That's because HtmlCleaner wraps the paragraphs of the second HTML page in another <div/>, so they are no longer direct children. Use the descendant-or-self axis // instead of the child axis /:
//div[starts-with(@class, 'news_text op')]//p
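The difference between the two axes can be checked on a small document. The sketch below uses plain JAXP and two made-up snippets that only imitate the two page shapes described here (paragraph directly under the div, and paragraph wrapped in an extra div); the class name AxisDemo and the method countMatches are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AxisDemo {

    static final String CHILD = "//div[starts-with(@class, 'news_text op')]/p";
    static final String DESCENDANT = "//div[starts-with(@class, 'news_text op')]//p";

    // Counts how many nodes the expression selects in the given markup.
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up markup: paragraph as a direct child of the div.
        String direct = "<div class=\"news_text op\"><p>text</p></div>";
        // Made-up markup: the same paragraph wrapped in an extra div.
        String wrapped = "<div class=\"news_text op\"><div><p>text</p></div></div>";

        System.out.println("direct,  child axis:      " + countMatches(direct, CHILD));
        System.out.println("direct,  descendant axis: " + countMatches(direct, DESCENDANT));
        System.out.println("wrapped, child axis:      " + countMatches(wrapped, CHILD));
        System.out.println("wrapped, descendant axis: " + countMatches(wrapped, DESCENDANT));
    }
}
```

Since an empty result makes pageNode.item(0) return null, checking pageNode.getLength() before dereferencing the first item avoids a NullPointerException when a page has an unexpected shape.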

Simplest way to parse this XML in Java?

I have the following XML:
<ConfigGroup Name="Replication">
<ValueInteger Name="ResponseTimeout">10</ValueInteger>
<ValueInteger Name="PingTimeout">2</ValueInteger>
<ValueInteger Name="ConnectionTimeout">10</ValueInteger>
<ConfigGroup Name="Pool">
<ConfigGroup Name="1">
<ValueString Encrypted="false" Name="Host">10.20.30.40</ValueString>
<ValueInteger Name="CacheReplicationPort">8899</ValueInteger>
<ValueInteger Name="RadiusPort">12050</ValueInteger>
<ValueInteger Name="OtherPort">4868</ValueInteger>
</ConfigGroup>
<ConfigGroup Name="2">
<ValueString Encrypted="false" Name="Host">10.20.30.50</ValueString>
<ValueInteger Name="CacheReplicationPort">8899</ValueInteger>
<ValueInteger Name="RadiusPort">12050</ValueInteger>
<ValueInteger Name="OtherPort">4868</ValueInteger>
</ConfigGroup>
</ConfigGroup>
</ConfigGroup>
I am just wondering what the simplest way to parse this XML in Java is. I want the values from the two Host elements (e.g. 10.20.30.40 and 10.20.30.50). Note that there may be more than two pool entries (or none at all).
I'm having trouble finding a simple example of how to use the various XML parsers for Java.
Any help is much appreciated.
Thanks!

The simplest way to find what you are looking for would be XPath.
try {
    // Load the XML file
    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    domFactory.setNamespaceAware(true);
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    Document configuration = builder.parse("configs.xml");
    // Create an XPath expression
    XPathFactory xpathFactory = XPathFactory.newInstance();
    XPath xpath = xpathFactory.newXPath();
    XPathExpression expr = xpath.compile("//ConfigGroup/ValueString[@Name='Host']/text()");
    // Execute the XPath query
    Object result = expr.evaluate(configuration, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    // Print the results
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue());
    }
} catch (ParserConfigurationException e) {
    System.out.println("Bad parser configuration");
    e.printStackTrace();
} catch (SAXException e) {
    System.out.println("SAX error loading the file.");
    e.printStackTrace();
} catch (XPathExpressionException e) {
    System.out.println("Bad XPath Expression");
    e.printStackTrace();
} catch (IOException e) {
    System.out.println("IO Error reading the file.");
    e.printStackTrace();
}
The XPath expression
"//ConfigGroup/ValueString[@Name='Host']/text()"
looks for ConfigGroup elements anywhere in your XML, then finds the ValueString child elements of those ConfigGroup elements that have a Name attribute with the value "Host". The predicate [@Name='Host'] acts as a filter on the ValueString elements, and text() at the end returns the text node of each selected element.
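To try the expression against the XML from the question without a file on disk, including the case with no pool entries at all, a small self-contained variant could look like the sketch below. The class name HostFinder and the method findHosts are hypothetical, and the XML is a shortened copy of the sample from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class HostFinder {

    static final String HOST_TEXT = "//ConfigGroup/ValueString[@Name='Host']/text()";

    // Returns the text of every Host entry; the list is empty when there are no pool entries.
    static List<String> findHosts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(HOST_TEXT, doc, XPathConstants.NODESET);
            List<String> hosts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // The expression selects text nodes, so getNodeValue() holds the text.
                hosts.add(nodes.item(i).getNodeValue());
            }
            return hosts;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<ConfigGroup Name=\"Replication\">"
                + "<ValueInteger Name=\"ResponseTimeout\">10</ValueInteger>"
                + "<ConfigGroup Name=\"Pool\">"
                + "<ConfigGroup Name=\"1\">"
                + "<ValueString Encrypted=\"false\" Name=\"Host\">10.20.30.40</ValueString>"
                + "</ConfigGroup>"
                + "<ConfigGroup Name=\"2\">"
                + "<ValueString Encrypted=\"false\" Name=\"Host\">10.20.30.50</ValueString>"
                + "</ConfigGroup>"
                + "</ConfigGroup>"
                + "</ConfigGroup>";
        for (String host : findHosts(xml)) {
            System.out.println(host);
        }
    }
}
```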

The Java XPath API makes this easy. The following XPath expression
//ValueString[@Name='Host']
should match what you want. Here is how to use it with the API:
// DocumentBuilder.parse() has no byte[] overload, so wrap the bytes in a stream
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(yourXml.getBytes(StandardCharsets.UTF_8)));
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList) xpath.compile("//ValueString[@Name='Host']").evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < nodeList.getLength(); i++) {
    String ip = ((Element) nodeList.item(i)).getTextContent();
    // do something with your ip
}

You could use Saxon:
String vs_source = "Z:/Code_JavaDOCX/1.xml";
Processor proc = new Processor(false);
net.sf.saxon.s9api.DocumentBuilder builder = proc.newDocumentBuilder();
XPathCompiler xpc = proc.newXPathCompiler();
try {
    XPathSelector selector = xpc.compile("//output").load();
    selector.setContextItem(builder.build(new File(vs_source)));
    for (XdmItem item : selector) {
        System.out.println(item.getStringValue());
    }
} catch (Exception e) {
    e.printStackTrace();
}

XPath expressions not working in Java

I am trying to access a URL, get the HTML from it, and use XPath expressions to get certain values from it. I am getting the HTML just fine and JTidy seems to be cleaning it appropriately. However, when I try to get the desired values using XPath, I get an empty NodeList back. I know my XPath expression is correct; I have tested it in other ways. What's wrong with this code? Thanks for the help.
String url_string = base_url + countries[c];
URL url = new URL(url_string);
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document doc = tidy.parseDOM(url.openStream(), null);
//tidy.pprint(doc, System.out);
String xpath_string = "id('catlisting')//a";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpath_string);
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
System.out.println("size=" + nodes.getLength());
for (int r = 0; r < nodes.getLength(); r++) {
    System.out.println(nodes.item(r).getNodeValue());
}

Try "//div[@id='catlisting']//a"
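Some background on why this can help: the XPath id() function only finds elements whose attribute is declared as type ID (normally through a DTD), and a DOM built from tidied HTML usually carries no such declaration, so id('catlisting') tends to select nothing, while an attribute test needs no declaration. Note also that getNodeValue() returns null for element nodes, so the link text has to be read another way. A minimal sketch with plain JAXP follows; the class name IdLookupDemo, the method linkTexts, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class IdLookupDemo {

    // Returns the text content of every node selected by the expression.
    static List<String> linkTexts(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() is null for elements, so read the text content instead.
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup with no DTD, so no attribute is typed as ID.
        String sample =
                "<html><body><div id=\"catlisting\">"
                + "<a href=\"/a\">Alpha</a><a href=\"/b\">Beta</a>"
                + "</div></body></html>";

        System.out.println("id():      " + linkTexts(sample, "id('catlisting')//a"));
        System.out.println("attribute: " + linkTexts(sample, "//div[@id='catlisting']//a"));
    }
}
```

If the element carrying the id is not guaranteed to be a div, //*[@id='catlisting']//a is a slightly more general form of the same test.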
