web crawling from BBC website using XPath - java

This is an example of a typical webpage I am trying to crawl
http://www.bbc.com/news/business-31013604
If you inspect the elements of the webpage, you can see that the main article is under
<div class="story-body">
However, when I try to get the main content using
MongoClient mongoClient = new MongoClient("127.0.0.1", 27017);
DB db = mongoClient.getDB("nutch");
DBCollection coll = db.getCollection("crawl_data");
BasicDBObject bo = new BasicDBObject("url", url).append("fetch_time", new Date());
bo.append("article_text", getXPathValue(doc,"//DIV[@class='story-body']"));
I am not able to get the article content. In the database, it shows null in that field.
I have successfully crawled some pages from Reuters, so the function getXPathValue should be correct.
I fetch the pages with plain HTTP requests; I don't know if that is the issue here.

The problem is that you are crawling an XHTML page (or at least a document in the XHTML namespace). The most significant difference between HTML and XHTML is that XHTML documents have a default namespace:
<root xmlns="www.example-of-default-namespace.com"/>
An XPath expression that does not take into account namespaces, for example
//root
will never find this element, because it's in a namespace.
The same happens with your XHTML document. There are two ways to solve this problem.
Register the XHTML namespace
The first, and more appropriate, solution is to register or declare the XHTML namespace in your code, and then use a prefix in your XPath expression. Since you do not show the implementation of getXPathValue, I can only sketch the approach.
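Assuming getXPathValue is built on the standard javax.xml.xpath API (an assumption on my part; the question does not show its implementation), a minimal, self-contained sketch of the namespace registration looks like this:

import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlXPathExample {
    public static void main(String[] args) throws Exception {
        // A tiny XHTML document standing in for the fetched BBC page.
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<div class='story-body'>Article text</div></body></html>";

        // Namespace awareness must be switched on explicitly, or the
        // parser will ignore the default namespace entirely.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "xhtml" to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix)
                        ? "http://www.w3.org/1999/xhtml" : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String namespaceURI) { return null; }
            public Iterator<String> getPrefixes(String namespaceURI) { return null; }
        });

        String text = (String) xpath.evaluate(
                "//xhtml:div[@class='story-body']", doc, XPathConstants.STRING);
        System.out.println(text); // prints "Article text"
    }
}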
Ignore namespaces
Secondly, you can ignore any namespaces by modifying your XPath expression to
//*[local-name() = 'div' and @class='story-body']
Here * is a wildcard for any element, in any (or no) namespace, and local-name() returns the local part of an element or attribute name. In XML, there are qualified names that look like:
prefix:root
The first part of this qualified name is the prefix, and the second part is the local name of this element. So, the result of local-name(prefix:root) is root.
Also note that I have lowercased "div": HTML may be case-insensitive, but XHTML (being XML) is not, and neither is XPath.
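Reusing the xpath and doc objects from the sketch above, the namespace-agnostic variant needs no registration at all:

// No NamespaceContext needed: the wildcard matches elements in any namespace.
String text = (String) xpath.evaluate(
        "//*[local-name() = 'div' and @class='story-body']",
        doc, XPathConstants.STRING);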

Related

how to use text() in jxpath

Can you get the text() of a jxpath element or does it not work?
given some nice xml:
<?xml version="1.0" encoding="UTF-8"?>
<AXISWeb xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="AXISWeb.xsd">
<Action>
  <Transaction>PingPOS</Transaction>
  <PingPOS>
    <PingStep>To POS</PingStep>
    <PingDate>2012-11-15</PingDate>
    <PingTime>16:35:57</PingTime>
  </PingPOS>
  <PingPOS>
    <PingStep>POS.PROCESSOR18</PingStep>
    <PingDate>2012-11-15</PingDate>
    <PingTime>16:35:57</PingTime>
  </PingPOS>
  <PingPOS>
    <PingStep>From POS</PingStep>
    <PingDate>2012-11-15</PingDate>
    <PingTime>16:35:57</PingTime>
  </PingPOS>
</Action>
</AXISWeb>
//Does not work:
jxpc.getValue("/AXISWeb/Action/PingPOS[1]/PingStep/text()");
//Does not work:
jxpc.getValue("/action/pingPOS[1]/PingStep/text()");
//Does not work:
jxpc.getValue("/action/pingPOS[1]/PingStep[text()]");
I know I can get the text by using
jxpc.getValue("/action/pingPOS[1]/PingStep");
But that's not the point.
Shouldn't text() work? I could find no examples....
P.S. It's also very, very picky about case and capitalization. Can you turn that off somehow?
/AXISWeb/Action/PingPOS[1]/PingStep/text() is valid XPath for your document
But from what I can see in the JXPath user guide (note: I don't know JXPath at all), getValue() is already supposed to return the textual content of a node, so you don't need to use XPath's text() at all.
So you may use the following:
jxpc.getValue("/AXISWeb/Action/PingPOS[1]/PingStep");
Extracted from the user guide:
Consider the following XML document:
<?xml version="1.0" ?>
<address>
<street>Orchard Road</street>
</address>
With the XPath getValue("/address/street"), you will get the string "Orchard Road", while
selectSingleNode("/address/street") returns an object of type Element (DOM
or JDOM, depending on the type of parser used). The returned Element
is, of course, <street>Orchard Road</street>.
Now, about case-insensitive queries on tag names: if you are using XPath 2 you can combine lower-case() with local-name() (not node(), which selects content rather than names), but this is not really recommended; you are better off using the correct names.
/*[lower-case(local-name())='axisweb']/*[lower-case(local-name())='action']/...
Or, if using XPath 1, you may use translate(), but it gets even worse:
/*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz') = 'axisweb']/*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz') = 'action']/...
All in all, make sure you use the correct query; you know it is case-sensitive, so it's better to pay attention to it. Just as in Java, foo and fOo are not the same variable.
Edit:
As I said, XML, and thus XPath, is case-sensitive, so pingStep cannot match PingStep; use the correct name to find it.
Concerning text(), it is part of XPath 1.0; there is no need for XPath 2 to use it. JXPath's getValue() already makes the call to text() for you. If you want to do it yourself, you will have to use selectSingleNode("//whatever/text()"), which returns an Object whose concrete text-node type depends on the underlying parser.
So to sum up, the method JXPathContext.getValue() already does the work to select the node's text content for you, so you don't need to do it yourself and explicitly call XPath's text().
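For reference, here is a minimal sketch of the whole round trip, assuming Commons JXPath wrapping a plain DOM parse of the document above (the question never shows how jxpc is set up, so this is my best guess at it):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.commons.jxpath.JXPathContext;
import org.w3c.dom.Document;

public class JXPathExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("axisweb.xml")); // the document shown above

        JXPathContext jxpc = JXPathContext.newContext(doc);

        // getValue() already returns the text content of the selected node,
        // so no explicit text() step is needed. Case must match exactly.
        String step = (String) jxpc.getValue("/AXISWeb/Action/PingPOS[1]/PingStep");
        System.out.println(step); // prints "To POS"
    }
}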
From a post that I've answered before: the method .getTextContent() does the job for you.
There is no need to use text() when you evaluate the XPath.
Example :
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new File("D:\\Loic_Workspace\\Test2\\res\\test.xml"));
System.out.println(doc.getElementsByTagName("retCode").item(0).getTextContent());
Otherwise, you will get the tag along with the value. If you want to do more, take a look at this.

XML parsing in java : ignore tags as value

I am having some trouble parsing an XML file.
The Problem:
<verification appearance="4">
content="<myTag>test<myTag>/images/titleIcon.png"
</verification>
For parsing I used the following:
DocumentBuilder db;
db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
this.doc = db.parse(xmlFile); // xmlFile: the input File (argument omitted in the original)
If I access the content with [...]getChildNodes().item(1).getTextContent(),
it returns the value without the tags.
I assume the problem has something to do with db.parse(). More specifically, that it parses <myTag> as a node or something like that.
How can I get the full text content as a String (including the tags, etc.)?
Is there a way to tell the parser (if that's the problem) to ignore all content between two tags?
I have already googled a lot, but solutions like using &lt; for < aren't what I'm looking for.
For this to work, the XML would have to look like this:
<verification appearance="4">
<![CDATA[
content="<myTag>test<myTag>/images/titleIcon.png"
]]>
</verification>
Then the parser will work as you want it to work.
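A short sketch of what that buys you: with the CDATA section in place, the standard DOM parser returns the literal text, tags included:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataExample {
    public static void main(String[] args) throws Exception {
        String xml = "<verification appearance=\"4\"><![CDATA["
                + "content=\"<myTag>test<myTag>/images/titleIcon.png\""
                + "]]></verification>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Inside CDATA, "<myTag>" is plain character data, not markup,
        // so getTextContent() preserves it verbatim.
        System.out.println(doc.getDocumentElement().getTextContent());
        // prints: content="<myTag>test<myTag>/images/titleIcon.png"
    }
}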

In XSLT, how do I get the filepath of the xml file of a certain element if that xml file was included with xinclude?

I have these XML files:
master.xml (which uses XInclude to include child1.xml and child2.xml)
child1.xml
child2.xml
Both child1.xml and child2.xml contain a <section> element with some text.
In the XSLT transformation, I'd like to add the name of the file each <section> element came from, so I get something like:
<section srcFile="child1.xml">Text from child 1.</section>
<section srcFile="child2.xml">Text from child 2.</section>
How do I retrieve the values child1.xml and child2.xml?
Unless you turn off that feature, all XInclude processors should add an @xml:base attribute
with the URL of the included file. So you don't have to do anything; it should already be:
<section xml:base="child1.xml">Text from child 1.</section>
<section xml:base="child2.xml">Text from child 2.</section>
(If you want, you can use XSLT to transform the @xml:base attribute into @srcFile.)
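In Java, for example, you can ask the stock parser to do the XInclude processing and then read the attribute back. This is only a sketch; whether base-URI fixup actually happens depends on the underlying processor, though it is the Xerces default:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XIncludeBaseExample {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for XInclude
        dbf.setXIncludeAware(true);  // resolve xi:include during the parse
        Document doc = dbf.newDocumentBuilder().parse("master.xml");

        NodeList sections = doc.getElementsByTagName("section");
        for (int i = 0; i < sections.getLength(); i++) {
            Element section = (Element) sections.item(i);
            // With base-URI fixup, each included element carries an
            // xml:base attribute pointing at the file it came from.
            System.out.println(section.getAttribute("xml:base"));
        }
    }
}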
I'm 99% sure that once xi:include has been processed, you have a single document (and single infoset) that won't let you determine which URL any given part of the document came from.
I think you will need to place that information directly in the individual included files. Having said that, you can still give document-uri() a try, but I think all nodes will return the same URI.

Using Both Tagged And Untagged Data With XPath

I'm trying to parse some HTML using XPath in Java. Consider this HTML:
<td class="postbody">
<img src="..."><br />
<br />
<b>What is Blah?</b><br />
<br />
Blah blah blah
<br />
Note that "What Is Blah" is helpfully contained within a b tag and is therefore easily parseable. But "Blah blah blah" is out in the open, and so I can only pick it up by calling text() on its parent node.
Thing is, I need to go through this in sequence, putting the img down, then the bolded text, then the body text. It's important it ends up in order (it needn't be processed in order, if you can suggest a way that takes two passes).
So are there any suggestions for how, if I've got the above contained within a Java XPath node, I can go through it in turn and get what I need?
I think a SAX-based parser would be a better tool for this problem: it is event-based, so you can process the document in order.
But it's an XML parser, so you'll need a valid XML document. I have never used JTidy, but it's a Java port of HTML Tidy, so hopefully it can help you transform your (invalid) HTML documents into valid XML.
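If you go down that road, the JTidy step is only a few lines. A sketch, assuming the JTidy jar (org.w3c.tidy) on the classpath; the URL is a placeholder:

import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL("http://example.com/page.html").openStream()) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);          // suppress progress messages
            tidy.setShowWarnings(false);  // real-world HTML triggers many warnings
            // parseDOM repairs the HTML and hands back a well-formed DOM tree
            // that you can then query with XPath or walk in document order.
            Document doc = tidy.parseDOM(in, null);
            System.out.println(doc.getDocumentElement().getNodeName()); // "html"
        }
    }
}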
Use this XPath expression evaluated with the parent of the provided XML fragment as the context node:
node()
This selects every node that is a child of the context node: every element child, every text-node child, every comment child, and every PI (processing-instruction) child.
In case you want to exclude comments and PIs, use:
node()[not(self::comment() or self::processing-instruction())]
In case that, in addition to this, you don't want to select whitespace-only text nodes, use:
node()
[not(self::comment() or self::processing-instruction())]
[not(self::text()[not(normalize-space())])]
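Putting it together in Java (a sketch; the fragment is wrapped in well-formed XML, as you would get after a Tidy pass):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MixedContentExample {
    public static void main(String[] args) throws Exception {
        String xml = "<td class='postbody'><img src='icon.png'/><br/>"
                + "<b>What is Blah?</b><br/>Blah blah blah<br/></td>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // node() evaluated against the td element returns all its children,
        // elements and text alike, in document order.
        NodeList children = (NodeList) xpath.evaluate(
                "node()[not(self::comment() or self::processing-instruction())]"
                        + "[not(self::text()[not(normalize-space())])]",
                doc.getDocumentElement(), XPathConstants.NODESET);

        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                System.out.println("text:    " + child.getTextContent().trim());
            } else {
                System.out.println("element: <" + child.getNodeName() + ">");
            }
        }
    }
}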

Read in html table to java

I need to pull data from an html page using Java code. The java part is required.
The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html.
I need to create a list of hashmaps, or some kind of data object that I can reference in later code.
This is all I have so far:
URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
InputStream is = weatherDataKC.openStream();
int cnt = 0;
StringBuffer buffer = new StringBuffer();
while ((cnt = is.read()) != -1) {
    buffer.append((char) cnt);
}
System.out.print(buffer.toString());
Any suggestions where to start?
There is a nice HTML parser called Neko:
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
More information here.
Use an HTML parser like CyberNeko
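A sketch of the Neko/CyberNeko route, assuming the nekohtml and Xerces jars on the classpath:

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        // Neko balances tags and fixes common HTML mistakes during the parse.
        parser.parse(new InputSource("http://www.weather.gov/data/obhistory/KMCI.html"));
        Document doc = parser.getDocument();
        System.out.println(doc.getElementsByTagName("table").getLength() + " tables found");
    }
}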
J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
Beware: there are some bugs, and it won't handle bad HTML very well.
Dealing with colspan and rowspan is your business.
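Here is what that looks like in practice, as a rough sketch: collecting the cell text of every table row on the page (colspan/rowspan handling deliberately omitted):

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TableScraper {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
        final List<List<String>> rows = new ArrayList<List<String>>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> currentRow;
            private boolean inCell;

            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TR) {
                    currentRow = new ArrayList<String>();
                } else if (t == HTML.Tag.TD || t == HTML.Tag.TH) {
                    inCell = true;
                }
            }

            public void handleText(char[] data, int pos) {
                if (inCell && currentRow != null) {
                    currentRow.add(new String(data)); // one cell's text
                }
            }

            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TR && currentRow != null) {
                    rows.add(currentRow); // row finished
                    currentRow = null;
                } else if (t == HTML.Tag.TD || t == HTML.Tag.TH) {
                    inCell = false;
                }
            }
        };

        Reader reader = new InputStreamReader(url.openStream());
        try {
            new ParserDelegator().parse(reader, callback, true);
        } finally {
            reader.close();
        }
        System.out.println(rows.size() + " rows parsed");
    }
}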
HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:
<table cellspacing="3" cellpadding="2" border="0" width="670">
...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
