Library to query HTML with XPath in Java?

Can anyone recommend a Java library that lets me run XPath queries against HTML fetched from a URL?
I've tried JAXP without success.
Thank you.

There are several different approaches to this documented on the Web:
Using HtmlCleaner
HtmlCleaner / Java DOM parser - Using XPath Contains against HTML in Java (This is the way I recommend)
HtmlCleaner itself has a built in utility supporting XPath - See the javadocs http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/XPather.html or this example http://thinkandroid.wordpress.com/2010/01/05/using-xpath-and-html-cleaner-to-parse-html-xml/
Using Jericho
Jericho and Jaxen
http://sujitpal.blogspot.com/2009/04/xpath-over-html-using-jericho-and-jaxen.html
I have tried a few different variations of these approaches (e.g. HtmlParser plus the Java DOM parser, and JSoup plus Jaxen), but the combination that worked best was HtmlCleaner plus the Java DOM parser. The next best combination was Jericho plus Jaxen.
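For reference, a minimal sketch of that HtmlCleaner + Java DOM parser combination (the URL and the //a/@href expression are placeholders; DomSerializer turns HtmlCleaner's cleaned tree into a standard org.w3c.dom.Document that the JDK's XPath engine can query):
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class HtmlCleanerXPathExample {
    public static void main(String[] args) throws Exception {
        // Clean the page into a well-formed tree (the URL is a placeholder)
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode tagNode = cleaner.clean(new URL("http://example.com"));

        // Convert the cleaned tree into a standard org.w3c.dom.Document
        Document doc = new DomSerializer(cleaner.getProperties()).createDOM(tagNode);

        // Query it with the JDK's built-in XPath engine
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}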

jsoup, a Java HTML parser with a selector syntax very similar to jQuery's.

You could use TagSoup together with Saxon. That way you simply replace any XML SAX parser used with TagSoup and the XPath 2.0 or XSLT 2.0 or XQuery 1.0 implementation works as usual.
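A sketch of that setup using Saxon's s9api (the URL is a placeholder; note that TagSoup puts elements into the XHTML namespace by default, so the namespace has to be declared for the XPath expression):
import java.net.URL;
import javax.xml.transform.sax.SAXSource;
import net.sf.saxon.s9api.DocumentBuilder;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XPathCompiler;
import net.sf.saxon.s9api.XdmItem;
import net.sf.saxon.s9api.XdmNode;
import net.sf.saxon.s9api.XdmValue;
import org.xml.sax.InputSource;

public class TagSoupSaxonExample {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false); // Saxon-HE
        DocumentBuilder builder = processor.newDocumentBuilder();

        // TagSoup is a SAX XMLReader, so it can be dropped in as the parser
        org.ccil.cowan.tagsoup.Parser tagSoup = new org.ccil.cowan.tagsoup.Parser();
        InputSource html = new InputSource(new URL("http://example.com").openStream()); // placeholder URL
        XdmNode doc = builder.build(new SAXSource(tagSoup, html));

        // XPath 2.0 query via Saxon; TagSoup emits elements in the XHTML namespace
        XPathCompiler xpath = processor.newXPathCompiler();
        xpath.declareNamespace("h", "http://www.w3.org/1999/xhtml");
        XdmValue hrefs = xpath.evaluate("//h:a/@href", doc);
        for (XdmItem item : hrefs) {
            System.out.println(item.getStringValue());
        }
    }
}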

Use Xsoup. According to its docs, it's faster than HtmlCleaner. Example:
@Test
public void testSelect() {
    String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
            "<table><tr><td>a</td><td>b</td></tr></table></html>";
    Document document = Jsoup.parse(html);

    String result = Xsoup.compile("//a/@href").evaluate(document).get();
    Assert.assertEquals("https://github.com", result);

    List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
    Assert.assertEquals("a", list.get(0));
    Assert.assertEquals("b", list.get(1));
}
Link to Xsoup - https://github.com/code4craft/xsoup

I've used JTidy to make HTML into a proper DOM, then used plain XPath to query the DOM.
If you want to do cross-document/cross-URL queries, it's better to use JTidy together with XQuery.
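Roughly like this - a minimal sketch, assuming JTidy and the JDK's built-in XPath engine; the URL and the expressions are placeholders:
import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyXPathExample {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL("http://example.com").openStream(); // placeholder URL
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document dom = tidy.parseDOM(in, null); // second argument: optional OutputStream for the cleaned markup

        XPath xpath = XPathFactory.newInstance().newXPath();
        // String form of evaluate: returns the string value of the first match
        String title = xpath.evaluate("//title", dom);
        System.out.println(title);

        // Node-set form: iterate over every href on the page
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", dom, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}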

Related

Java Html parser to extract specific data?

I have an HTML file like the following:
...
<span itemprop="A">234</span>
...
<span itemprop="B">690</span>
...
From this I want to extract the values of A and B (234 and 690).
Can you suggest an HTML parser library for Java that can do this easily?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
    System.out.println(span.text()); // will print 234 and 690
}
http://jsoup.org/
JSoup is the way to go.
JTidy is a confusingly named yet respected HTML parser.

Regular expression for getting HREF based on span tag [duplicate]

I have a requirement where I need to get the last HREF in the HTML code, i.e. the HREF in the footer of the page.
Is there a direct regular expression for this?
No regex, use the :last jQuery selector instead.
Demo (given markup with two links, "foo" and "bar"):
var link = $("a:last");
You could use plain JavaScript for this (if you don't need it to be a jQuery object):
var links = document.links;
var lastLink = links[links.length - 1];
var lastHref = lastLink.href;
alert(lastHref);
Disclaimer: the above code only works using JavaScript; as HTML itself has no regex, or DOM manipulation, capacity. If you need to use a different technology please leave a comment or edit your question to include the relevant tags.
It's not a good idea to parse HTML with regular expressions. Have a look at HtmlParser to parse the HTML instead.

How to use Freemarker to convert a XML Word document to a DOC?

I'm trying to use Freemarker to convert an XML Word document to a standard DOC. For example:
I generate a Word document (A.doc) and then save it as an XML Word document (A.xml).
With FreeMarker, I import A.xml and export it as 2003 Word (B.doc).
In POI, I import the converted DOC (B.doc). (POI can't read XML docs.)
The problem is that the converted document isn't really a DOC, it's still an XML document, so POI fails to open it.
How can I use FreeMarker to generate a real DOC, not an XML Word document?
I'm using Linux.
Your approach probably won't work because FreeMarker is designed for generating text output. Classic Word DOC files are not very "textual", so I think FreeMarker is not the right tool for your task.
(Side note: RTF, however, might work, since it is a text-based format.)

Could the value of an html anchor tag be fetched using xpath?

If I have HTML that looks like:
<td class="blah">&nbs;???? </td>
Could I get the ???? value using xpath?
What would it look like?
To use XPath you usually need XML, not HTML, but some parsers (e.g. the one built into PHP) have a relaxed mode which will parse most HTML too.
If you want to find all <a> elements that are direct children of <td class="blah">, the XPath you need is
//td[@class = 'blah']/a
or
//td[@class = 'blah']/a[@href = 'http://...']
(depending on whether you want only the one URL or all of them).
This will give you a set of nodes. You'll need to iterate through it, then check the nodeType of the firstChild (it should be a text node) and the number of child nodes (it should be 1). The firstChild will then contain the ????.
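In Java, that iteration looks roughly like the sketch below. It's a minimal, hedged example: it assumes you already have an org.w3c.dom.Document from one of the HTML-cleaning parsers discussed on this page and uses the JDK's javax.xml.xpath API:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

Document doc = ...; // obtained from one of the HTML-to-DOM parsers discussed on this page
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList anchors = (NodeList) xpath.evaluate("//td[@class = 'blah']/a", doc, XPathConstants.NODESET);
for (int i = 0; i < anchors.getLength(); i++) {
    Node a = anchors.item(i);
    Node first = a.getFirstChild();
    // a single child that is a text node -> that text is the ???? value
    if (a.getChildNodes().getLength() == 1 && first != null && first.getNodeType() == Node.TEXT_NODE) {
        System.out.println(first.getNodeValue());
    }
}
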
Why would you use an XML parser to parse HTML?
I would suggest using a dedicated Java HTML parser; there are many, but I haven't tried any myself.
As for your question of whether it would work: I suspect it will not. You will get a parse error right at &nbs;, if not earlier.

Possible to parse a HTML document and build a DOM tree(java)

Is it possible, and what tools could be used, to parse an HTML document (as a string or from a file) and then construct a DOM tree that a developer can walk through some API?
For example:
DomRoot = parse("myhtml.html");
for (tags : DomRoot) {
}
Note: this is an HTML document, not XHTML.
You can use TagSoup - it is a SAX-compliant parser that can clean malformed content such as HTML from generic web pages into well-formed XML. For example,
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
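If you want the result as a DOM tree rather than SAX events, one common approach (a sketch, assuming TagSoup is on the classpath; the file name myhtml.html is taken from the question) is to run TagSoup's output through the JDK's identity transformer into a DOMResult:
import java.io.FileReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TagSoupToDom {
    public static void main(String[] args) throws Exception {
        // Parse the messy HTML with TagSoup and copy the SAX events into a DOM tree
        Parser tagSoup = new Parser();
        SAXSource source = new SAXSource(tagSoup, new InputSource(new FileReader("myhtml.html")));
        DOMResult result = new DOMResult();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(source, result);

        Document dom = (Document) result.getNode();
        System.out.println(dom.getDocumentElement().getTagName()); // typically "html"
    }
}
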
JTidy should let you do what you want.
Usage is fairly straightforward, but parsing is configurable. For example:
InputStream in = ...;
Tidy tidy = new Tidy();
// configure the Tidy instance as required
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();
The JavaDoc is hosted here.
You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or non-valid XML) file.
It is distributed under the Apache 2.0 license.
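A minimal sketch of that (assuming the NekoHTML and Xerces jars are on the classpath; the file name is a placeholder):
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;

public class NekoHtmlExample {
    public static void main(String[] args) throws Exception {
        // NekoHTML's DOMParser cleans and balances tags while building the DOM
        DOMParser parser = new DOMParser();
        parser.parse("myhtml.html"); // accepts a URI/path or an InputSource
        Document dom = parser.getDocument();
        System.out.println(dom.getDocumentElement().getNodeName()); // prints the root element name
    }
}
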
HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
There are several open source tools to parse HTML from Java.
Check http://java-source.net/open-source/html-parsers
You can also check the answers to this question: Reading HTML file to DOM tree using Java. It is almost the same question.
