Text extraction with Java HTML parsers - java

I want to use an HTML parser that does the following in a nice, elegant way:
Extract text (this is most important)
Extract links, meta keywords
Reconstruct original doc (optional but nice feature to have)
From my investigation so far, Jericho seems to fit. Any other open source libraries you would recommend?

I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.
I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements, and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
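For example, extracting the text and the links via CyberNeko's DOM parser could look roughly like this (an untested sketch; the file name is hypothetical, and NekoHTML upper-cases element names by default):

import java.io.FileReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoExtract {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new FileReader("page.html"))); // hypothetical input file
        Document doc = parser.getDocument();

        // Whole-document text (DOM Level 3 getTextContent)
        System.out.println(doc.getDocumentElement().getTextContent());

        // All links with their href attributes ("A" because NekoHTML upper-cases tag names by default)
        NodeList links = doc.getElementsByTagName("A");
        for (int i = 0; i < links.getLength(); i++) {
            Element a = (Element) links.item(i);
            System.out.println(a.getAttribute("href") + " -> " + a.getTextContent().trim());
        }
    }
}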
There's a list of open source Java HTML parsers here:
http://java-source.net/open-source/html-parsers

I would definitely go for JSoup.
Very elegant library and does exactly what you need.
See Example Here
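All three requirements from the question map onto a few JSoup calls; here is a rough sketch (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get(); // or Jsoup.parse(htmlString)

        // 1. Extract text
        System.out.println(doc.body().text());

        // 2. Extract links and meta keywords
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " : " + link.text());
        }
        System.out.println(doc.select("meta[name=keywords]").attr("content"));

        // 3. Reconstruct the (normalized) document
        System.out.println(doc.outerHtml());
    }
}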

I ended up using HtmlCleaner http://htmlcleaner.sourceforge.net/ for something similar. It's really easy to use and was quick for what I needed.

Related

Lookup nodes with XPath in Java

Can anybody suggest how I can solve this? There is an XML document that I'm trying to append to. I'm looking up nodes with XPath; the thing is, the software which generates this XML sometimes screws it up as follows:
<element name="Element ">Element value</element>
So when I look the node up with XPath using //element[@name="Element"] (without the trailing blank space) I don't get a match. Naturally, I do get a match with //element[@name="Element "].
Is there something I can do to match this without the blank space? Does XPath accept regular expressions, or is there a smarter way to do this? I can also change the XML file after it has been generated by the software (with the faulty user input).
Would
//element[normalize-space(@name)='Element']
work? (Untested.)
A more general (though less accurate) solution could be:
//element[contains(@name,"Element")]
However that would also catch things like name="Elements" and name="Elementary" etc.
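If it helps, here is a rough sketch of evaluating the normalize-space() variant from Java with the standard javax.xml.xpath API (the file name is hypothetical):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathTrimExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("input.xml"); // hypothetical file

        XPath xpath = XPathFactory.newInstance().newXPath();
        // normalize-space() strips leading/trailing whitespace from the attribute value
        NodeList nodes = (NodeList) xpath.evaluate(
                "//element[normalize-space(@name)='Element']",
                doc, XPathConstants.NODESET);

        System.out.println("Matches: " + nodes.getLength());
    }
}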
Can you recommend some useful resources for XPath?
Unfortunately there aren't many great resources, in my experience - w3schools is probably your best bet, or the spec itself for more advanced stuff.

Interpret a rule applying multiple xpath queries on multiple XML documents

I need to build a component which would take a few XML documents in input and check the following kind of rules:
XML1:/bookstore/book[price>35.00] != null
and (XML2:/city/name = 'Montreal'
or XML3://customer[@language] contains 'en')
Basically my component should be able to:
substitute the XML tokens with the corresponding XML document (the part before the colon)
apply the XPath query to this XML document
check the xpath output against expected result ("=", "!=", "contains")
follow the basic syntax ("and", "or" and parentheses)
tell if the rule is true or false
Do you know of any library which could help me? Maybe JavaCC?
Thanks
For evaluating XPaths I recommend Jaxen.
Jaxen is an open source XPath library written in Java. It is adaptable to many different object models, including DOM, XOM, dom4j, and JDOM. It is also possible to write adapters that treat non-XML trees such as compiled Java byte code or Java beans as XML, thus enabling you to query these trees with XPath too.
The Java XPath API (Java 5 / javax.xml.xpath) is also an option, but I haven't tried it yet.
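As a rough illustration (not a complete rule engine), a single atomic check like XML2:/city/name = 'Montreal' could be evaluated with the standard javax.xml.xpath API. The document map and method below are hypothetical, and the boolean glue ("and", "or", parentheses) would still need a small grammar, which is where JavaCC or Schematron comes in:

import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class RuleCheck {
    private final Map<String, Document> docs; // e.g. "XML2" -> the parsed document (hypothetical)

    public RuleCheck(Map<String, Document> docs) {
        this.docs = docs;
    }

    // Evaluates one atomic check, e.g. check("XML2", "/city/name", "=", "Montreal")
    public boolean check(String token, String xpathExpr, String op, String expected) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String actual = xpath.evaluate(xpathExpr, docs.get(token)); // string value of the selection

        switch (op) {
            case "=":        return actual.equals(expected);
            case "!=":       return !actual.equals(expected);
            case "contains": return actual.contains(expected);
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }
}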
Somebody on the JavaCC mailing list pointed me in the right direction, mentioning Schematron. It led me to Probatron, which seems to be the best Java implementation available.
The Schematron web site claims that the language supports "jump across links and between XML documents to check constraints", but it seems Probatron doesn't allow that. I may have to tweak it or find a trick for that (like building a temporary XML document containing all my source documents). Apart from that, it looks like Probatron is the right library for me.

Java library for HTML analysis

(I've seen similar questions, but I think none of them cater to my specific needs, hence...)
I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:
figuring out the most prominent color in an HTML chunk
changing that color to some other color (hence, has to support modification of the HTML as well)
pruning out unwanted tags
fixing up the HTML to result in a well-formed HTML snippet
Parts of the last two are done by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.
Thanks in advance!
You might want to check out TagSoup:
http://home.ccil.org/~cowan/XML/tagsoup/
Well, I would tidy it first into valid XML, then use XSLT to do a conditional deep copy, doing the most-prominent-color/pruning/whatever processing you need along the way.
Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.
You'll need something else for the colour changing stuff.
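A minimal JTidy sketch (file names are hypothetical): it repairs the markup, writes well-formed XHTML, and hands back a DOM you could then post-process for the pruning and colour work:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyCleanup {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);        // emit well-formed XHTML
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // Repair messy.html, write the cleaned markup, and keep a DOM for further processing
        Document doc = tidy.parseDOM(new FileInputStream("messy.html"),
                                     new FileOutputStream("clean.xhtml"));
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}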
Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
    // Read in an HTML file from disk
    // Retrieve all INPUT elements regardless of whether the HTML is well-formed
    // Loop through all elements and retrieve their ids if they exist for the element
}
HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for the getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.
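A minimal sketch of the INPUT/id case with HtmlCleaner (the file name is hypothetical):

import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class InputIdExtractor {
    public static void main(String[] args) throws Exception {
        // Parse the (possibly malformed) HTML file from disk
        TagNode root = new HtmlCleaner().clean(new File("page.html"));

        // Retrieve all INPUT elements, searching the whole tree recursively
        for (TagNode input : root.getElementsByName("input", true)) {
            String id = input.getAttributeByName("id"); // null if the element has no id
            if (id != null) {
                System.out.println(id);
            }
        }
    }
}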
I've had success using TagSoup. Here's a short description from its home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
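Because TagSoup exposes a plain SAX XMLReader, the same task looks roughly like this (an untested sketch; the file name is hypothetical):

import java.io.FileReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupInputs {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser(); // TagSoup's SAX-compliant parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("input".equalsIgnoreCase(localName)) {
                    System.out.println("input id=" + atts.getValue("id"));
                }
            }
        });
        reader.parse(new InputSource(new FileReader("page.html"))); // hypothetical file
    }
}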
Check out JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

Error-tolerant XML parsing in Scala

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first, and when you then see a </child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
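For what it's worth, here is a small Java sketch of that conversion (HtmlCleaner is a Java library, so it is callable from Scala as-is; the messy input string is just an example):

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

public class ToValidXml {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();

        TagNode root = cleaner.clean("<p>unclosed <b>tags");     // messy "HTML from the wild"
        String xml = new PrettyXmlSerializer(props).getAsString(root);

        // 'xml' is now well-formed and could be fed to scala.xml.XML.loadString
        System.out.println(xml);
    }
}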
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
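For reference, the StAX pull-parsing loop itself is plain Java (Woodstox just needs to be on the classpath). It is still strict about well-formedness, but being event-based you at least keep everything read before the first fatal error (untested sketch, hypothetical file name):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxLinkScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream("page.xml"));

        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "a".equals(reader.getLocalName())) {
                    System.out.println(reader.getAttributeValue(null, "href"));
                }
            }
        } catch (XMLStreamException e) {
            // The parser fails at the first well-formedness error; everything before it
            // has already been delivered as events.
            System.err.println("Stopped at: " + e.getMessage());
        }
    }
}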
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
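A naive sketch of that text search (it deliberately ignores the commented-out-link problem):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefGrep {
    // href attribute in single or double quotes, matched case-insensitively
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String html = "<p>see <a href='http://example.com'>this</a> and <A HREF=\"/local\">that</A></p>";
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}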
Caucho has a JAXP-compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find the JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
