How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because with HtmlUnit it takes a lot of time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type"
It would be OK for me if it doesn't clean up dirty HTML. I don't need to clean any HTML source; I just need the easiest way to move across HtmlElements and harvest data from them.

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");          // all <a> elements (none in this snippet)
Element head = doc.select("head").first(); // first <head> element
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
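Since the original question asks about locating elements by id, name, or tag type, here is a minimal sketch of those lookups with jsoup (the HTML and element names are invented for illustration):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookups {
    public static void main(String[] args) {
        String html = "<div id=\"nav\"><input name=\"q\"><p>Hello</p></div>";
        Document doc = Jsoup.parse(html);
        Element byId = doc.getElementById("nav");      // locate by id
        Elements byName = doc.select("input[name=q]"); // locate by name attribute
        Elements byTag = doc.getElementsByTag("p");    // locate by tag type
        System.out.println(byId.tagName() + " " + byName.size() + " " + byTag.first().text());
    }
}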

The best I've seen so far is HtmlCleaner:
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
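For illustration, a minimal sketch of such an XPath lookup might look like this (the sample HTML and the XPath expression are invented; assumes HtmlCleaner's TagNode.evaluateXPath API):
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class XPathLookup {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div id=\"content\"><p>Some text</p></div></body></html>";
        TagNode root = new HtmlCleaner().clean(html);
        // evaluateXPath returns the matching nodes as an Object[]
        Object[] hits = root.evaluateXPath("//div[@id='content']/p");
        if (hits.length > 0) {
            System.out.println(((TagNode) hits[0]).getText());
        }
    }
}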
For other html parsers see this SO question.

I suggest the Validator.nu parser, which is based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.
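A minimal sketch of using its DOM front end (assuming the nu.validator.htmlparser HtmlDocumentBuilder API; the sample HTML is invented):
import java.io.StringReader;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Html5Parse {
    public static void main(String[] args) throws Exception {
        String html = "<p>First paragraph<p>Second"; // unclosed tags on purpose
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(html)));
        // From here on it is a standard W3C DOM
        System.out.println(doc.getElementsByTagName("p").getLength()); // 2
    }
}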

Related

Convert HTML divs to Java/JSON objects?

Is there a way I can read an entire website's HTML in my code and then convert the HTML to Java or JSON objects, kind of? It would be cool to crawl a site and then extract text from certain divs. Is there some way to use a marshaller for this?
You could check out XPath, which can be used to identify HTML elements on web pages. It can select certain elements or search for text using regular expressions.
For example, this would be the XPath to your question's paragraph: //*[@id="question"]/div/div[2]/div[1]/p (extracted from Chrome dev tools). It can be used in combination with Selenium WebDriver if you want to crawl a web page using Java.
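For illustration, a rough sketch of driving such an XPath query through Selenium WebDriver in Java (the URL and the XPath expression are placeholders):
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class XPathCrawl {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com"); // placeholder URL
            // Extract the text of every div matching a hypothetical XPath
            List<WebElement> divs = driver.findElements(By.xpath("//div[@class='post-text']"));
            for (WebElement div : divs) {
                System.out.println(div.getText());
            }
        } finally {
            driver.quit();
        }
    }
}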

Efficient way to parse HTML dump found in the form of string

Please bear with this trivial question; it is covered in bits and pieces on Stack Overflow.
I have an HTML dump of a website in the form of a String. I want to extract text from specific tags in it.
In other words, I want to mimic
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements links = doc.getElementsByTag("cite");
I am not using Jsoup because I don't want it to connect to the website (I have another service for that, which returns the HTML dump as text). I found HTMLEditorKit for converting text to an HTMLDocument, but it doesn't seem to be very easy to use (unlike Jsoup or HTMLParser), or I am unable to get it.
Any help would be useful.
Thanks.
If you have used Jsoup and it worked, you should continue using it.
Document doc = Jsoup.parse("<html>...");
should do.
See the jsoup API documentation.
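Putting it together for the original question, a minimal sketch that parses an in-memory dump and mimics the getElementsByTag("cite") call, with no network connection (the sample string is invented):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseDump {
    public static void main(String[] args) {
        // htmlDump stands in for the string your other service returns
        String htmlDump = "<html><body><cite>example.com/page</cite></body></html>";
        Document doc = Jsoup.parse(htmlDump); // parses in memory, no connection made
        Elements cites = doc.getElementsByTag("cite");
        for (Element cite : cites) {
            System.out.println(cite.text());
        }
    }
}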

html search and replace preserving html tags

I'm looking for a Java-based HTML parser that can search and replace text while preserving HTML tags. This question has been asked here before, but the answers seem to miss the target. There are a few HTML parsers that I downloaded and wrote simple programs against to see whether they could do the job, including jsoup, Jericho, and Java HTML Parser. They can do a search, but when it comes to replacing text while preserving HTML tags, there is no way to do it.
I have read the complete thread for these posts:
How to find/replace text in html while preserving html tags/structure
html search and replace on server side
If no such parser exists today, what is the best way to implement one? If you have done something like this already, can you share the code?
The Jericho parser might help you. It has been around forever and works with malformed HTML.
http://jericho.htmlparser.net/docs/index.html
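For illustration, a rough sketch of the replace-only-text-segments idiom with Jericho's Source and OutputDocument (the sample HTML and the search/replace strings are invented; assumes Source.getNodeIterator and OutputDocument.replace behave as documented):
import java.util.Iterator;
import net.htmlparser.jericho.CharacterReference;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.Tag;

public class ReplaceText {
    public static void main(String[] args) {
        Source source = new Source("<p>Hello <b>world</b></p>");
        OutputDocument output = new OutputDocument(source);
        // Walk every tag, character reference and plain-text segment;
        // rewrite only the plain text so the markup is left intact.
        for (Iterator<Segment> it = source.getNodeIterator(); it.hasNext(); ) {
            Segment segment = it.next();
            if (!(segment instanceof Tag) && !(segment instanceof CharacterReference)) {
                output.replace(segment, segment.toString().replace("world", "Jericho"));
            }
        }
        System.out.println(output.toString()); // <p>Hello <b>Jericho</b></p>
    }
}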
The Caja parser uses libhtmlparser, an HTML5 parser that deals well with tag soup containing embedded XML subtrees, producing an org.w3c.dom.DocumentFragment, and it has a renderer that produces well-formed HTML.
The parser code is at http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/html/DomParser.java
The renderer code is at http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/html/Nodes.java

Sanitize HTML data

I'm fetching data from different RSS/Atom feeds, and sometimes the HTML data I receive contains tags that don't have closing tags, or has other issues, and it screws up the page layout/styling.
Sometimes there is a class name / id clash. Is there any way to sanitize it?
Can anybody point me to a reliable JavaScript or Java implementation?
You can give JTidy a try.
JTidy can be used as a tool for cleaning up malformed and faulty HTML.
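A minimal sketch of cleaning a feed fragment with JTidy (the sample markup is invented; assumes JTidy's Tidy class and its parse(InputStream, OutputStream) overload):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

public class CleanFeedHtml {
    public static void main(String[] args) throws Exception {
        String dirty = "<p>Unclosed paragraph <b>bold text"; // invented feed fragment
        Tidy tidy = new Tidy();
        tidy.setXHTML(true); // emit well-formed XHTML
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(new ByteArrayInputStream(dirty.getBytes(StandardCharsets.UTF_8)), out);
        System.out.println(out.toString(StandardCharsets.UTF_8.name())); // tags closed and balanced
    }
}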
Another option is HTML Cleaner
HtmlCleaner reorders the elements of dirty, ill-formed HTML and produces well-formed XML (see the fuller description quoted earlier).
I have used NekoHTML with great success. It's just a thin layer over the Apache Xerces parser that puts it into error-correcting mode, which is a great architecture: every time Xerces gets better, so does Neko. And there's no huge amount of extra code.
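For illustration, a rough sketch with NekoHTML's error-correcting DOMParser (the sample markup is invented):
import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoParse {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><table><tr><td>cell"; // deliberately broken
        DOMParser parser = new DOMParser(); // NekoHTML's error-correcting DOM parser
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();
        // Note: NekoHTML upper-cases HTML element names by default
        System.out.println(doc.getElementsByTagName("TD").getLength()); // 1
    }
}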

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements of a certain type from a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}
HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
The documentation is here, with some code samples; you're basically looking for the getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.
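For example, retrieving all INPUT elements and their ids with HtmlCleaner might look roughly like this (the file path is a placeholder; assumes TagNode.getElementsByName and getAttributeByName):
import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class FindInputs {
    public static void main(String[] args) throws Exception {
        // "page.html" is a placeholder path; read the HTML file from disk
        TagNode root = new HtmlCleaner().clean(new File("page.html"));
        // Retrieve all INPUT elements recursively, malformed markup or not
        TagNode[] inputs = root.getElementsByName("input", true);
        for (TagNode input : inputs) {
            String id = input.getAttributeByName("id"); // null if the element has no id
            if (id != null) {
                System.out.println(id);
            }
        }
    }
}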
I've had success using TagSoup. Here's a short description from its home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
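A minimal sketch of TagSoup's SAX interface, printing the id of every input element as it streams past (the sample markup is invented):
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupScan {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser(); // SAX-compliant TagSoup parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("input".equals(localName) && atts.getValue("id") != null) {
                    System.out.println(atts.getValue("id")); // id of each input element
                }
            }
        });
        reader.parse(new InputSource(new StringReader("<form><input id=a><input></form>")));
    }
}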
Check out JTidy.
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
