(I've seen similar questions, but I think none of them cater to my specific needs, hence...)
I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:
figuring out the most prominent color in an HTML chunk
changing that color to some other color (hence, has to support modification of the HTML as well)
pruning out unwanted tags
fixing up the HTML to result in a well-formed HTML snippet
Parts of the last two are done by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.
Thanks in advance!
You might want to check out TagSoup:
http://home.ccil.org/~cowan/XML/tagsoup/
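It presents itself as a standard SAX XMLReader, so it drops straight into an existing SAX pipeline. A minimal sketch (the file name is just a placeholder):

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import java.io.FileReader;

public class TagSoupDemo {
    public static void main(String[] args) throws Exception {
        // TagSoup's Parser is a SAX XMLReader that accepts messy, real-world HTML
        XMLReader reader = new Parser();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println("open: " + localName);
            }
        });
        reader.parse(new InputSource(new FileReader("messy.html")));
    }
}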
Well, I would tidy it into valid XML first, then use XSLT to do a conditional deep copy, doing the most-prominent-color/pruning/whatever processing you need along the way.
Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.
You'll need something else for the colour changing stuff.
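For the clean-up part, the core of the API is only a few calls. A minimal sketch (option setters from org.w3c.tidy.Tidy; double-check the names against your version):

import org.w3c.tidy.Tidy;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class TidyDemo {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);         // emit well-formed XHTML
        tidy.setQuiet(true);         // suppress progress messages
        tidy.setShowWarnings(false);
        try (FileInputStream in = new FileInputStream("dirty.html");
             FileOutputStream out = new FileOutputStream("clean.html")) {
            tidy.parse(in, out);     // repairs and rewrites the markup
        }
    }
}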
Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).
Related
I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within HTML tags and use these to scrape what I want.
So for this website, my approach would be firstly to scrape a list of the URLs of the pages you are taken to upon clicking on one of the experiences, for example: https://www.wearvr.com/#game_id=game_1041, and then secondly to cycle through this list, scraping the relevant attributes each time.
However, I am stuck at the first step: instead of working with simple "a href" tags, I come across "data-reactid" attributes, which confuse the matter.
I do my scraping with iMacros but I'm pretty decent at Java now so would learn scraping in Java if need be (which seems likely as iMacros is pretty limited).
My question is: how do these "data-reactid" attributes work, and how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is inside the <a> tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify what sort of identifiable string combinations are around the text you want. I will say this is probably much easier to implement in Java than in iMacros, and probably more accurate.
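As a rough illustration of the string approach in Java, here is a sketch that pulls href values and link text with a regular expression (deliberately naive; it will miss plenty of real-world edge cases):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGrabber {
    // Naive pattern: captures the href value and the anchor text of simple <a> tags
    private static final Pattern LINK = Pattern.compile(
            "<a[^>]*href=[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static void main(String[] args) {
        String page = "<html><body><a href=\"https://www.wearvr.com/#game_id=game_1041\">A Game</a></body></html>";
        Matcher m = LINK.matcher(page);
        while (m.find()) {
            System.out.println(m.group(1) + " -> " + m.group(2));
        }
    }
}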
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This... doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
I'm having a problem with RichTextArea: when I paste text copied from MS Word or OpenOffice into a RichTextArea, it keeps all the text styles, and this is perfect. But one bad thing is that its HTML is huge :( and the database's size keeps increasing because of the unnecessary HTML tags.
My question is: "How do I optimize that HTML text easily?"
Thanks!!!
RichTextArea is based on the browser's contentEditable support. This means that the HTML "tag soup" that you'll wind up with is going to be platform-, source-, and browser-specific. When you say "optimize" what's your end goal? How much of the original formatting do you want to preserve? Beyond just trivial minification of the HTML that's being pasted in, any significant reduction in the complexity of the HTML will likely result in a loss of visual fidelity.
Utilities such as HTML Tidy or any of its derivatives can probably help you with the minification aspect. If your goal is to reduce the complexity of the HTML, you might consider using HTMLUnit as a captive, server-side browser to render the pasted content in memory and then extract the attributes that you consider useful from HTMLUnit's DOM. FWIW, this is one way to make AJAX apps crawlable by search engines.
While reducing visual fidelity can be a little disconcerting to the original user, it does afford you the opportunity to unify the visual style of all pasted content. If you're building a site based on contributions from many users, this homogeneity decreases the amount of mental effort required to orient (i.e. see what you're seeing) the content.
Finally, I figured out the answer to my own question:
I found TinyMCE for GWT good enough for me; it has a paste-from-MS-Word option, and its HTML optimization is excellent.
Related question
HTML Tidy has an API you can use in Java programs.
What's the best library/approach for removing Javascript from HTML that will be displayed?
For example, take:
<html><body><span onmousemove='doBadXss()'>test</span></body></html>
and leave:
<html><body><span>test</span></body></html>
I see the DeXSS project. But is that the best way to go?
JSoup has a simple method for sanitizing HTML based on a whitelist.
Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:
There are still a number of known XSS attacks that DeXSS does not yet detect.
A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions, so only a whitelist protects against unknown, possibly unsafe constructions.
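For illustration, the whole thing boils down to a single call (Whitelist is the class name in the jsoup versions current at the time; later releases renamed it Safelist):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class Sanitize {
    public static void main(String[] args) {
        String dirty = "<html><body><span onmousemove='doBadXss()'>test</span></body></html>";
        // Keep only a small set of known-safe tags; everything else is stripped.
        // Which tags survive depends on the whitelist you pick, but event-handler
        // attributes like onmousemove never make it through.
        String clean = Jsoup.clean(dirty, Whitelist.basic());
        System.out.println(clean);
    }
}

Note that Jsoup.clean returns a body fragment rather than a full document, so the html/body wrapper from the example above is not reproduced in the output.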
The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.
Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
Similarly, another even easier approach would be to provide a text-based syntax, like Markdown, for editing. (There aren't that many ways you can exploit the SO edit area, for instance: Markdown syntax plus a limited tag list without attributes.)
You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). This is a DOM parser (as opposed to SAX) and allows you to easily traverse and manipulate the DOM, removing node attributes like onmouseover, for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your HTML is, you may need to clean it up first; JTidy (http://jtidy.sourceforge.net/) is good for that.
But obviously doing all this involves some overhead if you're doing this at page render time.
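To make that concrete, here is a hedged sketch of the attribute-stripping with dom4j's visitor support (method names are from dom4j 1.6; verify against your version):

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.VisitorSupport;
import java.util.ArrayList;
import java.util.List;

public class StripHandlers {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentHelper.parseText(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>");

        doc.accept(new VisitorSupport() {
            @Override
            public void visit(Element node) {
                // Copy the attribute list so we can remove entries while iterating
                List<Attribute> attrs = new ArrayList<Attribute>(node.attributes());
                for (Attribute a : attrs) {
                    if (a.getName().toLowerCase().startsWith("on")) {
                        node.remove(a); // drops onmousemove, onclick, etc.
                    }
                }
            }
        });

        System.out.println(doc.asXML()); // the onmousemove attribute is gone
    }
}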
I want to use an html parser that does the following in a nice, elegant way
Extract text (this is most important)
Extract links, meta keywords
Reconstruct original doc (optional but nice feature to have)
From my investigation so far, Jericho seems to fit. Any other open source libraries you guys would recommend?
I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.
I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements, and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
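To give a feel for the API, here is a minimal CyberNekoHtml sketch; it parses tag soup straight into a standard W3C DOM (note that Neko upper-cases HTML element names by default):

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class NekoDemo {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        // The input is deliberately not well-formed; Neko repairs it while parsing
        parser.parse(new InputSource(new StringReader(
                "<html><body><p>Hello <a href=http://example.com>world")));
        Document doc = parser.getDocument();
        NodeList links = doc.getElementsByTagName("A");
        System.out.println(links.getLength() + " link(s); text: "
                + doc.getDocumentElement().getTextContent());
    }
}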
There's a list of open source java html parsers here:
http://java-source.net/open-source/html-parsers
I would definitely go for JSoup.
Very elegant library and does exactly what you need.
See Example Here
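For concreteness, a small jsoup sketch covering all three requirements from the question (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Extract {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get();

        System.out.println(doc.text());                  // 1. all visible text
        for (Element a : doc.select("a[href]")) {        // 2. links...
            System.out.println(a.attr("abs:href"));
        }
        String keywords = doc.select("meta[name=keywords]").attr("content");
        System.out.println(keywords);                    //    ...and meta keywords
        System.out.println(doc.outerHtml());             // 3. reserialized document
    }
}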
I ended up using HtmlCleaner http://htmlcleaner.sourceforge.net/ for something similar. It's really easy to use and was quick for what I needed.
I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML, I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as: when you see <parent><child></parent>, you close the <child> first, and when you then see a stray </child>, you ignore it). But of course this isn't a proper grammar, and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
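To make the "roll your own event stream" idea concrete, here is a deliberately tiny sketch that emits open/close/text events without any validation. This is only an illustration of the lexing approach, not what Tag Soup actually does internally:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SoupLexer {
    // Matches either a tag (group 1 = optional '/', group 2 = name) or a run of text
    private static final Pattern TOKEN =
            Pattern.compile("<\\s*(/?)\\s*([a-zA-Z][\\w-]*)[^>]*>|([^<]+)");

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("<parent><child></parent></child>");
        while (m.find()) {
            if (m.group(3) != null) {
                System.out.println("text : " + m.group(3));
            } else if (m.group(1).isEmpty()) {
                System.out.println("open : " + m.group(2));
            } else {
                System.out.println("close: " + m.group(2));
            }
        }
        // The events come out in document order; deciding what tree they
        // should form is exactly the hard part described above.
    }
}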
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault-tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
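If you do go the Woodstox route, you drive it through the standard StAX API. A minimal pull-parsing sketch in Java (equally callable from Scala), assuming Woodstox is on the classpath so that newInstance() picks up its factory:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxLinks {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new StringReader("<doc><a href='http://example.com'>link</a></doc>"));

        // Pull events one at a time; a malformed document fails at the event
        // where the problem occurs rather than up front, as a DOM load would
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "a".equals(reader.getLocalName())) {
                System.out.println(reader.getAttributeValue(null, "href"));
            }
        }
    }
}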
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing