Java library for cleaning up HTML just like a browser would

So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.
Doesn't sound so tough, right? After all, every browser on the market deals with malformed HTML and turns it into something renderable on nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem... for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?
One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently don't capture the author's intention, and they spit out documents that render quite differently from the original source. For this project, it's of the utmost importance that the two documents render similarly.
I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job... at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.
Can anyone help me in my search for the perfect HTML parser? Thanks in advance!

I would say jsoup.
See also: which-html-parser-is-best

I have used HTML Tidy in the past.

TagSoup?

Related

How to get the coordinates and dimension of a div using Java

I am working on a project that translates the HTML code of a web page into a specific JS library using Java, so that the div blocks can have different dynamic behaviors.
To translate an HTML div into a JS object, I have to know its coordinates as well as its width and height.
I looked into several Java HTML parser libraries: http://java-source.net/open-source/html-parsers
But none of them has this functionality except Cobra http://lobobrowser.org/cobra/java-html-parser.jsp . It has a rendering engine that can provide the coordinates and dimensions of a div. But this library turns out to be really buggy. I cannot even get the tests that come with the library to run.
Does anyone know how to handle this problem? I would really appreciate it if you could help!
Thanks in advance!
Phil

You could try some component of HtmlUnit, which emulates a browser. Honestly though, I think you need to think about your question more carefully. JQuery can do the 'different dynamic behaviours' thing you talk about via modification of the HTML DOM (Document Object Model) with Javascript, and if you need anything in the HTML document, inspection of the DOM via Javascript should be your first port of call. Java should not be required anywhere (unless you're using it server-side for page and input processing with JSP or some similar tech). Any responses to client input can be triggered server-side and sent to Javascript on the client-side, which triggers JQuery actions that modify the DOM.

Parsing HTML webpages in Java

I need to parse a lot of HTML web pages (100+) for specific content (a few lines of text that are almost the same on every page).
I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.
Both methods are slow, and with jsoup I get the following error (on multiple computers with different connections):
java.net.SocketTimeoutException: Read timed out
Is there anything better?
EDIT:
Now that I've gotten jsoup to work, I think a better question is how do I speed it up?

Did you try lengthening the timeout in jsoup? It's only 3 seconds by default, I believe.
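On the "how do I speed it up" part: with 100+ pages, most of the time is usually spent waiting on the network, so fetching several pages at once tends to help more than swapping parsers. Below is a minimal sketch of that idea using only the JDK (java.net.http, Java 11+) rather than jsoup. The class name, the pool size of 8, and the 10 and 15 second timeouts are assumptions for illustration, not recommendations from the thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFetchSketch {
    /** Downloads one page with an explicit per-request timeout; returns null on failure. */
    static String download(HttpClient client, String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(15)) // assumed limit for one request
                    .GET()
                    .build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (Exception e) {
            return null; // timeout, I/O error, or malformed URL
        }
    }

    /** Applies fetcher to every URL on a fixed-size pool; results keep the input order. */
    static Map<String, String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            Map<String, String> pages = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    pages.put(urls.get(i), pending.get(i).get());
                } catch (ExecutionException e) {
                    pages.put(urls.get(i), null); // the fetcher threw for this URL
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed connect limit
                .build();
        Map<String, String> pages = fetchAll(List.of(args), url -> download(client, url), 8);
        pages.forEach((url, body) ->
                System.out.println(url + " -> " + (body == null ? "failed" : body.length() + " chars")));
    }
}
```

The fetcher is passed in as a function, so the same pool could just as well wrap a jsoup call if you prefer to keep jsoup for the parsing step.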

I will suggest Nutch, an open source web-search solution that includes support for HTML parsing. It's a very mature library. It uses Lucene under the hood and I find it to be a very reliable crawler.

A great skill to learn would be XPath. It would be perfect for that job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.
Here's a nice link since you are interested in Java:
http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
XPath is also a good thing to know when you're not using Java, so that's why I would choose that route.
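For a feel of what that looks like, here is a minimal sketch with the JDK's built-in javax.xml.xpath API. It assumes the input is already well-formed XML, which real-world HTML often is not, so in practice the page would first have to go through a cleaner. The class name and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {
    /** Evaluates an XPath expression against well-formed XML and returns its string value. */
    static String extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; must be well-formed for the XML parser to accept it.
        String page = "<html><body><div id='price'>42 EUR</div></body></html>";
        System.out.println(extract(page, "//div[@id='price']/text()")); // prints "42 EUR"
    }
}
```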

mediawiki syntax parser for android

I need some help fetching the content of a MediaWiki page and displaying it in my Android app. I've found that MediaWiki has an API which can export the markup text, but I need some way to parse it.
Other solutions are also welcome, but I don't want to just include the whole page (with menus and everything) in a WebView.

You can get the HTML for just the content section (i.e. not including the menus and everything) by specifying action=render when fetching the page normally. You can get about the same result using the API's action=parse. In either case, you could then supply your own framing HTML, stylesheets, and the like for proper display to the user.
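As a rough sketch of the two options, the helper below only builds the request URLs. The host wiki.example.org and the class and method names are hypothetical, and the exact script paths (index.php, api.php) depend on how the wiki is installed. The prop=text and format=json parameters are one plausible choice for action=parse, not the only one.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class MediaWikiUrlSketch {
    /** URL for the bare content HTML via the normal page script (action=render). */
    static String renderUrl(String indexPhp, String title) {
        return indexPhp + "?title=" + encode(title) + "&action=render";
    }

    /** URL for roughly the same content via the API (action=parse), returned as JSON. */
    static String parseUrl(String apiPhp, String title) {
        return apiPhp + "?action=parse&page=" + encode(title) + "&prop=text&format=json";
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical wiki location, for illustration only.
        System.out.println(renderUrl("https://wiki.example.org/index.php", "Main Page"));
        System.out.println(parseUrl("https://wiki.example.org/api.php", "Main Page"));
    }
}
```

The HTML that comes back could then be wrapped in your own page skeleton and stylesheet before it is handed to the view that displays it.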

If you are in control of the wiki, you may be best served by using the mobile format that Wikipedia uses. Otherwise it will depend on what the markup text it returns looks like. If the markup is what you would see in the wiki editor, you may have issues. Wikipedia, for example, makes extensive use of templates (or whatever they call them). I'm not sure you can do much if you don't have the actual template code behind them.
You also might want to look at http://www.mediawiki.org/wiki/Markup_spec if you are headed down this path.

DOM implementation in java

In a web browser written in Java, different kinds of parsers are used to parse the page and create a DOM document. During rendering, how does the browser turn the DOM into JComponents? Can anyone explain the whole process of mapping the DOM onto JComponents in order to show a complete web page in Java?

Here is a link that shows how to display a DOM hierarchy in a JTree (a subclass of JComponent):
http://download.oracle.com/docs/cd/E17802_01/j2ee/j2ee/1.4/docs/tutorial-update2/doc/JAXPDOM4.html#wp64186
Hope it will help you.
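The core of that approach can be sketched in a few lines: walk the DOM recursively and mirror each node as a DefaultMutableTreeNode. This is only an assumption-laden illustration (it is not the tutorial's code, the class name is made up, and the input must be well-formed XML). It shows the document structure, it does not render a page.

```java
import java.io.StringReader;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.TreeNode;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeSketch {
    /** Mirrors a DOM node and its descendants as Swing tree nodes, skipping blank text. */
    static DefaultMutableTreeNode toTreeNode(Node domNode) {
        DefaultMutableTreeNode treeNode = new DefaultMutableTreeNode(label(domNode));
        NodeList children = domNode.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean blankText = child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty();
            if (!blankText) {
                treeNode.add(toTreeNode(child));
            }
        }
        return treeNode;
    }

    /** Text nodes are labelled with their text, everything else with the node name. */
    private static String label(Node node) {
        return node.getNodeType() == Node.TEXT_NODE ? node.getNodeValue().trim() : node.getNodeName();
    }

    /** Parses well-formed XML and returns the tree for its root element. */
    static DefaultMutableTreeNode parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return toTreeNode(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    static void print(TreeNode node, String indent) {
        System.out.println(indent + node);
        for (int i = 0; i < node.getChildCount(); i++) {
            print(node.getChildAt(i), indent + "  ");
        }
    }

    public static void main(String[] args) {
        print(parse("<html><body><p>Hi</p></body></html>"), "");
    }
}
```

To see it on screen, the root returned by parse could be passed to new JTree(root) and placed in a JScrollPane inside a JFrame.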

That is far too large a subject for this forum, unless you restrict the browser to a specific version of HTML without CSS, without JavaScript (or other scripting languages), and without any embedded objects.
You could look at existing code if you can work within the GPL and other licences.

Well, basically you implement the HTML and CSS standards. Doing so completely and correctly is a HUGE amount of work, several man-years at least. There are some projects that are attempting this, but none have been very successful so far.

Is there a way to change or reskin an incoming website on the fly?

I have a project where they want me to embed a website into a java application and they want the website to have a similar color scheme as the rest of the application. I know a lot about CSS and building websites but I am not aware of a way to change the look of a website as it comes in on the fly. Is there someone who can help?
Update:
I don't have access to the header because it is not my website. To give more info about the project: we have a browser embedded in a Java client application. The user needs to access a website that displays the contents of a database. I have no access to the original HTML or CSS from the site.
What I need is to change the background color and font sizes of the incoming web page to match the look and feel of the Java application.

One approach would be to replace their CSS with your own.
You could also take the approach used by the Stylish plugin, which involves a lot of !important declarations to override the site's CSS. Since this is a Java app, I assume the user will not have the opportunity to supply their own CSS, so using !important here doesn't precisely go against the standard.

In your particular situation, I'd look into data scraping: all you need to do is scrape the website for the data, and then restyle it to present it how you want.
Good luck

The Greasemonkey add-on for Firefox does just this. You can write a bit of Javascript code and have it run when certain web pages load. One common thing to use it for is to make changes to the DOM to move page elements around, hide or resize elements, change colors, etc. There are a bunch of examples at userscripts.org if you want to get an idea of what I am talking about.
Your code would simply need to do something similar. Load the page (including the normal style sheets) and then use the DOM to make changes to style elements as desired. Browse through the source of the page to get the names/ids of important elements, and your code can key off of those. Loading an additional CSS file containing your changes is an option, but doing it programmatically will likely give you more flexibility in the event that the target website changes.

It depends on what you use to show the pages in Java. Most browser implementations support dynamic changes to the DOM, so you can simply add a CSS file to the head as its last element, and it will be applied.
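If the embedded browser does not expose its DOM, a similar effect can be approximated on the fetched HTML text before it is handed to the browser. The sketch below inserts a style element just before the closing head tag. It is plain string handling (no HTML parser), the class name is made up, and the !important rules in main are only sample overrides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleInjectionSketch {
    private static final Pattern HEAD_CLOSE =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);

    /** Inserts a style element right before the first closing head tag. */
    static String injectCss(String html, String css) {
        String styleBlock = "<style type=\"text/css\">" + css + "</style>";
        Matcher matcher = HEAD_CLOSE.matcher(html);
        if (!matcher.find()) {
            return styleBlock + html; // no closing head tag found: fall back to prepending
        }
        return html.substring(0, matcher.start()) + styleBlock + html.substring(matcher.start());
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Report</title></head><body>data</body></html>";
        // Sample overrides; !important is there to beat more specific rules on the site.
        String css = "body{background:#f0f0f0 !important;font-size:12px !important}";
        System.out.println(injectCss(page, css));
    }
}
```

Being last in the head only wins against rules of equal specificity, which is why the sample CSS leans on !important.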

You need to know the markup of the HTML/CSS so you can make the best skin.
You could theoretically do it by styling just the basic tags (h1...h6, p, etc.), but the result would not be as good, and at times it would even produce horrible things.
If you KNOW the site markup, then you can make a skin and simply use CSS and images to skin it the way you want.
Just include your CSS LAST so that it overrides the CSS already present on the site that you want to skin differently.
It should not be a difficult thing per se. The skin itself is probably the bigger (more effort required) part of the job.

"On the fly" should mean changing the HTML after it is fetched. So parsing it and replacing tokens seems to be the way.
You could change the location of the style sheet files by replacing the href value in each link that points to a CSS file, setting it to your own style sheet (a different URI).
<link type="text/css" href="mylocalURI" rel="stylesheet" />
(this should be the result of a process/replacement)
I think you understand what should happen for inline styles.
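A minimal sketch of that replacement step is shown below. It uses regular expressions on the raw HTML, which is fragile on unusual markup (unquoted href values, for instance, are not handled), so treat it as an illustration of the idea rather than a robust rewriter. The class and method names are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StylesheetSwapSketch {
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_STYLESHEET =
            Pattern.compile("rel\\s*=\\s*[\"']?stylesheet", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*(\"[^\"]*\"|'[^']*')", Pattern.CASE_INSENSITIVE);

    /** Points every stylesheet link at newHref; other link tags are left untouched. */
    static String swapStylesheets(String html, String newHref) {
        Matcher tags = LINK_TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (tags.find()) {
            String tag = tags.group();
            if (REL_STYLESHEET.matcher(tag).find()) {
                String replacement = Matcher.quoteReplacement("href=\"" + newHref + "\"");
                tag = HREF.matcher(tag).replaceFirst(replacement);
            }
            out.append(html, last, tags.start()).append(tag);
            last = tags.end();
        }
        return out.append(html, last, html.length()).toString();
    }

    public static void main(String[] args) {
        String page = "<head><link type=\"text/css\" href=\"site.css\" rel=\"stylesheet\" /></head>";
        System.out.println(swapStylesheets(page, "mylocalURI"));
    }
}
```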

I would use JTidy to normalize the original site HTML to XHTML, then use XSLT to filter out only the interesting/relevant information as XML; and finally (since I wouldn't want to convert XML to objects), XSLT again to transform the "pure" XML into the HTML look & feel I need/want.
All of this can be assembled as streams, using roughly 4 KB of buffer per filter (12 KB total) per thread, which also means it will run fast enough. And it is all built on standard, openly available components.
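Here is a rough sketch of one XSLT stage using the JDK's javax.xml.transform API. The JTidy step is left out, and the input is assumed to be XHTML-like markup with no default namespace (real XHTML usually declares one, in which case the match patterns would need a prefix). The stylesheet, element names, and class name are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFilterSketch {
    // Hypothetical filter: keep only table rows, emitted as <row>/<cell> elements.
    static final String EXTRACT_XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'><data><xsl:apply-templates select='//tr'/></data></xsl:template>"
        + "<xsl:template match='tr'><row><xsl:for-each select='td'>"
        + "<cell><xsl:value-of select='.'/></cell></xsl:for-each></row></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies one stylesheet to one document; stages can be chained by feeding output back in. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Report</h1>"
                + "<table><tr><td>a</td><td>1</td></tr></table></body></html>";
        System.out.println(transform(xhtml, EXTRACT_XSLT));
    }
}
```

A second stylesheet that turns the row/cell XML into the desired HTML would be applied the same way, by calling transform on the output of the first stage.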
Cheers.
