Is there a library which can transform any given HTML page, with JS and CSS all over it, into a minimalistic uniform format?
For instance, if we render the Stack Overflow homepage, I want it to be shown in a minimal format, and I want all other sites to be rendered down the same way.
Sort of like the Lynx web browser, but with minimal graphics.
The best tool for HTML-to-Lynx-style text that I have come across is Jericho's Renderer.
It's easy to use:
import java.net.URL;
import net.htmlparser.jericho.Source;

// Parse the page and render it as plain text
Source source = new Source(new URL(sourceUrlString)); // or new Source("<html>pass in raw html string</html>");
String renderedText = source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
(from here)
and handles HTML in the wild (badly formatted) very well.
Here's the first few lines of this page formatted this way using Jericho:
Stack Exchange log in | careers | chat
| meta | about | faq
Stack Overflow
* Questions
* Tags
* Users
* Badges
* Unanswered
* Ask Question
Java HTML normalizer?
**
IS there a library which can transform
any given HTML page with JS, CSS all
over it, into a minimalistic uniform
format?
For instance, if we render
stackoverflow homepage, I want it to
be shown in a minimal format. I want
all other sites to be rendered down.
Sort of like Lynx web browser but with
minimal graphics.
java lynx link|edit|flag asked 2 days
ago Kim Jong Woo 593112 89% accept
rate Do you want to transform your
HTML code to simpler HTML code, or do
your want to show this "minimalistic
uniform format" to your user? Or do
you want to create a image? – Paŭlo
Ebermann yesterday simpler html code
without sacrificing the relative
positioning of the elements. – Kim
Jong Woo 16 hours ago
2 Answers
To answer your first question: No, I don't think there is a library for that purpose (at least, that is what my googling resulted in).
And I think the reason for this is that what you want is a very special need.
So as a solution for your problem you can parse the HTML and display it the way you want in a JEditorPane or whatever you are using for display.
I can only suggest the way I would do it (because I am familiar with XML and everything around it):
* Use a library to ensure that your HTML conforms to XHTML: http://htmlcleaner.sourceforge.net/release.php (a sketch of this step follows below)
* then either parse the XML with DOM or SAX parsers and display it the way you want,
or
* use XSLT to transform the document into some other HTML document which results in a view that fits your needs,
or
* use one of the available HTML parser libraries (most of the ones I found seemed rather outdated, around 2006), but they could be an option for you.
This is just one suggestion of how you could do it. I'm sure there are thousands of other ways which will do the same thing.
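As a rough sketch of that first cleanup step, assuming the standard org.htmlcleaner API (the messy input string is just illustrative), something like this should produce well-formed XML that you can then feed to DOM, SAX or XSLT:

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

public class CleanToXhtml {
    public static void main(String[] args) throws Exception {
        String messyHtml = "<html><body><p>Unclosed paragraph<ul><li>item</body>";

        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode root = cleaner.clean(messyHtml);          // balances and reorders the tags

        // Serialize the cleaned tree as well-formed XML.
        String xml = new PrettyXmlSerializer(props).getAsString(root);
        System.out.println(xml);
    }
}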
Related
So I currently have a big blob of HTML text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize that 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.).
I was thinking about using regex, but after doing some research it seems that regex is a horrible idea to parse HTML with. I came across parsing HTML pages with Groovy, but it is not strictly applicable to my implementation. I am storing the HTML-formatted text (which I am creating using CKEditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse through may end up being quite large, a 15-20 page document.)
Thanks in advance
Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and it would be easy to obtain anything in an HTML page using its selector syntax. Check out the examples of usage in their cookbook.
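As a rough sketch of how that could look (not from the answer itself; the field names and the HTML snippet are invented for illustration), jsoup strips the markup and a simple regex then picks out the [%...%] tokens:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;

public class PlaceholderFinder {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[%(\\w+)%\\]");

    public static void main(String[] args) {
        String html = "<p>Dear [%Name%], you are [%age%] years old and [%height%] tall.</p>";

        String text = Jsoup.parse(html).text();   // plain text with the HTML tags removed
        Set<String> fields = new LinkedHashSet<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            fields.add(m.group(1));               // "Name", "age", "height"
        }
        System.out.println(fields);               // enable one form field per entry
    }
}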
I'm trying to convert single pieces of HTML code to the XML format that the *.odt format (OpenOffice) uses. For example, <p>This is some text</p> should be translated to <text:p>This is some text</text:p>. Of course, this should also work with lists etc.
I'm not sure whether the best way to go would be using an XSLT processor (and if so, which one for Java?) and creating the stylesheet myself. Isn't there a Java library out there that can already do this?
I'm using jodconverter to go from ODT to PDF, but even though OpenOffice Writer can handle copy-and-pasting the content and displays it in the desired way, jodconverter doesn't seem to be able to "translate" single pieces of HTML (or am I wrong about that?).
Any ideas and suggestions would be very welcome. I should add that I'm absolutely new to Java. Thanks in advance
Ingo
XSLT is the best way to do it. The OpenDocument group is working on an HTML to ODT XSL template. Sadly, it is not ready yet.
You can check on their website to stay in touch (and get beta work maybe).
Otherwise, there are unofficial projects, also based on XSLT, like this one.
It would be easy to apply a little transformation to your HTML to get valid XHTML before processing it to ODT.
Or just check this other example.
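A minimal sketch of that XSLT route using the JAXP API that ships with the JDK; the stylesheet name "html-to-odt.xsl" is hypothetical and stands in for whichever HTML-to-ODT template you end up using, and the input must already be well-formed XHTML (clean it with JTidy or HtmlCleaner first):

import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HtmlToOdtXml {
    public static void main(String[] args) throws Exception {
        String xhtml = "<p xmlns=\"http://www.w3.org/1999/xhtml\">This is some text</p>";

        // Compile the (hypothetical) stylesheet and run the input through it.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("html-to-odt.xsl")));

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        System.out.println(out);   // with a suitable stylesheet: <text:p>This is some text</text:p>
    }
}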
I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from LibraryThing.
How can I retrieve all the words below the "TAGS" tab, like
Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Fetch the page and print the text of every tag link
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
    System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and the website's terms of service (if any), or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
Example here
I imagine there's something similar in Java and other languages (a rough Java sketch follows the list below). The concept would be similar:
Load page data.
Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
Do what you want with the data :)
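A loose Java equivalent of those steps (not the linked PHP example; the URL and the regex are purely illustrative, and for anything non-trivial a real parser such as Jsoup is the safer choice):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleScrape {
    public static void main(String[] args) throws Exception {
        // 1. Load page data.
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new URL("http://www.example.com/").openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }

        // 2. Parse the data (here: a crude regex for the <title> element).
        Matcher m = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE).matcher(page);

        // 3. Do what you want with the data.
        if (m.find()) {
            System.out.println(m.group(1));
        }
    }
}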
It's worth remembering that some people might not appreciate your data mining their site and profiting from / redistributing it on a large scale.
I have a use case in which I need to render an unformatted text in the format of a given web page programmatically in Java. i.e. The text should automatically be formatted like the web page with styles, paragraphs, bullet points etc.
As I see first I will have to analyze the piece of unformatted text to find out the candidates for paragraphs, bullet points, headings etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. a Velocity template) with placeholders for various entities like titles, bullet points etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open-source parser to extract the structure.
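For instance (my suggestion, not the answer's; the answer doesn't name a parser), with the commonmark-java library a Markdown submission is parsed into an AST whose nodes already carry the structure (Heading, Paragraph, BulletList), and rendering it yields structural markup to work from:

import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

public class MarkdownInput {
    public static void main(String[] args) {
        // Hypothetical user input following an agreed markup language (Markdown here).
        String userInput = "# Hello World\n\nFirst paragraph.\n\n* a bullet point\n";

        Parser parser = Parser.builder().build();
        Node document = parser.parse(userInput);   // AST with Heading, Paragraph, BulletList nodes

        String html = HtmlRenderer.builder().build().render(document);
        System.out.println(html);                  // <h1>, <p>, <ul>: structural markup
    }
}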
The external page
If any page can be used, the only thing you can rely on is the "structural markup". So assuming you know the title of the page should be "Hello World", and there is an "h1" element somewhere in the page, you can maybe assume that this is where the header could go.
But if the page is a div tag-soup, and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plainly impossible if you don't know how the page is made.
I don't think Lucene would help for this (as far as I know, Lucene is made to create an index of the words used in a bulk of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by the following steps (a rough sketch follows the list):
* copy-pasting the page
* replacing the parts to change with tags of your template language of choice
* storing the template somewhere the templating system can access it
* configuring your template / view system (viewResolver for Velocity) to use the right template for the right person
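A rough sketch of that last merging step with Velocity (the template name "article.vm" and the $title / $paragraphs placeholders are invented for illustration, and the property names below are the classic Velocity 1.x ones):

// article.vm (hypothetical) would be the copy-pasted page with its changing parts replaced, e.g.
//   <h1>$title</h1> #foreach($p in $paragraphs)<p>$p</p>#end

import java.io.StringWriter;
import java.util.Arrays;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class TemplateMerge {
    public static void main(String[] args) {
        VelocityEngine engine = new VelocityEngine();
        engine.setProperty("resource.loader", "class");
        engine.setProperty("class.resource.loader.class",
                "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
        engine.init();

        VelocityContext context = new VelocityContext();
        context.put("title", "Hello World");
        context.put("paragraphs", Arrays.asList("First paragraph.", "Second paragraph."));

        Template template = engine.getTemplate("article.vm");
        StringWriter out = new StringWriter();
        template.merge(context, out);
        System.out.println(out);
    }
}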
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
A more realistic solution
I would suggest you constrain your problem to:
* using input that has some structure information available (use a GUI to enter it, use a markup language, whatever)
* using templates that you provide and know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading for an unreasonable amount of work...
I'm fetching data from different RSS / Atom feeds, and sometimes the HTML data I receive contains tags that don't have closing tags, or has other issues, and it screws up the page layout / styling.
Sometimes there is a class name / id clash. Is there any way to sanitize it?
Can anybody point me to a reliable JavaScript / Java implementation?
You can give JTidy a try.
JTidy can be used as a tool for cleaning up malformed and faulty HTML.
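A minimal sketch, assuming the standard JTidy API (org.w3c.tidy.Tidy); the feed snippet below is just illustrative input:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

public class TidyFeedHtml {
    public static void main(String[] args) {
        String dirty = "<p>Unclosed paragraph<div class=main>Feed content";

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit well-formed XHTML
        tidy.setShowWarnings(false);
        tidy.setQuiet(true);

        ByteArrayInputStream in = new ByteArrayInputStream(dirty.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);          // balances the tags and closes the <p> and <div>

        System.out.println(out.toString());
    }
}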
Another option is HTML Cleaner
HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
I have used NekoHTML with great success. It's just a thin layer over the Apache parser that puts it into error-correcting mode, which is a great architecture as every time Apache gets better so does Neko. And there's no huge amount of extra code.
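A minimal sketch of using NekoHTML's DOM parser (class and package names are the usual NekoHTML ones, but check them against the version you use; the URL is just a placeholder):

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();   // Xerces in error-correcting HTML mode
        parser.parse("http://www.example.com/");
        Document doc = parser.getDocument();  // well-formed DOM with balanced tags
        System.out.println(doc.getDocumentElement().getTagName());
    }
}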