A good HTML object model in Java?

I'm looking for an HTML object model in Java, capable of parsing HTML (not required) and containing all HTML elements (and CSS as well) in an elegant object model.
I'm looking for a pure Java equivalent of Groovy's HTML builder.
(I've had no luck finding one on Google.)
I want to be able to perform stuff like:
HTML html = new HTML();
Body body = html.body();
Table table = body.addTable(myCssStyle);
Row row = table.addRow("a", "b", "c").withCss(cssRowStyle);
and so on...
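To make the request concrete, here is a minimal sketch of what such a fluent object model could look like in plain Java. The `Tag` class and its `add`, `text`, `withCss`, `addRow`, and `render` methods are hypothetical names invented for this illustration; they are not taken from any existing library, and a real model would likely have typed classes per element.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal element tree; not part of any library.
final class Tag {
    private final String name;
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<Object> children = new ArrayList<>(); // each entry is a Tag or an escaped String

    Tag(String name) {
        this.name = name;
    }

    // Appends a child element and returns the child, so calls can descend.
    Tag add(String childName) {
        Tag child = new Tag(childName);
        children.add(child);
        return child;
    }

    // Appends escaped text and returns this element.
    Tag text(String value) {
        children.add(escape(value));
        return this;
    }

    // Sets the class attribute and returns this element.
    Tag withCss(String cssClass) {
        attributes.put("class", cssClass);
        return this;
    }

    // Convenience for tables: adds a <tr> with one <td> per cell value.
    Tag addRow(String... cells) {
        Tag row = add("tr");
        for (String cell : cells) {
            row.add("td").text(cell);
        }
        return row;
    }

    // Serializes this element and its children without indentation.
    String render() {
        StringBuilder out = new StringBuilder("<").append(name);
        attributes.forEach((k, v) ->
                out.append(' ').append(k).append("=\"").append(escape(v)).append('"'));
        out.append('>');
        for (Object child : children) {
            out.append(child instanceof Tag ? ((Tag) child).render() : (String) child);
        }
        return out.append("</").append(name).append('>').toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

With this sketch, `new Tag("html").add("body").add("table").withCss(myCssStyle)` returns the table element, and `addRow("a", "b", "c").withCss(cssRowStyle)` on it mirrors the calls in the question.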

Check out Jsoup:
Example:
(Building some html)
Document doc = Document.createShell("");
Element headline = doc.body().appendElement("h1").text("thats a headline");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("That's");
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<h1>thats a headline</h1>
<p><span>That's</span>some text ...</p>
</body>
</html>
Documentation:
Codebook
API Documentation (JavaDoc)

Jakarta ECS might be able to do what you want.

Just an idea: you could take a look at the source code of xhtmlrenderer project.
http://code.google.com/p/flying-saucer//
It's not plain HTML (it's XHTML), but may be a good starting point, don't you think?

Related

JSOUP: Extracting text between <div class = "..." > <p>text i want to extract</p> </div> tags [duplicate]

I am trying to select, using Jsoup, a <div> that has multiple classes:
<div class="content-text right-align bold-font">...</div>
The syntax for doing so, to the best of my understanding, should be:
document.select("div.content-text.right-align.bold-font");
However, for some reason, this doesn't work for me.
When I try the same exact syntax on JSFIDDLE, it works without a hitch.
Does multi-class selection work in Jsoup?
(I'd rather find out that this is a bug in my code than find out that this is a Jsoup limitation :)
UPDATE (thanks to the answer below): Jsoup works perfectly with the aforementioned syntax.
Works for me with latest Jsoup (1.5.2).
String html = "<div class=\"content-text right-align bold-font\">foo</div>";
Document document = Jsoup.parse(html);
Elements elements = document.select("div.content-text.right-align.bold-font");
System.out.println(elements.text()); // foo
So either you're possibly using an outdated version of Jsoup which exposes a bug related to this, or the actual HTML doesn't contain a <div> like that.
These references may be helpful in the future:
Jsoup selectors,
jQuery selectors

separate html coded string and normal string

I want to split a single string containing both plain text and HTML code into an array of strings. I searched on Google but did not find any suitable suggestion.
Consider the following string:
blahblahblahblahblahblahblahblahblahblah
blahblah First para blahblahblahblah
blahblahblahblahblahblahblahblahblahblah
<html>
<body>
<p>hello</p>
</body>
</html>
blahblahblahblahblahblahblahblahblahblah
blahblah Second Para lahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
this becomes:
s[0]=whole first para
s[1]=html code
s[2]=whole second para
Is this possible with jsoup, or do I need another API?
It is possible with jQuery. Here is a code snippet.
var str = "blablabla <html><body><p>hello</p></body></html> blabla";
var parsedHTML = $.parseHTML(str);
myList = [];
// loop through parsed text and put it into text based on its type
$.each(parsedHTML, function( i, el ) {
if (el.nodeType < 3) myList[i] = el.nodeName;
else myList[i] = el.data;
});
// use myList ...
Here is a fiddle which shows that it works. The only disadvantage is that the <html> and <body> tags are consumed by the parser and do not appear in parsedHTML.
jsfiddle example
This can be done with JSoup
Simple use example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Then you can navigate the DOM structure to extract the information.
update
To get the text with all the tags you could wrap the entire string in <meta> ... </meta> tags; then parse it, access the individual components, and finally serialize the components back into strings.
Alternatively if you believe the code is well formed (with matching beginning and end tags) you could search for the first match of the regex
/<(html|body)\s*>/
Depending on what the contents of the first tag (match) are, you then look for the last occurrence of the matching close tag.
This is more manual and more error-prone, so it is not recommended. But since you have a non-standard problem, you may want a non-standard solution.
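The regex approach described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the input contains at most one HTML block, the block starts at the first `<html ...>` tag and ends at the last `</html>` tag, and only the `html` case of the regex is handled. `HtmlSplitter` and `split` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes one well-formed <html>...</html> block in the input.
final class HtmlSplitter {
    private static final Pattern OPEN =
            Pattern.compile("<html\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLOSE =
            Pattern.compile("</html\\s*>", Pattern.CASE_INSENSITIVE);

    // Returns {text before, html block, text after}, or {input} if no block is found.
    static String[] split(String input) {
        Matcher open = OPEN.matcher(input);
        if (!open.find()) {
            return new String[] { input };
        }
        int start = open.start();

        // Scan from the opening tag onwards and remember the end of the last closing tag.
        Matcher close = CLOSE.matcher(input).region(start, input.length());
        int end = -1;
        while (close.find()) {
            end = close.end();
        }
        if (end < 0) {
            return new String[] { input };
        }
        return new String[] {
            input.substring(0, start).trim(),
            input.substring(start, end),
            input.substring(end).trim()
        };
    }
}
```

For the string in the question, `split` would return the first paragraph as element 0, the HTML block as element 1, and the second paragraph as element 2.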

How to parse HTML and get CSS styles

I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library, which is an HTML parser for Java; see the example here.
For example you can do something like this:
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
Element ele = doc.getElementsContainingOwnText("Hello World").first(); // first element whose own text contains "Hello World" (null if none)
Set<String> classNames = ele.classNames(); // class names of that element
You can explore the library further to fit your needs.
A similar question: Can jQuery get all CSS styles associated with an element? CSS optimizers may also do what you want; take a look at unused-css.com, which is an online tool but also lists other tools.
As I understand it, you are able to parse the stylesheet from an external file, which makes the task easy to solve. First parse the HTML file with jsoup, which supports a jQuery-like selector syntax that makes complicated HTML files easier to handle. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution: these libraries do the work internally, and the only thing you need to write is the glue code that combines the two.
Using Java java.util.regex
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
prints abc
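Once the class name is known, the remaining step from the question is finding its declarations in the stylesheet. The following is a naive sketch of that glue code, not a real CSS parser: it assumes flat CSS without comments, `@media` blocks, or other nested braces, and it only matches a bare class selector such as `.abc` (not `div.abc` or descendant selectors). `CssLookup` and `declarationsFor` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles only flat rules of the form "selectors { declarations }".
final class CssLookup {
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{([^{}]*)\\}");

    // Returns the declarations of the first rule whose selector list contains
    // exactly ".className", or null if no such rule exists.
    static String declarationsFor(String css, String className) {
        Matcher rule = RULE.matcher(css);
        while (rule.find()) {
            for (String selector : rule.group(1).split(",")) {
                if (selector.trim().equals("." + className)) {
                    return rule.group(2).trim();
                }
            }
        }
        return null;
    }
}
```

Here `css` would be the text of the external stylesheet read into a string, and `className` the value found in the HTML (for example `abc`). For anything beyond simple stylesheets, a dedicated CSS parser is the safer choice.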

Java Html parser to extract specific data?

I have an HTML file like the following:
...
<span itemprop="A">234</span>
...
<span itemprop="B">690</span>
...
From this I want to extract the values for A and B.
Can you suggest an HTML parser library for Java that can do this easily?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
System.out.println(span.text()); // will print 234 and 690
}
http://jsoup.org/
JSoup is the way to go.
JTidy is a confusingly named yet respected HTML parser.

Add custom css to html code with jsoup

I'm working on an Android app which loads an HTML page and shows it in a webview.
The problem is that I want to add my custom CSS (the loaded HTML has no CSS or link to a stylesheet). How do I add the custom CSS to the HTML code using jsoup?
I can't modify the HTML.
And how can the webview open it afterwards?
Thank you
Several ways. You can use Element#append() to append some piece of HTML to the element.
Document document = Jsoup.connect(url).get();
Element head = document.head();
head.append("<link rel=\"stylesheet\" href=\"http://example.com/your.css\">");
Or, use Element#attr(name, value) to add attributes to existing elements. Here's an example which adds style="color:pink;" to all links.
Document document = Jsoup.connect(url).get();
Elements links = document.select("a");
links.attr("style", "color:pink;");
Either way, after modification get the final HTML string by Document#html().
String html = document.html();
Write it to file by PrintWriter#write() (with the right charset).
String charset = Jsoup.connect(url).execute().charset(); // execute() performs the request; response() is only populated afterwards
// ...
Writer writer = new PrintWriter("/file.html", charset);
writer.write(html);
writer.close();
Finally, open it in the webview. Since I can't explain that off the top of my head, here's a link to an example which I think is helpful: WebViewDemo.java. I found the link on this blog, by the way (which I in turn found via Google).
Probably the easiest way is to search and replace on the HTML text to insert your custom styles, before loading it into your WebView. I do this in my app BBC News to restyle the news article page slightly. My code looks like this:
text = text.replace("</head>",
"<style>h1 {font-size: x-large;} h1, div.date, div.storybody, img {margin:4px; padding:4px; line-height:1.25;}</style></head>");
See how I search and replace on the end head tag (including my own </head> tag in the replaced segment). This ensures that the new snippet goes in the right place on the page.
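The same search-and-replace idea can be wrapped in a small helper. This is a sketch under assumptions, not the code from the app mentioned above: `StyleInjector` and `inject` are hypothetical names, the match on `</head>` is case-sensitive, and only the first occurrence is used. If the page has no `</head>` tag, the style block is simply prepended.

```java
// Hypothetical helper illustrating the search-and-replace approach.
final class StyleInjector {
    // Inserts a <style> block just before the first closing </head> tag.
    // If the page has no </head>, the style block is prepended instead.
    static String inject(String html, String css) {
        String style = "<style>" + css + "</style>";
        int headEnd = html.indexOf("</head>");
        if (headEnd < 0) {
            return style + html;
        }
        return html.substring(0, headEnd) + style + html.substring(headEnd);
    }
}
```

On Android, the resulting string could then be passed to the WebView, for example through `WebView.loadDataWithBaseURL`, instead of loading the original URL directly.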
There are a few ways to include CSS in HTML.
This is what I use if the CSS is stored as an external file:
<head><link rel="stylesheet" type="text/css" href="mystyle.css" /></head>
If you want to put it straight into the HTML file:
<head>
<style type="text/css">
hr {color:sienna;}
p {margin-left:20px;}
body {background-image:url("images/back40.gif");}
</style>
</head>
Or if you want to modify a single tag:
<p style="color:sienna;margin-left:20px">This is a paragraph.</p>
*Edit
None of these examples should have any problem displaying.
Ref: W3 Schools CSS
