I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library in Java, which is an HTML parser. You can see an example here.
For example, you can do something like this:
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
Element ele = doc.getElementsContainingOwnText("Hello World").first(); // first element whose own text contains "Hello World"
Set<String> classNames = ele.classNames(); // class names of that element (java.util.Set)
You can explore the library further to fit your needs.
Similar question: Can jQuery get all CSS styles associated with an element? CSS optimizers may also do what you want; take a look at unused-css.com, an online tool whose site also lists other tools.
As I understand it, you are able to parse the style sheet from the external file, which makes the task easy to solve. First parse the HTML file with jsoup, which supports jQuery-like selector syntax and makes complicated HTML files easier to handle. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution; with these libraries the parsing is done internally, and the only thing you have to write is the glue code that combines the two.
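The glue code mentioned above could look roughly like the following. This is a minimal sketch using only the JDK, not the linked solution: the class name CssClassLookup and the method declarationsFor are hypothetical, and it assumes flat CSS (no comments, no @media nesting) in which the rule's selector is exactly the bare class, such as .abc. Once jsoup has given you the class name, you would pass it in together with the stylesheet text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks up the declarations of a bare class selector in flat CSS.
class CssClassLookup {

    // Matches "selector-list { declarations }"; assumes no nested blocks such as @media.
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{([^{}]*)\\}");

    /** Returns property/value pairs of every rule whose selector list contains exactly ".className". */
    static Map<String, String> declarationsFor(String css, String className) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher rule = RULE.matcher(css);
        while (rule.find()) {
            for (String selector : rule.group(1).split(",")) {
                if (!selector.trim().equals("." + className)) {
                    continue; // only the bare class selector is handled in this sketch
                }
                for (String declaration : rule.group(2).split(";")) {
                    int colon = declaration.indexOf(':');
                    if (colon > 0) {
                        result.put(declaration.substring(0, colon).trim(),
                                   declaration.substring(colon + 1).trim());
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String css = ".abc { color: red; font-size: 12px; } .other { color: blue; }";
        System.out.println(declarationsFor(css, "abc")); // prints {color=red, font-size=12px}
    }
}
```

For real-world stylesheets (descendant selectors, specificity, media queries) a dedicated CSS parser is the safer choice; the sketch only shows where the class name from the HTML side meets the rules from the CSS side.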
Using Java java.util.regex
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
prints abc
Related
I have an HTML text file that contains headings, and I would like to extract only the text inside them.
Example:
<h1 class="title">Fire Safety</h1>
<h1>About this book</h1>
<h1>1</h1>
<h1>Contents of this book</h1>
I would like to extract only the following text from the HTML code:
Fire Safety,
About this book,
1,
Contents of this book
I tried a lot of things, like:
Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);
where input is the html data.
I didn't get any results on the console, or sometimes I get only the href :(
How do I fix this?
Let me know!
Thanks!
I would strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, HTML Parser, etc.
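To illustrate the parser-based approach without picking one of those libraries, here is a small sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in. That substitution is my assumption, not something the libraries above require, and the class name HeadingExtractor and method h1Texts are hypothetical. The JDK parser is old (HTML 3.2 era) but lenient enough for simple heading extraction.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every <h1> using the JDK's built-in HTML parser.
class HeadingExtractor {

    /** Returns the trimmed text content of each <h1> element, in document order. */
    static List<String> h1Texts(String html) {
        List<String> headings = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside an <h1>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.H1) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.H1 && current != null) {
                    headings.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return headings;
    }

    public static void main(String[] args) {
        String html = "<h1 class=\"title\">Fire Safety</h1>\n<h1>About this book</h1>\n"
                + "<h1>1</h1>\n<h1>Contents of this book</h1>";
        System.out.println(String.join(", ", h1Texts(html)));
    }
}
```

Unlike the href regex in the question, this does not care about attributes on the heading tag, so the class="title" heading is treated the same as the plain ones.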
I have an HTML file like the following
...
<span itemprop="A">234</span>
...
<span itemprop="B">690</span>
...
From this I want to extract the values for A and B.
Can you suggest an HTML parser library for Java that can do this easily?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
System.out.println(span.text()); // will print 234 and 690
}
http://jsoup.org/
JSoup is the way to go.
JTidy is a confusingly named yet respected HTML parser.
I have a requirement where I need to get the last HREF in the HTML code, means getting the HREF in the footer of the page.
Is there any direct regular expression for the same?
No regex, use the :last jQuery selector instead.
demo :
foo
bar
var link = $("a:last");
You could use plain JavaScript for this (if you don't need it to be a jQuery object):
var links = document.links;
var lastLink = links[links.length - 1];
var lastHref = lastLink.href;
alert(lastHref);
JS Fiddle demo.
Disclaimer: the above code only works with JavaScript, as HTML itself has no regex or DOM-manipulation capability. If you need to use a different technology, please leave a comment or edit your question to include the relevant tags.
It's not a good idea to parse HTML with regular expressions. Have a look at HtmlParser to parse HTML.
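If the page has to be processed in Java rather than in the browser, the same "last link" idea can be sketched with a parser instead of a regex. The example below uses the JDK's bundled Swing HTML parser as a stand-in for HtmlParser (my assumption, chosen so the snippet needs no extra dependency); LastLinkFinder and lastHref are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: remembers the href of each <a> it sees, so the last one wins.
class LastLinkFinder {

    /** Returns the href of the last <a href="..."> in the document, or null if there is none. */
    static String lastHref(String html) {
        String[] last = new String[1]; // one-element array so the callback can write to it
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (tag == HTML.Tag.A && href != null) {
                    last[0] = href.toString();
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return last[0];
    }

    public static void main(String[] args) {
        String html = "<a href=\"/first\">foo</a> <a href=\"/last\">bar</a>";
        System.out.println(lastHref(html));
    }
}
```

Note that "last in document order" is only the footer link if the footer really is the final part of the markup.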
I'm working on an Android app, which loads a HTML page and shows it in a webview.
The problem is that I want to add my custom CSS (the loaded HTML has no CSS or link to a CSS file). How do I add the custom CSS to the HTML code using jsoup?
I can't modify the HTML.
And how can the webview open it afterwards?
Thank you
Several ways. You can use Element#append() to append some piece of HTML to the element.
Document document = Jsoup.connect(url).get();
Element head = document.head();
head.append("<link rel=\"stylesheet\" href=\"http://example.com/your.css\">");
Or, use Element#attr(name, value) to add attributes to existing elements. Here's an example which adds style="color:pink;" to all links.
Document document = Jsoup.connect(url).get();
Elements links = document.select("a");
links.attr("style", "color:pink;");
Either way, after modification get the final HTML string by Document#html().
String html = document.html();
Write it to file by PrintWriter#write() (with the right charset).
String charset = Jsoup.connect(url).execute().charset(); // the request must be executed before the response charset is available
// ...
Writer writer = new PrintWriter("/file.html", charset);
writer.write(html);
writer.close();
Finally, open it in the webview. Since I can't explain that part off the top of my head, here's a link to an example that I think is helpful: WebViewDemo.java. I found the link on this blog, by the way (which I in turn found via Google).
Probably the easiest way is to search and replace on the HTML text to insert your custom styles, before loading it into your WebView. I do this in my app BBC News to restyle the news article page slightly. My code looks like this:
text = text.replace("</head>",
"<style>h1 {font-size: x-large;} h1, div.date, div.storybody, img {margin:4px; padding:4px; line-height:1.25;}</style></head>");
Note how I search and replace on the closing head tag (including my own </head> tag in the replacement). This ensures that the new snippet goes in the right place on the page.
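A slightly more defensive variant of that search-and-replace is sketched below. It is not the code from the app; StyleInjector and injectCss are hypothetical names. The sketch matches the closing head tag case-insensitively and, as an assumed fallback, prepends the style block when the page has no closing head tag at all, a case in which the plain replace would silently do nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: inserts a <style> block into an HTML string before it is shown.
class StyleInjector {

    // Closing head tag in any letter case, allowing whitespace before '>'.
    private static final Pattern HEAD_CLOSE =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);

    /** Inserts the CSS just before the first closing head tag, or at the start if there is none. */
    static String injectCss(String html, String css) {
        String styleBlock = "<style>" + css + "</style>";
        Matcher m = HEAD_CLOSE.matcher(html);
        if (m.find()) {
            return html.substring(0, m.start()) + styleBlock + html.substring(m.start());
        }
        return styleBlock + html; // assumed fallback for pages without a closing head tag
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body><h1>Title</h1></body></html>";
        System.out.println(injectCss(page, "h1 {font-size: x-large;}"));
        // prints <html><head><style>h1 {font-size: x-large;}</style></head><body><h1>Title</h1></body></html>
    }
}
```

The resulting string can then be handed to the WebView in place of the original HTML.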
There are a few ways to include CSS in HTML.
This is what I use if you have it stored as an external file:
<head><link rel="stylesheet" type="text/css" href="mystyle.css" /></head>
If you want to put it straight in the HTML file:
<head>
<style type="text/css">
hr {color:sienna;}
p {margin-left:20px;}
body {background-image:url("images/back40.gif");}
</style>
</head>
Or if you want to modify a single tag:
<p style="color:sienna;margin-left:20px">This is a paragraph.</p>
*Edit
None of these examples should have any problem displaying.
Ref: W3 Schools CSS
I'm looking for an HTML object model in Java, capable of parsing HTML (not required) and containing all HTML elements (and CSS as well) in an elegant object model.
I'm looking for a pure Java version of Groovy's HTML builder.
(I have no luck on google with this request.)
I want to be able to perform stuff like:
HTML html = new HTML();
Body body = html.body();
Table table = body.addTable(myCssStyle);
Row row = table.addRow("a", "b", "c").withCss(cssRowStyle);
and so on...
Check out Jsoup:
Example:
(Building some html)
Document doc = Document.createShell("");
Element headline = doc.body().appendElement("h1").text("thats a headline");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("That's");
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<h1>thats a headline</h1>
<p><span>That's</span>some text ...</p>
</body>
</html>
Documentation:
Codebook
API Documentation (JavaDoc)
Jakarta ECS might be able to do what you want.
Just an idea: you could take a look at the source code of xhtmlrenderer project.
http://code.google.com/p/flying-saucer//
It's not plain HTML (it's XHTML), but may be a good starting point, don't you think?