Can anyone help with extracting CSS styles from HTML using Jsoup in Java?
For example, in the HTML below I want to extract .ft00 and .ft01:
<HTML>
<HEAD>
<TITLE>Page 1</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
.ft00{font-size:11px;font-family:Times;color:#ffffff;}
.ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>
If the style is embedded in your Element you just have to use .attr("style").
Jsoup is not an HTML renderer, it is just an HTML parser, so you will have to parse the CSS yourself from the content of the retrieved <style> tag. You can use a simple regex for this, but it won't work in all cases. You may want to use a CSS parser for this task.
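import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {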
public static void main(String[] args) throws Exception {
String html = "<HTML>\n" +
"<HEAD>\n"+
"<TITLE>Page 1</TITLE>\n"+
"<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"+
"<DIV style=\"position:relative;width:931;height:1243;\">\n"+
"<STYLE type=\"text/css\">\n"+
"<!--\n"+
" .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"+
" .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"+
"-->\n"+
"</STYLE>\n"+
"</HEAD>\n"+
"</HTML>";
Document doc = Jsoup.parse(html);
Element style = doc.select("style").first();
Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.html());
while (cssMatcher.find()) {
System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
}
}
}
Will output:
Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;
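If the individual declarations are needed rather than the raw text, the regex approach above can be taken one step further. The following is a minimal sketch, assuming the same simple `.class{...}` rules as in the question. The CssRules class is a hypothetical helper, not part of Jsoup, and like the regex above it will not cope with arbitrary CSS (comments, nested blocks, selectors other than a single class).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup class): maps class name -> (property -> value).
final class CssRules {
    // Same pattern as in the answer above: a class selector followed by one declaration block.
    private static final Pattern RULE = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]");

    static Map<String, Map<String, String>> parse(String css) {
        Map<String, Map<String, String>> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(css);
        while (m.find()) {
            Map<String, String> declarations = new LinkedHashMap<>();
            // Each declaration looks like "property:value"; split on the first colon only.
            for (String declaration : m.group(2).split(";")) {
                int colon = declaration.indexOf(':');
                if (colon > 0) {
                    declarations.put(declaration.substring(0, colon).trim(),
                                     declaration.substring(colon + 1).trim());
                }
            }
            rules.put(m.group(1), declarations);
        }
        return rules;
    }

    public static void main(String[] args) {
        // In the example above this string would come from style.html() or style.data().
        String css = "<!--\n.ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"
                   + ".ft01{font-size:11px;font-family:Times;color:#ffffff;}\n-->";
        System.out.println(parse(css).get("ft00").get("font-size"));
    }
}
```

With the rules from the question, parse(css).get("ft00").get("font-size") yields 11px.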
Try this:
Document document = Jsoup.parse(html);
String style = document.select("style").first().data();
You can then use a CSS parser to fetch the details you are interested in.
http://www.w3.org/Style/CSS/SAC
http://cssparser.sourceforge.net
https://github.com/corgrath/osbcp-css-parser#readme
I have html input in utf-8. In this input accented characters are presented as html entities. For example:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body>
</html>
My goal is to "canonicalize" the HTML by replacing HTML entities with UTF-8 characters where possible, in Java. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;.
The goal:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
I need this to make it easier to compare HTML documents in tests, and to make them easier to read with the naked eye (lots of escaped accented characters make the text very hard to read).
I don't care about CDATA sections (there is no CDATA in the inputs).
I have tried JSOUP (https://jsoup.org/) and Apache's Commons Text (https://commons.apache.org/proper/commons-text/) unsuccessfully:
public void test() throws Exception {
String html =
"<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
"</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
// this is not good, keeps only the text content
String s1 = Jsoup.parse(html).text();
System.out.println("s1: " + s1);
// this is better, but it unescapes the &lt; which is not what I want
String s2 = StringEscapeUtils.unescapeHtml4(html);
System.out.println("s2: " + s2);
}
StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes the &lt;:
<body>árvíztűrő<b</body>
How should I do it?
Here is a minimal demonstration: https://github.com/riskop/html_utf8_canon.git
Looking into the Commons Text source, it is clear that StringEscapeUtils.unescapeHtml4() delegates its work to an AggregateTranslator, which is composed of four CharSequenceTranslators:
new AggregateTranslator(
new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
);
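I need only three of these translators to fulfill my goal: everything except the LookupTranslator for EntityArrays.BASIC_UNESCAPE, which is the one that turns &lt;, &gt;, &amp; and &quot; back into markup characters.
So this is it: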
// this is what I needed!
String s3 = new AggregateTranslator(
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
).translate(html);
System.out.println("s3: " + s3);
Whole method:
@Test
public void test() throws Exception {
String html =
"<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
"</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
// this is what I needed!
CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
);
String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
System.out.println("s3: " + s3);
}
Result:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
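Where adding Commons Text is not an option, the numeric part of this can be approximated with the JDK alone. The following is only a sketch under that assumption, not the approach used above: the class name NumericEntities and its unescape method are hypothetical, it decodes numeric references such as &#369; or &#x171; only, it leaves named entities (for example &aacute;) untouched, and it keeps any numeric reference to < > & " ' escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical JDK-only helper: decodes numeric character references, nothing else.
final class NumericEntities {
    // Matches decimal (&#369;) and hexadecimal (&#x171;) character references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");
    // Characters that must stay escaped so the markup keeps its meaning.
    private static final String KEEP_ESCAPED = "<>&\"'";

    static String unescape(String html) {
        Matcher m = NUMERIC.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(html, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (KEEP_ESCAPED.indexOf(codePoint) >= 0 || !Character.isValidCodePoint(codePoint)) {
                out.append(m.group());          // leave the reference as it was
            } else {
                out.appendCodePoint(codePoint); // replace it with the real character
            }
            last = m.end();
        }
        return out.append(html, last, html.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("<body>t&#369;r&#337;&lt;b &#60;</body>"));
    }
}
```

Named entities would still need a lookup table such as the EntityArrays maps used above, so this is a complement to that approach rather than a replacement for it.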
I am using Jsoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected, apart from the fact that the <base> element is no longer closed in the modified HTML.
Is there any way to have Jsoup return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
Document document = Jsoup.parse(html);
document.outputSettings().escapeMode(EscapeMode.xhtml);
Elements baseElements = document.select("base");
if (!baseElements.isEmpty())
{
Element base = baseElements.get(0);
base.attr("href", getBaseUrl(request));
}
return document.html();
}
In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select its content attribute. I tried this, without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link= doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
As you can see, the tag I am looking for is the third meta tag, and I can't select it.
Element link= doc.select("meta").first();
This will select only the first meta-element found; since you have more than one in your second html, you'll get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
The selector meta[itemprop=datePublished] matches every meta element whose itemprop attribute has the value datePublished. Of all the matches, first() takes only the first one, and attr("content") then returns the value of that element's content attribute.
I have the following html (sized down for literary content) that is passed into a java method.
However, I want to take this passed in html string and add a <pre> tag that contains some text passed in and add a section of <script type="text/javascript"> to the head.
String buildHTML(String htmlString, String textToInject)
{
// Inject textToInject into a pre tag and add the javascript section
String newHTMLString = <build new html sections>
}
-- htmlString --
<html>
<head>
</head>
<body>
<body>
</html>
-- newHTMLString
<html>
<head>
<script type="text/javascript">
window.onload=function(){alert("hello?";}
</script>
</head>
<body>
<div id="1">
<pre>
<!-- Inject textToInject here into a newly created pre tag-->
</pre>
</div>
<body>
</html>
What is the best tool to do this from within java other than a regex?
Here's how to do this with Jsoup:
public String buildHTML(String htmlString, String textToInject)
{
// Create a document from string
Document doc = Jsoup.parse(htmlString);
// create the script tag in head
doc.head().appendElement("script")
.attr("type", "text/javascript")
.text("window.onload=function(){alert(\'hello?\';}");
// Create div tag
Element div = doc.body().appendElement("div").attr("id", "1");
// Create pre tag
Element pre = div.appendElement("pre");
pre.text(textToInject);
// Return as string
return doc.toString();
}
I've used method chaining a lot, which means that:
doc.body().appendElement(...).attr(...).text(...)
is exactly the same as
Element example = doc.body().appendElement(...);
example.attr(...);
example.text(...);
Example:
final String html = "<html>\n"
+ " <head>\n"
+ " </head>\n"
+ " <body>\n"
+ " <body>\n"
+ "</html>";
String result = buildHTML(html, "This is a test.");
System.out.println(result);
Result:
<html>
<head>
<script type="text/javascript">window.onload=function(){alert('hello?';}</script>
</head>
<body>
<div id="1">
<pre>This is a test.</pre>
</div>
</body>
</html>
I have the following code:
Document mainContent = new Document();
Element rootElement = new Element("html");
mainContent.setContent(rootElement);
Element headElement = new Element("head");
Element metaElement = new Element("meta");
metaElement.setAttribute("content", "text/html; charset=utf-8");
headElement.addContent(metaElement);
rootElement.addContent(headElement);
org.jdom2.output.Format format = org.jdom2.output.Format.getPrettyFormat().setOmitDeclaration(true);
XMLOutputter outputter = new XMLOutputter(format);
System.out.println(outputter.outputString(mainContent));
This will produce the output :
<html>
<head>
<meta content="text/html; charset=utf-8" />
</head>
</html>
Now, I have the following string:
String links = "<link src=\"mysrc1\" /><link src=\"mysrc2\" />";
How can I add it to the HTML element so the output will be:
<html>
<head>
<meta content="text/html; charset=utf-8" />
<link src="mysrc1" />
<link src="mysrc2" />
</head>
</html>
Please note that it's NOT a valid XML element altogether, but each link is a valid XML Element.
I don't mind using another XML parser if needed. I am already using HTMLCleaner elsewhere in my code, if that helps.
You can do something like they mention here. Basically place your xml snippet inside of a root element:
// Wrap the fragment in a dummy root element so that it becomes a well-formed document
String xml = "<root>" + links + "</root>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(false);
DocumentBuilder builder = factory.newDocumentBuilder();
// Note: Document, Node and NodeList here are the org.w3c.dom types, not the JDOM ones
Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
NodeList nl = doc.getDocumentElement().getChildNodes();
for (int temp = 0; temp < nl.getLength(); temp++) {
Node nNode = nl.item(temp);
// Here you create your new Element based on the Node nNode, and then add it to the new DOM you're building
}
Then parse links as a valid XML document, and extract the nodes you want (basically anything other than the root node)
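The step left as a comment in the loop can be sketched as follows. This version assumes the target document is also a JDK org.w3c.dom document rather than a JDOM one, so that Document.importNode can copy the parsed nodes across. FragmentImport, newHead and appendFragment are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: appends a multi-root XML fragment to an existing DOM element.
final class FragmentImport {
    // Builds an empty <html><head/></html> document and returns its head element.
    static Element newHead() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element head = doc.createElement("head");
            doc.appendChild(html);
            html.appendChild(head);
            return head;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static void appendFragment(Element target, String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the fragment in a dummy root so the parser sees one well-formed document.
            Document wrapped = builder.parse(
                    new InputSource(new StringReader("<root>" + fragment + "</root>")));
            NodeList children = wrapped.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // importNode makes a deep copy that belongs to the target's document.
                    target.appendChild(target.getOwnerDocument().importNode(child, true));
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        Element head = newHead();
        appendFragment(head, "<link src=\"mysrc1\" /><link src=\"mysrc2\" />");
        System.out.println(head.getChildNodes().getLength());
    }
}
```

Copying with importNode is needed because a DOM node cannot be appended directly to a document other than the one that created it.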