How to unescape HTML entities **except** &lt; &gt; &amp; &quot; &apos; in Java

I have html input in utf-8. In this input accented characters are presented as html entities. For example:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body>
</html>
My goal is to "canonicalize" the HTML in Java by replacing HTML entities with UTF-8 characters where possible. In other words, replace all entities except &lt; &gt; &amp; &quot; &apos;.
The goal:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
I need this to make the HTML easier to compare in tests and easier to read with the naked eye (lots of escaped accented characters make it very hard to read).
I don't care about CDATA sections (there is no CDATA in the inputs).
I have tried JSOUP (https://jsoup.org/) and Apache's Commons Text (https://commons.apache.org/proper/commons-text/) unsuccessfully:
public void test() throws Exception {
String html =
"<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
"</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
// this is not good, keeps only the text content
String s1 = Jsoup.parse(html).text();
System.out.println("s1: " + s1);
// this is better, but it unescapes the &lt; which is not what I want
String s2 = StringEscapeUtils.unescapeHtml4(html);
System.out.println("s2: " + s2);
}
StringEscapeUtils.unescapeHtml4() is almost what I need, but unfortunately it also unescapes the &lt;:
<body>árvíztűrő<b</body>
How should I do it?
Here is a minimal demonstration: https://github.com/riskop/html_utf8_canon.git

Looking into the Commons Text source, it is clear that StringEscapeUtils.unescapeHtml4() delegates its work to an AggregateTranslator composed of four CharSequenceTranslator instances:
new AggregateTranslator(
new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
);
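I need only the last three translators to fulfill my goal. BASIC_UNESCAPE is the table that handles &lt; &gt; &amp; and &quot;, so leaving it out keeps those entities escaped.
So this is it: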
// this is what I needed!
String s3 = new AggregateTranslator(
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
).translate(html);
System.out.println("s3: " + s3);
Whole method:
@Test
public void test() throws Exception {
String html =
"<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
"</head><body>&aacute;rv&iacute;zt&#369;r&#337;&lt;b</body></html>";
// this is what I needed!
CharSequenceTranslator UNESCAPE_HTML_EXCEPT_BASIC = new AggregateTranslator(
new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
new NumericEntityUnescaper()
);
String s3 = UNESCAPE_HTML_EXCEPT_BASIC.translate(html);
System.out.println("s3: " + s3);
}
Result:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>árvíztűrő&lt;b</body>
</html>
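If pulling in Commons Text is not an option, the same idea can be sketched with the standard library alone. This is only a minimal sketch, not the Commons Text implementation: the class name EntityCanonicalizer and its three-entry NAMED table are made up for illustration, and a real table would need every named entity you expect in the input. It also leaves numeric references such as &#60; untouched when they decode to one of the five reserved characters, which, as far as I can tell, NumericEntityUnescaper would decode.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityCanonicalizer {
    // Hypothetical, deliberately tiny table; extend it for real input.
    private static final Map<String, String> NAMED = Map.of(
            "aacute", "\u00E1",
            "iacute", "\u00ED",
            "eacute", "\u00E9");

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Characters that must stay escaped.
    private static final String RESERVED = "<>&\"'";

    static String canonicalize(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = decode(m.group(1));
            // Keep the original reference when it is unknown or reserved.
            if (replacement == null || RESERVED.contains(replacement)) {
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null when the reference is not recognized.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            if (!Character.isValidCodePoint(codePoint)) {
                return null;
            }
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Calling EntityCanonicalizer.canonicalize(html) on the input from the question decodes the accented characters and keeps &lt; as it is. Named entities missing from the table are passed through unchanged rather than dropped.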

Related

Print from WebView by setting Paper size Javafx

I generate HTML dynamically for invoices and use a WebView to preview and print them.
For thermal printers I have to set the page width to 88mm, so I create a Paper object manually through the PrintHelper class. I know this is not a good idea, but I see no other option. The printed result has a significant margin on the left.
This does not happen when the width of the Paper is >= 90mm.
HTML Page
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p>This is a really long text to show the width and the margins of page printed on PDF</p>
</body>
</html>
Java code to generate PDF
public void printPDF(Window window, WebEngine engine) {
var printerJob = PrinterJob.createPrinterJob();
if (printerJob == null) return;
boolean success = printerJob.showPrintDialog(window);
var jobSetting = printerJob.getJobSettings();
var paper = PrintHelper.createPaper("Thermal88mm", 88, 300, Units.MM);
jobSetting.setPageLayout(printerJob.getPrinter().createPageLayout(paper, PageOrientation.PORTRAIT, Printer.MarginType.HARDWARE_MINIMUM));
if (success) {
engine.print(printerJob);
printerJob.endJob();
}
}
I hope anyone can help me.

Velocity and many objects in model

I'm trying to send email using Velocity templates. Here is my template:
#set ($subject = "new message in system $message.mode")
#set ($adress = "@email.com")
#set ($from = "$user.login$adress")
#set ($to = "$interlocutor.login$adress")
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<h3>hi you have new message.</h3>
</body>
</html>
And this is how I add elements to the model:
model.put("user", interviewer);
model.put("interlocutor", interlocutor);
model.put("message", mode);
Everything works, except that when I print the values with
System.out.println((String)velocityContext.get("to") + " " + (String)velocityContext.get("subject") +" " + (String)velocityContext.get("from")+ " ");
I receive
$interlocutor.login@email.com new message in system $message.mode testlogin@email.com
So it looks like Velocity can handle only one object in the model, and the others are invisible to the engine. Do you have any clue what's wrong?

How to inject snippets of html into an string containing valid html?

I have the following HTML (cut down for brevity) that is passed into a Java method.
I want to take this HTML string, add a <pre> tag containing some text that is also passed in, and add a <script type="text/javascript"> section to the head.
String buildHTML(String htmlString, String textToInject)
{
// Inject textToInject into a pre tag and add the JavaScript section
String newHTMLString = <build new html sections>
}
-- htmlString --
<html>
<head>
</head>
<body>
<body>
</html>
-- newHTMLString
<html>
<head>
<script type="text/javascript">
window.onload=function(){alert("hello?";}
</script>
</head>
<body>
<div id="1">
<pre>
<!-- Inject textToInject here into a newly created pre tag-->
</pre>
</div>
<body>
</html>
What is the best tool to do this from within java other than a regex?
Here's how to do this with Jsoup:
public String buildHTML(String htmlString, String textToInject)
{
// Create a document from string
Document doc = Jsoup.parse(htmlString);
// create the script tag in head
doc.head().appendElement("script")
.attr("type", "text/javascript")
.text("window.onload=function(){alert(\'hello?\';}");
// Create div tag
Element div = doc.body().appendElement("div").attr("id", "1");
// Create pre tag
Element pre = div.appendElement("pre");
pre.text(textToInject);
// Return as string
return doc.toString();
}
I've used method chaining a lot, which means that:
doc.body().appendElement(...).attr(...).text(...)
is exactly the same as
Element example = doc.body().appendElement(...);
example.attr(...);
example.text(...);
Example:
final String html = "<html>\n"
+ " <head>\n"
+ " </head>\n"
+ " <body>\n"
+ " <body>\n"
+ "</html>";
String result = buildHTML(html, "This is a test.");
System.out.println(result);
Result:
<html>
<head>
<script type="text/javascript">window.onload=function(){alert('hello?';}</script>
</head>
<body>
<div id="1">
<pre>This is a test.</pre>
</div>
</body>
</html>

Unable to redirect JSP

I am using the sendRedirect() method, but it doesn't redirect. Please have a look at the following code:
<%@page import="utility.ConnectionClass,java.sql.* "%>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>processadmin</title>
</head>
<body>
<%
Connection con=null;
ConnectionClass obj=new ConnectionClass();
con=obj.createConnection(con);
String user=request.getParameter("user");
String pass=request.getParameter("pass");
String sql="select * from admin where username='"+user+"'";
Statement stat=con.createStatement();
ResultSet rs=stat.executeQuery(sql);
rs.next();
if((rs.getString(1)==user)&&(rs.getString(2)==pass))
response.sendRedirect("processadmin.jsp");
else
out.println("Not working");
%>
</body>
</html>
When I run this, I get the output: Not working
Compare Strings using the equals() method. == compares String references, not the actual contents of the Strings.
if(user.equals(rs.getString(1)) && pass.equals(rs.getString(2)))
Note: Please don't use scriptlets in JSP. It is bad practice.
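The difference between == and equals() can be sketched in isolation. This is a small standalone example, not part of the JSP above: the class StringCompareDemo and the helper sameText are hypothetical names, and new String(...) is used only to force two distinct objects with the same contents, similar to a request parameter and a database value. The helper uses Objects.equals so that a null value (for example a missing request parameter) does not throw a NullPointerException.

```java
import java.util.Objects;

final class StringCompareDemo {
    // Null-safe comparison of the contents of two strings.
    static boolean sameText(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        // Two distinct String objects holding the same characters.
        String fromRequest = new String("admin");
        String fromDatabase = new String("admin");

        System.out.println(fromRequest == fromDatabase);      // false: different references
        System.out.println(fromRequest.equals(fromDatabase)); // true: same contents
        System.out.println(sameText(null, fromDatabase));     // false, and no exception
    }
}
```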

Extract CSS Styles from HTML using JSOUP in JAVA

Can anyone help with extracting CSS styles from HTML using Jsoup in Java?
For example, in the HTML below I want to extract .ft00 and .ft01:
<HTML>
<HEAD>
<TITLE>Page 1</TITLE>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
.ft00{font-size:11px;font-family:Times;color:#ffffff;}
.ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>
If the style is set inline on your Element, you just have to use .attr("style").
Jsoup is not an HTML renderer, it is just an HTML parser, so you will have to parse the text content of the retrieved <style> tag yourself. You can use a simple regex for this, but it won't work in all cases. You may want to use a CSS parser for this task.
public class Test {
public static void main(String[] args) throws Exception {
String html = "<HTML>\n" +
"<HEAD>\n"+
"<TITLE>Page 1</TITLE>\n"+
"<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"+
"<DIV style=\"position:relative;width:931;height:1243;\">\n"+
"<STYLE type=\"text/css\">\n"+
"<!--\n"+
" .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"+
" .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"+
"-->\n"+
"</STYLE>\n"+
"</HEAD>\n"+
"</HTML>";
Document doc = Jsoup.parse(html);
Element style = doc.select("style").first();
Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.html());
while (cssMatcher.find()) {
System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
}
}
}
Will output:
Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;
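If you also need the individual properties, the declaration string captured above (for example font-size:11px;font-family:Times;color:#ffffff;) can be split further. The following is a minimal sketch using only the standard library. The class CssDeclarations is a hypothetical helper, and it assumes simple declarations with no semicolons inside values (so no url(...) with data URIs or quoted strings containing ';'). For anything beyond that, a real CSS parser is the safer choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CssDeclarations {
    // Splits "prop:value;prop:value;" into an ordered property-to-value map.
    static Map<String, String> parse(String declarations) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String declaration : declarations.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed fragments
            }
            String property = declaration.substring(0, colon).trim();
            String value = declaration.substring(colon + 1).trim();
            if (!property.isEmpty()) {
                result.put(property, value);
            }
        }
        return result;
    }
}
```

With the loop above, CssDeclarations.parse(cssMatcher.group(2)) would give one map per rule, and a lookup such as get("color") returns the value for that property.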
Try this:
Document document = Jsoup.parse(html);
String style = document.select("style").first().data();
You can then use a CSS parser to fetch the details you are interested in.
http://www.w3.org/Style/CSS/SAC
http://cssparser.sourceforge.net
https://github.com/corgrath/osbcp-css-parser#readme
