API design for Java String results which contain charset-specific data - java

In the API of a document converter, which generates HTML (or XHTML), I want to expose these methods:
// Convert the input file to a file using the specified charset
void convert(File in, File out, Charset charset);
// Convert the input document to a string using the specified charset
String convert(String in, Charset charset);
Client code cannot produce faulty documents with the file-based method: it always writes the result document in the specified charset.
The String-based method will obviously lead to problems if the client code does not respect the chosen charset, for example if the charset parameter is ISO-8859-1 but the result String is served as UTF-8 content in a web application:
String html = convert(getInputDocument(), ISO_8859_1);
...
response.setContentType("text/html;charset=UTF-8");
response.setCharacterEncoding("UTF-8");
try (PrintWriter out = response.getWriter()) {
out.print(html);
}
Question: which options should I consider to design the API so that users are guided to correct usage of the result string?
deprecate the method and provide a method which returns a byte array
use method names which contain the encoding (convertToUTF_8, convertToISO_8859_1 ...)
The result string could for example be
<!DOCTYPE html>
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Untitled document</title>
</head>
<body>
<p>Motörhead</p>
</body>
</html>

I don't know your exact use case, but one possibility is to wrap the document in a proper object (instead of returning just a String):
public interface Document {
void writeTo(ServletResponse response);
}
This way you can retain all control of how that "string" can be written to different targets.
I'm not sure whether you need a convert method at all, since the document could convert its content automatically if it sees that the response already has a different encoding. But even if you do need one, you could do it this way:
public interface Document {
void writeTo(ServletResponse response);
Document convert(Charset targetCharset);
}
This would return a new document which is of a different charset.
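A minimal sketch of the same idea without the servlet dependency, assuming a hypothetical EncodedDocument class (the name and its methods are illustrative, not part of the converter's actual API). The result keeps its charset next to the text, so callers obtain the bytes and the matching Content-Type value from the same object:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical result type: the text and the charset it was generated for travel together.
final class EncodedDocument {
    private final String html;
    private final Charset charset;

    EncodedDocument(String html, Charset charset) {
        this.html = html;
        this.charset = charset;
    }

    // The charset the document was generated for (the one declared in its META tag).
    Charset charset() {
        return charset;
    }

    // Value for the HTTP Content-Type header, always consistent with toBytes().
    String contentType() {
        return "text/html;charset=" + charset.name();
    }

    // Bytes encoded with the document's own charset, ready for an OutputStream.
    byte[] toBytes() {
        return html.getBytes(charset);
    }

    public static void main(String[] args) {
        EncodedDocument doc =
                new EncodedDocument("<p>Mot\u00f6rhead</p>", StandardCharsets.ISO_8859_1);
        System.out.println(doc.contentType());
        System.out.println(doc.toBytes().length + " bytes");
    }
}
```

With a type like this, client code has no bare String whose charset it could get wrong: it would set the header from contentType() and write toBytes() to the response output stream.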

Related

Java GB2312 string in HTML does not display correctly

I am trying to read HTML from Chinese websites and extract their <title> value. Websites that use UTF-8 work fine, but GB2312 websites do not (for example, m.39.net shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).
Here is the code I use to accomplish that:
URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);
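Using String content = IOUtils.toString(inputStream, "GB2312"); should help.
If you want to detect the charset of a webpage, there are three ways as far as I know:
read the charset parameter of the Content-Type HTTP header via connection.getContentType() (note that connection.getContentEncoding() returns the Content-Encoding header, such as gzip, not the charset);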
parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (have to download the HTML content first and then read several lines);
use 3rd party libraries. E.g. those mentioned in this question.
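The first option can be sketched as follows. This is a minimal example under stated assumptions: the ContentTypeCharset class and its fromContentType method are hypothetical names, and the parsing is deliberately simple (it only looks for a charset= parameter).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: extracts the charset from a Content-Type header value.
final class ContentTypeCharset {
    // Returns the charset named in the header value, or the fallback when none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback instead.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(
                fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

In the code from the question, the header value would come from connection.getContentType(), and the resulting charset name could then be passed to IOUtils.toString.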
Have you seen http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html
toString(byte[] input, String encoding)

How to escape HTML by default in StringTemplate?

It is very good practice in HTML template engines to HTML-escape by default placeholder text to help prevent XSS (cross-site scripting) attacks. Is it possible to achieve this behavior in StringTemplate?
I have tried to register custom AttributeRenderer which escapes HTML unless format is "raw":
stg.registerRenderer(String.class, new AttributeRenderer() {
@Override
public String toString(Object o, String format, Locale locale) {
String s = (String)o;
return Objects.equals(format, "raw") ? s : StringEscapeUtils.escapeHtml4(s);
}
});
But it fails because in this case StringTemplate escapes not only the placeholder text but also the template text itself. For example, this template:
example(title, content) ::= <<
<html>
<head>
<title>$title$</title>
</head>
<body>
$content; format = "raw"$
</body>
</html>
>>
Is rendered as:
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Example Title&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
<p>Not escaped because of <code>format = "raw"</code>.</p>
&lt;/body&gt;
&lt;/html&gt;
Can anybody help?
There is no good solution to encode by default. The template is passed through the AttributeRenderer for the string data type, and there is no context information to detect if it is processing the template or a variable. So all strings, including the template, are encoded by default since you cannot specify "raw" for the template.
An alternative solution is to use format="xml-encode" in the variables that need to be encoded. The built-in StringRenderer has support for several formats:
upper
lower
cap
url-encode
xml-encode
So your example would be:
example(title, content) ::= <<
<html>
<head>
<title>$title; format="xml-encode"$</title>
</head>
<body>
$content$
</body>
</html>
>>
In order to encode by default, you have limited options. The alternatives are:
Use a custom data type (not String) for your variables, so you can register your HtmlEscapeStringRenderer for the custom data type. This is difficult if you use complex objects as variables that are already using standard strings.
Add the raw and the escaped variables to the model manually, e.g. add title (escaped) and title_raw (raw). You do not need a custom AttributeRenderer in this case. StringTemplate has a strict view/model separation and you need to have the model populated before it is rendered with both the raw and escaped values.
Neither option is particularly desirable, but I do not see any other alternatives with StringTemplate4.
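The second alternative could look roughly like this. It is a sketch only: EscapedModel, escapeHtml and put are hypothetical names, and the hand-written escaping stands in for a library call such as StringEscapeUtils.escapeHtml4.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that fills a model with both escaped and raw values.
final class EscapedModel {
    // Minimal HTML escaping for text content and attribute values.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Stores the escaped value under `name` and the untouched value under `name_raw`.
    static void put(Map<String, Object> model, String name, String value) {
        model.put(name, escapeHtml(value));
        model.put(name + "_raw", value);
    }

    public static void main(String[] args) {
        Map<String, Object> model = new LinkedHashMap<>();
        put(model, "title", "<b>Tom & Jerry</b>");
        System.out.println(model);
    }
}
```

Each map entry would then be added to the template as an attribute, so $title$ is safe by default and $title_raw$ is the explicit opt-out.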
The answer is to revert to StringTemplate v3.

Write HTML file using Java

I want my Java application to write HTML code in a file. Right now, I am hard coding HTML tags using java.io.BufferedWriter class. For Example:
BufferedWriter bw = new BufferedWriter(new FileWriter(file));
bw.write("<html><head><title>New Page</title></head><body><p>This is Body</p></body></html>");
bw.close();
Is there any easier way to do this, as I have to create tables and it is becoming very inconvenient?
If you want to do that yourself, without using any external library, a clean way would be to create a template.html file with all the static content, like for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>$title</title>
</head>
<body>$body
</body>
</html>
Put a tag like $tag for any dynamic content and then do something like this:
File htmlTemplateFile = new File("path/template.html");
String htmlString = FileUtils.readFileToString(htmlTemplateFile);
String title = "New Page";
String body = "This is Body";
htmlString = htmlString.replace("$title", title);
htmlString = htmlString.replace("$body", body);
File newHtmlFile = new File("path/new.html");
FileUtils.writeStringToFile(newHtmlFile, htmlString);
Note: I used org.apache.commons.io.FileUtils for simplicity.
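For comparison, here is a sketch of the same placeholder replacement using only the JDK and an explicit UTF-8 charset, so the file content matches the charset declared in the template's meta tag. TemplateFiller and its method names are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: fills the $title and $body placeholders of the template above.
final class TemplateFiller {
    // Pure string step: replaces both placeholders in the template text.
    static String fill(String template, String title, String body) {
        return template.replace("$title", title).replace("$body", body);
    }

    // Reads the template and writes the result, both explicitly as UTF-8.
    static void fillFile(Path templateFile, Path outputFile, String title, String body)
            throws IOException {
        String template = new String(Files.readAllBytes(templateFile), StandardCharsets.UTF_8);
        String html = fill(template, title, body);
        Files.write(outputFile, html.getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that the values are inserted as-is, so text coming from users would need HTML escaping before it is passed in.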
A few months ago I had the same problem, and every library I found provided too much functionality and complexity for my goal. So I ended up developing my own library, HtmlFlow, which provides a very simple and intuitive API for writing HTML in a fluent style. Check it here: https://github.com/fmcarvalho/HtmlFlow (it also supports dynamic binding to HTML elements)
Here is an example of binding the properties of a Task object to HTML elements. Consider a Task Java class with three properties: Title, Description and Priority. We can then produce an HTML document for a Task object in the following way:
import htmlflow.HtmlView;
import model.Priority;
import model.Task;
import java.awt.Desktop;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;
public class App {
private static HtmlView<Task> taskDetailsView(){
HtmlView<Task> taskView = new HtmlView<>();
taskView
.head()
.title("Task Details")
.linkCss("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css");
taskView
.body().classAttr("container")
.heading(1, "Task Details")
.hr()
.div()
.text("Title: ").text(Task::getTitle)
.br()
.text("Description: ").text(Task::getDescription)
.br()
.text("Priority: ").text(Task::getPriority);
return taskView;
}
public static void main(String [] args) throws IOException{
HtmlView<Task> taskView = taskDetailsView();
Task task = new Task("Special dinner", "Have dinner with someone!", Priority.Normal);
try(PrintStream out = new PrintStream(new FileOutputStream("Task.html"))){
taskView.setPrintStream(out).write(task);
Desktop.getDesktop().browse(URI.create("Task.html"));
}
}
}
You can use jsoup, or wffweb (which is HTML5-based).
Sample code for jsoup:
Document doc = Jsoup.parse("<html></html>");
doc.body().addClass("body-styles-cls");
doc.body().appendElement("div");
System.out.println(doc.toString());
prints
<html>
<head></head>
<body class=" body-styles-cls">
<div></div>
</body>
</html>
Sample code for wffweb:-
Html html = new Html(null) {{
new Head(this);
new Body(this,
new ClassAttribute("body-styles-cls"));
}};
Body body = TagRepository.findOneTagAssignableToTag(Body.class, html);
body.appendChild(new Div(null));
System.out.println(html.toHtmlString());
//directly writes to file
html.toOutputStream(new FileOutputStream("/home/user/filepath/filename.html"), "UTF-8");
prints (in minified format):-
<html>
<head></head>
<body class="body-styles-cls">
<div></div>
</body>
</html>
Velocity is a good candidate for writing this kind of stuff.
It allows you to keep your html and data-generation code as separated as possible.
I would highly recommend you use a very simple templating language such as Freemarker
It really depends on the type of HTML file you're creating.
For such tasks, I usually create an object, serialize it to XML, then transform it with XSL. The pros of this approach are:
The strict separation between source code and HTML template,
The possibility to edit HTML without having to recompile the application,
The ability to serve different HTML in different cases based on the same XML, or even serve XML directly when needed (for a further deserialization for example),
The shorter amount of code to write.
The cons are:
You must know XSLT and know how to implement it in Java.
You must write XSLT (and it's torture for many developers).
When transforming XML to HTML with XSLT, some parts may be tricky. A few examples: <textarea/> tags (which make the page unusable), the XML declaration (which can cause problems with IE), whitespace (with <pre></pre> tags etc.), HTML entities (&nbsp;), etc.
The performance will be reduced, since serialization to XML wastes lots of CPU resources and XSL transformation is very costly too.
Now, if your HTML is very short or very repetitive, or if it has a volatile structure which changes dynamically, this approach should not be considered. On the other hand, if you serve HTML files which all have a similar structure and you want to reduce the amount of Java code and use templates, this approach may work.
I also had trouble finding something simple that satisfied my needs, so I decided to write my own library (with an MIT license).
It's mainly based on composite and builder pattern.
A basic declarative example is:
import static com.github.manliogit.javatags.lang.HtmlHelper.*;
html5(attr("lang -> en"),
head(
meta(attr("http-equiv -> Content-Type", "content -> text/html; charset=UTF-8")),
title("title"),
link(attr("href -> xxx.css", "rel -> stylesheet"))
)
).render();
A fluent example is:
ul()
.add(li("item 1"))
.add(li("item 2"))
.add(li("item 3"))
You can check more examples here
I also created an online converter that transforms any HTML snippet (from a complex Bootstrap template to a simple single snippet) on the fly (i.e. html -> javatags)
Templates and other methods based on preliminary creation of the document in memory are likely to impose certain limits on resulting document size.
Meanwhile a very straightforward and reliable write-on-the-fly approach to creation of plain HTML exists, based on a SAX handler and default XSLT transformer, the latter having intrinsic capability of HTML output:
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("myfile.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory saxFactory =
(SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler tHandler = saxFactory.newTransformerHandler();
tHandler.setResult(streamResult);
Transformer transformer = tHandler.getTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
writer.write("<!DOCTYPE html>\n");
writer.flush();
tHandler.startDocument();
tHandler.startElement("", "", "html", new AttributesImpl());
tHandler.startElement("", "", "head", new AttributesImpl());
tHandler.startElement("", "", "title", new AttributesImpl());
tHandler.characters("Hello".toCharArray(), 0, 5);
tHandler.endElement("", "", "title");
tHandler.endElement("", "", "head");
tHandler.startElement("", "", "body", new AttributesImpl());
tHandler.startElement("", "", "p", new AttributesImpl());
tHandler.characters("5 > 3".toCharArray(), 0, 5); // note '>' character
tHandler.endElement("", "", "p");
tHandler.endElement("", "", "body");
tHandler.endElement("", "", "html");
tHandler.endDocument();
writer.close();
Note that XSLT transformer will release you from the burden of escaping special characters like >, as it takes necessary care of it by itself.
And it is easy to wrap SAX methods like startElement() and characters() to something more convenient to one's taste...
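One possible shape for such a wrapper is sketched below. HtmlSaxWriter and its open/text/close/finish methods are hypothetical names chosen for this example; the JAXP and SAX calls are the same ones used above, except that the output method is set before setResult is called.

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical convenience wrapper around the TransformerHandler calls shown above.
final class HtmlSaxWriter {
    private final TransformerHandler handler;

    HtmlSaxWriter(Writer out) throws TransformerConfigurationException, SAXException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        handler = factory.newTransformerHandler();
        // Configure HTML output before attaching the result.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.setResult(new StreamResult(out));
        handler.startDocument();
    }

    // Opens an element without attributes.
    HtmlSaxWriter open(String tag) throws SAXException {
        handler.startElement("", "", tag, new AttributesImpl());
        return this;
    }

    // Writes text; special characters are escaped by the serializer.
    HtmlSaxWriter text(String text) throws SAXException {
        handler.characters(text.toCharArray(), 0, text.length());
        return this;
    }

    // Closes the element with the given name.
    HtmlSaxWriter close(String tag) throws SAXException {
        handler.endElement("", "", tag);
        return this;
    }

    // Ends the document; call once after the last element is closed.
    void finish() throws SAXException {
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        HtmlSaxWriter html = new HtmlSaxWriter(out);
        html.open("html").open("body").open("p").text("a < b")
                .close("p").close("body").close("html");
        html.finish();
        System.out.println(out);
    }
}
```

The wrapper still streams: nothing is buffered in the class itself, so the document size is limited only by the underlying Writer.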
If you are willing to use Groovy, the MarkupBuilder is very convenient for this sort of thing, but I don't know that Java has anything like it.
http://groovy.codehaus.org/Creating+XML+using+Groovy's+MarkupBuilder
If this is becoming repetitive work, consider code reuse: write functions that each emit a small building block of HTML. For example, a function could take a string, wrap it in a paragraph tag, and return the result. You would also need some way of deciding where each block is attached in the page (how else would the function know where to put the paragraph?). Let me know if anything is unclear.
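A minimal sketch of such building-block functions, including one for the tables mentioned in the question. HtmlBlocks and its method names are made up for this example, and the caller decides where each block goes by concatenating the returned strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical building-block helpers that return small pieces of HTML.
final class HtmlBlocks {
    // Escapes the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps the text in a paragraph element.
    static String p(String text) {
        return "<p>" + escape(text) + "</p>";
    }

    // Builds a table with one <tr> per row and one <td> per cell.
    static String table(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (List<String> row : rows) {
            sb.append("<tr>");
            for (String cell : row) {
                sb.append("<td>").append(escape(cell)).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        String body = p("This is Body")
                + table(Arrays.asList(Arrays.asList("1", "2"), Arrays.asList("3", "4")));
        System.out.println("<html><body>" + body + "</body></html>");
    }
}
```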
Try the ujo-web library, which supports building HTML pages using the Element class. Here is a sample use case based on a Java servlet:
@WebServlet("/form-servlet")
public class FormServlet extends HttpServlet {
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws IOException {
try (HtmlElement html = HtmlElement.niceOf(response, "/style.css")) {
try (Element body = html.addBody()) {
body.addHeading("Simple form");
try (Element form = body.addForm("form-inline")) {
form.addLabel("control-label").addText("Note:");
form.addInput("form-control", "col-lg-1")
.setNameValue(NOTE, NOTE.of(request));
form.addSubmitButton("btn", "btn-primary")
.addText("Submit");
}
}
}
}
enum Attrib implements HttpParameter {
NOTE;
@Override public String toString() {
return name().toLowerCase();
}
}
}
The servlet generates the following HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Demo</title>
<link href="/css/regexp.css" rel="stylesheet"/>
</head>
<body>
<h1>Simple form</h1>
<form class="form-inline">
<label class="control-label">Note:</label>
<input class="form-control col-lg-1" name="note" value=""/>
<button class="btn btn-primary" type="submit">Submit</button>
</form>
</body>
</html>
See more information here: https://ujorm.org/www/web/

Java String Encoding to UTF-8

I have some HTML code that I store in a java.lang.String variable. I write that variable to a file on the filesystem, setting the encoding to UTF-8 when writing. When I open that file, everything looks great, e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a jsp page to render content in a browser, characters such as → show up as a question mark (?)
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes, charset);
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.
but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%@ page pageEncoding="UTF-8" %>
This has actually the same effect as setting the following meta tag in HTML <head>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
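A small sketch of where the question mark comes from, assuming the response ends up being written as ISO-8859-1 (typically the default when no encoding is declared). The arrow has no ISO-8859-1 representation, so the encoder substitutes the byte for '?'. ArrowEncodingDemo and hex are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the same character encodes in two charsets.
final class ArrowEncodingDemo {
    // Encodes the text with the given charset and formats the bytes as lowercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arrow = "\u2192"; // the right arrow from the question
        System.out.println(hex(arrow, StandardCharsets.UTF_8));      // prints "e2 86 92"
        System.out.println(hex(arrow, StandardCharsets.ISO_8859_1)); // prints "3f", the byte for '?'
    }
}
```

The String itself is fine in both cases; the character is lost only at the point where it is encoded for the response.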
Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.
The lazy developer (=me) uses StringEscapeUtils.escapeHtml from Apache Commons Lang (http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String)), which converts all 'odd' characters to HTML entities. The browser then does the final translation of the entities.

Java UTF-8 encoding problem

I am using an HTML parser called HtmlCleaner to parse HTML pages.
The problem is that each page has a different encoding.
My question:
Can I change from any character encoding to UTF-8?
You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. Just check the HTTP response header which encoding it is using (if you're obtaining those HTML pages by HTTP) and then use the appropriate encoding in your HTML parser tool.
Where do you get the HTML page from? If you get it from the servlet request, you can use getReader() on it and pass that to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it via an HTTP client, you need to check the charset of the response's Content-Type header using getResponseCharSet().
Can I change from any character encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
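Both steps (transcoding with a known source charset and updating the declared charset) can be sketched as below. HtmlTranscoder and toUtf8 are hypothetical names, and the regular expression is a naive assumption: it rewrites the first charset=... occurrence it finds, which could also match ordinary page text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts a page from a known charset to UTF-8.
final class HtmlTranscoder {
    // Decodes with the known source charset, rewrites the declared charset,
    // and re-encodes the page as UTF-8.
    static byte[] toUtf8(byte[] page, Charset source) {
        String html = new String(page, source);
        // Naive update of the first charset declaration (meta tag assumed to come first).
        String updated =
                html.replaceFirst("(?i)(charset\\s*=\\s*[\"']?)[\\w.:-]+", "$1UTF-8");
        return updated.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\"><p>Mot\u00f6rhead</p>";
        byte[] utf8 = toUtf8(page.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The source charset still has to be known beforehand, for example from the HTTP response header, exactly as described in the answer above.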
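// Replaces every character in the range 161-255 with its numeric character reference.
public String arreglarString(String cadena) {
    for (int i = 161; i < 256; i++) {
        char car = (char) i;
        // A numeric character reference has the form &#NNN; (with the closing semicolon).
        cadena = cadena.replace(String.valueOf(car), "&#" + i + ";");
    }
    return cadena;
}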
