I am trying to read in HTML from Chinese websites and get their <title> value. All the websites with UTF-8 encoding works fine, but not for GB2312 websites (for example, m.39.net, which shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).
Here is the code I use to accomplish that:
URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);
String content = IOUtils.toString(inputStream, "GB2312"); may do the help.
If you want to detect the charset of a webpage, there are 3 ways as far as I know:
use connection.getContentEncoding() to get the charset described in the HTTP header;
parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (have to download the HTML content first and then read several lines);
use 3rd party libraries. E.g. those mentioned in this question.
Have you seen http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html
toString(byte[] input, String encoding)
It is very good practice in HTML template engines to HTML-escape by default placeholder text to help prevent XSS (cross-site scripting) attacks. Is it possible to achieve this behavior in StringTemplate?
I have tried to register custom AttributeRenderer which escapes HTML unless format is "raw":
stg.registerRenderer(String.class, new AttributeRenderer() {
#Override
public String toString(Object o, String format, Locale locale) {
String s = (String)o;
return Objects.equals(format, "raw") ? s : StringEscapeUtils.escapeHtml4(s);
}
});
But it fails because in this case StringTemlate escapes not only placeholder text but also template text itself. For example this template:
example(title, content) ::= <<
<html>
<head>
<title>$title$</title>
</head>
<body>
$content; format = "raw"$
</body>
</html>
>>
Is rendered as:
<html>
<head>
<title>Example Title</title>
</head>
<body>
<p>Not escaped because of <code>format = "raw"</code>.</p>
</body>
</html>
Can anybody help?
There is no good solution to encode by default. The template is passed through the AttributeRenderer for the string data type, and there is no context information to detect if it is processing the template or a variable. So all strings, including the template, are encoded by default since you cannot specify "raw" for the template.
An alternative solution is to use format="xml-encode" in the variables that need to be encoded. The built-in StringRenderer has support for several formats:
upper
lower
cap
url-encode
xml-encode
So your example would be:
example(title, content) ::= <<
<html>
<head>
<title>$title; format="xml-encode"$</title>
</head>
<body>
$content$
</body>
</html>
>>
In order to encode by default, you have limited options. The alternatives are:
Use a custom data type (not String) for your variables, so you can register your HtmlEscapeStringRenderer for the custom data type. This is difficult if you use complex objects as variables that are already using standard strings.
Add the raw and the escaped variables to the model manually, e.g. add title (escaped) and title_raw (raw). You do not need a custom AttributeRenderer in this case. StringTemplate has a strict view/model separation and you need to have the model populated before it is rendered with both the raw and escaped values.
Neither option is particularly desirable, but I do not see any other alternatives with StringTemplate4.
The answer is to revert to StringTemplate v3.
I want my Java application to write HTML code in a file. Right now, I am hard coding HTML tags using java.io.BufferedWriter class. For Example:
BufferedWriter bw = new BufferedWriter(new FileWriter(file));
bw.write("<html><head><title>New Page</title></head><body><p>This is Body</p></body></html>");
bw.close();
Is there any easier way to do this, as I have to create tables and it is becoming very inconvenient?
If you want to do that yourself, without using any external library, a clean way would be to create a template.html file with all the static content, like for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>$title</title>
</head>
<body>$body
</body>
</html>
Put a tag like $tag for any dynamic content and then do something like this:
File htmlTemplateFile = new File("path/template.html");
String htmlString = FileUtils.readFileToString(htmlTemplateFile);
String title = "New Page";
String body = "This is Body";
htmlString = htmlString.replace("$title", title);
htmlString = htmlString.replace("$body", body);
File newHtmlFile = new File("path/new.html");
FileUtils.writeStringToFile(newHtmlFile, htmlString);
Note: I used org.apache.commons.io.FileUtils for simplicity.
A few months ago I had the same problem and every library I found provides too much functionality and complexity for my final goal. So I end up developing my own library - HtmlFlow - that provides a very simple and intuitive API that allows me to write HTML in a fluent style. Check it here: https://github.com/fmcarvalho/HtmlFlow (it also supports dynamic binding to HTML elements)
Here is an example of binding the properties of a Task object into HTML elements. Consider a Task Java class with three properties: Title, Description and a Priority and then we can produce an HTML document for a Task object in the following way:
import htmlflow.HtmlView;
import model.Priority;
import model.Task;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
public class App {
private static HtmlView<Task> taskDetailsView(){
HtmlView<Task> taskView = new HtmlView<>();
taskView
.head()
.title("Task Details")
.linkCss("https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css");
taskView
.body().classAttr("container")
.heading(1, "Task Details")
.hr()
.div()
.text("Title: ").text(Task::getTitle)
.br()
.text("Description: ").text(Task::getDescription)
.br()
.text("Priority: ").text(Task::getPriority);
return taskView;
}
public static void main(String [] args) throws IOException{
HtmlView<Task> taskView = taskDetailsView();
Task task = new Task("Special dinner", "Have dinner with someone!", Priority.Normal);
try(PrintStream out = new PrintStream(new FileOutputStream("Task.html"))){
taskView.setPrintStream(out).write(task);
Desktop.getDesktop().browse(URI.create("Task.html"));
}
}
}
You can use jsoup or wffweb (HTML5) based.
Sample code for jsoup:-
Document doc = Jsoup.parse("<html></html>");
doc.body().addClass("body-styles-cls");
doc.body().appendElement("div");
System.out.println(doc.toString());
prints
<html>
<head></head>
<body class=" body-styles-cls">
<div></div>
</body>
</html>
Sample code for wffweb:-
Html html = new Html(null) {{
new Head(this);
new Body(this,
new ClassAttribute("body-styles-cls"));
}};
Body body = TagRepository.findOneTagAssignableToTag(Body.class, html);
body.appendChild(new Div(null));
System.out.println(html.toHtmlString());
//directly writes to file
html.toOutputStream(new FileOutputStream("/home/user/filepath/filename.html"), "UTF-8");
prints (in minified format):-
<html>
<head></head>
<body class="body-styles-cls">
<div></div>
</body>
</html>
Velocity is a good candidate for writing this kind of stuff.
It allows you to keep your html and data-generation code as separated as possible.
I would highly recommend you use a very simple templating language such as Freemarker
It really depends on the type of HTML file you're creating.
For such tasks, I use to create an object, serialize it to XML, then transform it with XSL. The pros of this approach are:
The strict separation between source code and HTML template,
The possibility to edit HTML without having to recompile the application,
The ability to serve different HTML in different cases based on the same XML, or even serve XML directly when needed (for a further deserialization for example),
The shorter amount of code to write.
The cons are:
You must know XSLT and know how to implement it in Java.
You must write XSLT (and it's torture for many developers).
When transforming XML to HTML with XSLT, some parts may be tricky. Few examples: <textarea/> tags (which make the page unusable), XML declaration (which can cause problems with IE), whitespace (with <pre></pre> tags etc.), HTML entities ( ), etc.
The performance will be reduced, since serialization to XML wastes lots of CPU resources and XSL transformation is very costly too.
Now, if your HTML is very short or very repetitive or if the HTML has a volatile structure which changes dynamically, this approach must not be taken in account. On the other hand, if you serve HTML files which have all a similar structure and you want to reduce the amount of Java code and use templates, this approach may work.
I had also problems in finding something simple to satisfy my needs so I decided to write my own library (with MIT license).
It's mainly based on composite and builder pattern.
A basic declarative example is:
import static com.github.manliogit.javatags.lang.HtmlHelper.*;
html5(attr("lang -> en"),
head(
meta(attr("http-equiv -> Content-Type", "content -> text/html; charset=UTF-8")),
title("title"),
link(attr("href -> xxx.css", "rel -> stylesheet"))
)
).render();
A fluent example is:
ul()
.add(li("item 1"))
.add(li("item 2"))
.add(li("item 3"))
You can check more examples here
I also created an on line converter to transform every html snippet (from complex bootstrap template to simple single snippet) on the fly (i.e. html -> javatags)
Templates and other methods based on preliminary creation of the document in memory are likely to impose certain limits on resulting document size.
Meanwhile a very straightforward and reliable write-on-the-fly approach to creation of plain HTML exists, based on a SAX handler and default XSLT transformer, the latter having intrinsic capability of HTML output:
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("myfile.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory saxFactory =
(SAXTransformerFactory) TransformerFactory.newInstance();
TransformerHandler tHandler = saxFactory.newTransformerHandler();
tHandler.setResult(streamResult);
Transformer transformer = tHandler.getTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
writer.write("<!DOCTYPE html>\n");
writer.flush();
tHandler.startDocument();
tHandler.startElement("", "", "html", new AttributesImpl());
tHandler.startElement("", "", "head", new AttributesImpl());
tHandler.startElement("", "", "title", new AttributesImpl());
tHandler.characters("Hello".toCharArray(), 0, 5);
tHandler.endElement("", "", "title");
tHandler.endElement("", "", "head");
tHandler.startElement("", "", "body", new AttributesImpl());
tHandler.startElement("", "", "p", new AttributesImpl());
tHandler.characters("5 > 3".toCharArray(), 0, 5); // note '>' character
tHandler.endElement("", "", "p");
tHandler.endElement("", "", "body");
tHandler.endElement("", "", "html");
tHandler.endDocument();
writer.close();
Note that XSLT transformer will release you from the burden of escaping special characters like >, as it takes necessary care of it by itself.
And it is easy to wrap SAX methods like startElement() and characters() to something more convenient to one's taste...
If you are willing to use Groovy, the MarkupBuilder is very convenient for this sort of thing, but I don't know that Java has anything like it.
http://groovy.codehaus.org/Creating+XML+using+Groovy's+MarkupBuilder
if it is becoming repetitive work ; i think you shud do code reuse ! why dont you simply write functions that "write" small building blocks of HTML. get the idea? see Eg. you can have a function to which you could pass a string and it would automatically put that into a paragraph tag and present it. Of course you would also need to write some kind of a basic parser to do this (how would the function know where to attach the paragraph!). i dont think you are a beginner .. so i am not elaborating ... do tell me if you do not understand..
Try the ujo-web library, which supports building HTML pages using the Element class. Here is a sample use case based on a Java servlet:
#WebServlet("/form-servlet")
public class FormServlet extends HttpServlet {
#Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws IOException {
try (HtmlElement html = HtmlElement.niceOf(response, "/style.css")) {
try (Element body = html.addBody()) {
body.addHeading("Simple form");
try (Element form = body.addForm("form-inline")) {
form.addLabel("control-label").addText("Note:");
form.addInput("form-control", "col-lg-1")
.setNameValue(NOTE, NOTE.of(request));
form.addSubmitButton("btn", "btn-primary")
.addText("Submit");
}
}
}
}
enum Attrib implements HttpParameter {
NOTE;
#Override public String toString() {
return name().toLowerCase();
}
}
}
The servlet generates the next HTML code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8"/>
<title>Demo</title>
<link href="/css/regexp.css" rel="stylesheet"/>
</head>
<body>
<h1>Simple form</h1>
<form class="form-inline">
<label class="control-label">Note:</label>
<input class="form-control col-lg-1" name="note" value=""/>
<button class="btn btn-primary" type="submit">Submit</button>
</form>
</body>
</html>
See more information here: https://ujorm.org/www/web/
I have some HTML code that I store in a Java.lang.String variable. I write that variable to a file and set the encoding to UTF-8 when writing the contents of the string variable to the file on the filesystem. I open up that file and everything looks great e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a jsp page to render content in a browser, characters such as → show up as a question mark (?)
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes[], charset)
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.
but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%# page pageEncoding="UTF-8" %>
This has actually the same effect as setting the following meta tag in HTML <head>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.
The lazy developer (=me) uses Apache Common Lang StringEscapeUtils.escapeHtml http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String) which will help you handle all 'odd' characters. Let the browser do the final translation of the html entities.
i am using an HTML parser called HTMLCLEANER to parse HTML page
the problem is that each page has a different encoding than the other.
my question
Can i change from any character encoding to UTF-8?
You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. Just check the HTTP response header which encoding it is using (if you're obtaining those HTML pages by HTTP) and then use the appropriate encoding in your HTML parser tool.
Where do you get the HTML page from? If you get it from the servlet request, you can use getReader() on it and pass that to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it by http client, you need to check the reponse header Content-Type using getResponseCharSet().
Can i change from any character
encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
There might be a problem when changing the encoding of HTML pages: if the page contains an "charset" Meta-Tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
public void arreglarString(String cadena) {
for (int i = 161; i < 256; i++) {
char car = (char) i;
cadena = cadena.replaceAll(car + "", "&#" + i);
}
return cadena;
}