Java GB2312 string in HTML does not display correctly - java

I am trying to read HTML from Chinese websites and extract their <title> value. Websites encoded in UTF-8 work fine, but GB2312 websites do not (for example, m.39.net shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).
Here is the code I use to accomplish that:
URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);

String content = IOUtils.toString(inputStream, "GB2312"); should help.
If you want to detect the charset of a webpage, there are three ways as far as I know:
use connection.getContentType() to read the charset parameter of the HTTP Content-Type header (connection.getContentEncoding() returns the Content-Encoding header, such as gzip, not the charset);
parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (you have to download the HTML content first and then read the first several lines);
use third-party libraries, e.g. those mentioned in this question.
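A minimal sketch of the first option, assuming the server sends a Content-Type header with a charset parameter. The class name ContentTypeCharset and the method fromContentType are made up for this illustration; only the java.nio.charset and java.util.regex calls are standard API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class ContentTypeCharset {
    // Matches the charset parameter of a Content-Type value, with or without quotes.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

With the code from the question, this could be used as IOUtils.toString(inputStream, ContentTypeCharset.fromContentType(connection.getContentType(), StandardCharsets.UTF_8)). Many servers omit the charset parameter, in which case the fallback is only a guess and the <meta> tag still has to be checked.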

Have you seen the IOUtils documentation at http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html? It has an overload that takes the encoding:
toString(InputStream input, String encoding)

Related

API design for Java String results which contain charset-specific data

In the API of a document converter, which generates HTML (or XHTML), I want to expose these methods:
// Convert the input file to a file using the specified charset
void convert(File in, File out, Charset charset);
// Convert the input document to a string using the specified charset
String convert(String in, Charset charset);
There is no way for client code to produce faulty documents with the file-based method: it safely writes a result document with the specified charset.
The String-based method will obviously lead to problems if the client code does not respect the chosen charset, for example if the charset parameter is ISO-8859-1 but the result String is served as UTF-8 content in a web application:
String html = convert(getInputDocument(), ISO_8859_1);
...
response.setContentType("text/html;charset=UTF-8");
response.setCharacterEncoding("UTF-8");
try (PrintWriter out = response.getWriter()) {
out.print(html);
}
Question: which options should I consider to design the API so that users are guided to correct usage of the result string?
deprecate the method and provide a method which returns a byte array
use method names which contain the encoding (convertToUTF_8, convertToISO_8859_1 ...)
The result string could for example be
<!DOCTYPE html>
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Untitled document</title>
</head>
<body>
<p>Motörhead</p>
</body>
</html>
I don't know your exact use case, but one possibility is to wrap the document in a proper object (instead of exposing it as a plain String):
public interface Document {
void writeTo(ServletResponse response);
}
This way you can retain all control of how that "string" can be written to different targets.
I'm not sure whether you need a convert method at all, since the document could automatically convert its content if it sees that the response already has a different encoding. But even if you need one, you could do it this way:
public interface Document {
void writeTo(ServletResponse response);
Document convert(Charset targetCharset);
}
This would return a new document which is of a different charset.
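One way this idea could look in practice is sketched below. It is only an illustration: the class name EncodedDocument and all of its methods are hypothetical, and it writes to a plain OutputStream instead of a ServletResponse so that it compiles without the servlet API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

/** Hypothetical value type: the encoded bytes and their charset always travel together. */
public final class EncodedDocument {
    private final byte[] bytes;
    private final Charset charset;

    public EncodedDocument(String content, Charset charset) {
        // Characters the charset cannot represent are replaced by the encoder (usually with '?').
        this.bytes = content.getBytes(charset);
        this.charset = charset;
    }

    public Charset charset() {
        return charset;
    }

    /** Number of encoded bytes. */
    public int length() {
        return bytes.length;
    }

    /** Value for a Content-Type header that matches the stored bytes. */
    public String contentType() {
        return "text/html;charset=" + charset.name();
    }

    /** Writes the bytes unchanged, so no second encoding step can disagree with the charset. */
    public void writeTo(OutputStream out) throws IOException {
        out.write(bytes);
    }

    /** Returns a new document re-encoded in the target charset. */
    public EncodedDocument convert(Charset target) {
        return new EncodedDocument(new String(bytes, charset), target);
    }
}
```

Client code would then call response.setContentType(doc.contentType()) and doc.writeTo(response.getOutputStream()) rather than printing a String through a Writer with its own encoding. Note that this sketch does not rewrite a charset declared in a <META> tag inside the content; convert would need to handle that separately.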

retrieve and display Tamil characters from mysql database to Browser

I am using Java. I can store Tamil characters in the database correctly, but when I retrieve them and display them in the browser using JSP, they appear as boxes. I use the following code to save Tamil characters in the MySQL database.
Properties pr = new Properties();
pr.put("user", "root");
pr.put("password", "root");
pr.put("characterEncoding", "UTF-8");
pr.put("useUnicode", "true");
Class.forName("com.mysql.jdbc.Driver");
connection = DriverManager.getConnection(connectionURL,pr);
I can see the Tamil characters in the database, but I can't retrieve and display them in the same format. Please help me. Thanks in advance.
Add the following to the top of your JSPs:
<%@ page pageEncoding="UTF-8" %>
It instructs the server to use UTF-8 to write the characters to the response. It also adds an HTTP response Content-Type header with a value of text/html;charset=UTF-8. This is quite different from a simple <meta> tag, which web browsers ignore when the content is served over HTTP with a charset in the Content-Type header. For debugging purposes, you can inspect the real HTTP headers with a tool such as Fiddler2 or Firebug.
That should be sufficient.
What is the character set of your JSP pages? Make sure that it is UTF-8.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
adarshr gave you the probable solution to your problem. You could also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

How to detect a web page's charset and get the page content?

I use the following code to get the page content:
URL url=new URL("http://www.google.com.hk/intl/zh-CN/privacy.html");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream()));
for(String line=reader.readLine();line!=null;line=reader.readLine()){
System.out.println(line);
}
reader.close();
The page http://www.google.com.hk/intl/zh-CN/privacy.html is encoded in "UTF-8", but my system default charset is "GBK", so this code does not print the text correctly.
I know I can pass a charset name to the InputStreamReader constructor:
new InputStreamReader(url.openConnection().getInputStream(),"UTF-8")
That works, but I want to know:
How can I detect the charset and get the page content (preferably without sending two requests)?
Is there a Java library that can do this (get the web page content without having to set the charset name)?
Thanks for the help :)
There is really no easy way to detect the proper charset. You can hope that the web page you are interested in declares the charset using a <meta charset="utf-8"> tag. When you detect that tag you could switch charset of your parsing.
There are also some libraries that make an effort to detect the charset, for example http://jchardet.sourceforge.net/.
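The <meta> approach can be done with a single request if the raw bytes are kept in memory first, for example with IOUtils.toByteArray(inputStream) or InputStream.readAllBytes() on Java 9 and later. The following is a rough sketch, not a full HTML parser: the class MetaCharsetSniffer is hypothetical, it only inspects the first 4096 bytes (an arbitrary limit), and it assumes the page uses an ASCII-compatible encoding so the tag itself is readable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class MetaCharsetSniffer {
    // Finds charset=... inside a <meta ...> tag; covers the http-equiv form and the HTML5 form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag near the start of the page, or the fallback. */
    public static Charset sniff(byte[] html, Charset fallback) {
        int limit = Math.min(html.length, 4096);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup stays readable.
        String head = new String(html, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name in the page.
            return fallback;
        }
    }

    /** Decodes the whole page with the sniffed charset. */
    public static String decode(byte[] html, Charset fallback) {
        return new String(html, sniff(html, fallback));
    }
}
```

The charset from the HTTP Content-Type header, when present, is a sensible value to pass as the fallback, or to prefer over the sniffed one.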

Java String Encoding to UTF-8

I have some HTML code that I store in a java.lang.String variable. I write that variable to a file and set the encoding to UTF-8 when writing the contents of the string variable to the file on the filesystem. When I open that file, everything looks great, e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a jsp page to render content in a browser, characters such as → show up as a question mark (?)
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes, charset);
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.
but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%@ page pageEncoding="UTF-8" %>
This sets the charset in the HTTP Content-Type response header, which is the same information the following meta tag declares in the HTML <head>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
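The question mark itself can be reproduced outside the container. The sketch below assumes the response was being written in ISO-8859-1 (the usual JSP default when no encoding is declared), which has no right arrow, so the encoder substitutes '?'. The class ArrowEncodingDemo and its roundTrip method are made up for this demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class ArrowEncodingDemo {
    /** Encodes the text with the charset and decodes it again, as a writer and a browser would. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arrow = "\u2192"; // the right arrow character
        // ISO-8859-1 cannot represent the arrow, so getBytes substitutes '?'.
        System.out.println(roundTrip(arrow, StandardCharsets.ISO_8859_1)); // prints ?
        // UTF-8 can represent every Unicode character, so the arrow survives.
        System.out.println(roundTrip(arrow, StandardCharsets.UTF_8).equals(arrow)); // prints true
    }
}
```

This is why the file written as UTF-8 looks fine while the same String rendered through a response with a different encoding does not: the String is intact, and the loss happens when it is encoded for output.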
Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.
The lazy developer (=me) uses Apache Commons Lang StringEscapeUtils.escapeHtml http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String), which will help you handle all 'odd' characters. Let the browser do the final translation of the HTML entities.

java utf-8 encoding problem

I am using an HTML parser called HTMLCLEANER to parse HTML pages.
The problem is that each page has a different encoding.
My question:
Can I change from any character encoding to UTF-8?
You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. Just check the HTTP response header which encoding it is using (if you're obtaining those HTML pages by HTTP) and then use the appropriate encoding in your HTML parser tool.
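Once the source encoding is known, the conversion itself is a decode followed by an encode. A minimal sketch, with Transcoder and toUtf8 as made-up names for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
public class Transcoder {
    /** Decodes the input with its known source charset and re-encodes it as UTF-8. */
    public static byte[] toUtf8(byte[] input, Charset source) {
        // Step 1: bytes -> String (needs the correct source charset).
        String text = new String(input, source);
        // Step 2: String -> UTF-8 bytes (UTF-8 can represent every Unicode character).
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

If the wrong source charset is passed in, step 1 silently produces garbage and step 2 faithfully encodes that garbage, which is why the encoding has to be determined first.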
Where do you get the HTML page from? If you get it from the servlet request, you can use getReader() on it and pass that to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it by HTTP client, you need to check the response header Content-Type using getResponseCharSet().
Can I change from any character encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
There might be a problem when changing the encoding of HTML pages: if the page contains a charset meta tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
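public String arreglarString(String cadena) {
// Replace each Latin-1 character from 161 to 255 with its numeric HTML entity, e.g. &#246;
for (int i = 161; i < 256; i++) {
char car = (char) i;
cadena = cadena.replace(String.valueOf(car), "&#" + i + ";");
}
return cadena;
}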
