How to detect a web page's charset and get the page content?

I use the following code to get the page content:
URL url=new URL("http://www.google.com.hk/intl/zh-CN/privacy.html");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream()));
for(String line=reader.readLine();line!=null;line=reader.readLine()){
System.out.println(line);
}
reader.close();
The page http://www.google.com.hk/intl/zh-CN/privacy.html is encoded in UTF-8, but my system default charset is GBK, so this code prints garbled text.
I know I can pass a charset name to the InputStreamReader constructor:
new InputStreamReader(url.openConnection().getInputStream(),"UTF-8")
That works, but I want to know:
How can I detect the charset and get the page content, preferably without sending two requests?
Is there a Java library that can do this (fetch the page content without my having to specify the charset name)?
Thanks for the help :)

There is really no easy way to detect the proper charset. You can hope that the web page you are interested in declares the charset using a <meta charset="utf-8"> tag. Once you detect that tag, you can switch the charset used for parsing.
There are also some libraries that make an effort to detect the charset, for example http://jchardet.sourceforge.net/.
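The meta-tag approach can be sketched as follows. This is a minimal illustration, not a complete HTML parser: the class name MetaCharsetSniffer, the 1024-byte scan limit, and the regular expression are all assumptions made for this example. It decodes the start of the page as ISO-8859-1 (which maps every byte to exactly one character, so the ASCII markup survives) and looks for a charset declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class MetaCharsetSniffer {
    // Matches charset=... in both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or the fallback if none is usable. */
    static Charset sniff(byte[] page, Charset fallback) {
        // Only the first 1024 bytes are scanned (an assumed limit).
        int len = Math.min(page.length, 1024);
        // ISO-8859-1 never fails to decode, so the ASCII markup stays readable.
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
    }
}
```

Because the page bytes are already in memory when the tag is found, the same bytes can then be decoded with the detected charset, so no second request is needed.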

Related

Java GB2312 string in HTML does not display correctly

I am trying to read HTML from Chinese websites and get their <title> value. All the websites with UTF-8 encoding work fine, but GB2312 websites do not (for example, m.39.net, which shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).
Here is the code I use to accomplish that:
URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
inputStream = connection.getInputStream();
String content = IOUtils.toString(inputStream);

String content = IOUtils.toString(inputStream, "GB2312"); may help.
If you want to detect the charset of a web page, there are three ways as far as I know:
read the charset parameter of the HTTP Content-Type header via connection.getContentType() (note that connection.getContentEncoding() returns the Content-Encoding header, e.g. gzip, not the charset);
parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (you have to download the HTML content first and then read the first several lines);
use third-party libraries, e.g. those mentioned in this question.
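For the HTTP header route, the charset travels as the charset parameter of the Content-Type header, which URLConnection exposes through getContentType(). The sketch below is one possible way to read it; the class name HeaderCharset and both method names are made up for this example, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper for illustration; not part of any library.
final class HeaderCharset {
    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // A value looks like: text/html; charset=UTF-8
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // illegal or unsupported charset name
                }
            }
        }
        return null;
    }

    /** Fetches a page with a single request and decodes it using the header charset. */
    static String fetch(String address, Charset fallback) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        Charset cs = fromContentType(connection.getContentType());
        try (InputStream in = connection.getInputStream()) {
            byte[] body = in.readAllBytes(); // Java 9+
            return new String(body, cs != null ? cs : fallback);
        }
    }
}
```

When the header carries no charset, the raw bytes are still available, so a meta-tag scan or a detection library can be applied to them before falling back to a default.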

Have you seen the IOUtils documentation at http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html? It offers:
toString(byte[] input, String encoding)

Why aren't UTF-8 characters being rendered correctly in this web page (generated with JSoup)?

I'm having trouble dealing with charsets while parsing and rendering a page using the JSoup library. Here is an example of the page it renders:
http://dl.dropbox.com/u/13093/charset-problem.html
As you can see, where there should be ' characters, ? is being rendered instead (even when you view the source).
This page is being generated by downloading a web page, parsing with JSoup, and then re-rendering it again having made some structural changes.
I'm downloading the page as follows:
final Document inputDoc = Jsoup.connect(sourceURL.toString()).get();
When I create the output document I do so as follows:
outputDoc.outputSettings().charset(Charset.forName("UTF-8"));
outputDoc.head().appendElement("meta").attr("charset", "UTF-8");
outputDoc.head().appendElement("meta").attr("http-equiv", "Content-Type")
.attr("content", "text/html; charset=UTF-8");
Can anyone offer suggestions as to what I'm doing wrong?
edit: Note that the source page is http://blog.locut.us/ and as you'll see, it appears to render correctly

Question marks typically appear when you write characters to the output stream of the response that are not covered by the response's character encoding. You seem to be relying on the platform default character encoding when serving the response. The response Content-Type header of your site confirms this: it has no charset attribute.
Assuming that you're using a servlet to serve the modified HTML, then you should be using HttpServletResponse#setCharacterEncoding() to set the character encoding before writing the modified HTML out.
response.setCharacterEncoding("UTF-8");
response.getWriter().write(html);

The problem is most likely in reading the input page; you need to use the correct encoding for the source too.

Retrieve and display Tamil characters from MySQL database to browser

I am using Java. I can store Tamil characters in the database in their original form, but when I retrieve them and display them in the browser using JSP, they are displayed as boxes. I use the following code to save Tamil characters in the MySQL database.
Properties pr = new Properties();
pr.put("user", "root");
pr.put("password", "root");
pr.put("characterEncoding", "UTF-8");
pr.put("useUnicode", "true");
Class.forName("com.mysql.jdbc.Driver");
connection = DriverManager.getConnection(connectionURL,pr);
I can see the Tamil characters in the database, but I can't retrieve and display them in the same form. Please help me. Thanks in advance.

Add the following to the top of your JSPs:
<%@ page pageEncoding="UTF-8" %>
It instructs the server to use UTF-8 to write the characters to the response. It also adds an HTTP response Content-Type header with a value of text/html;charset=UTF-8. This is quite different from a plain <meta> tag, which browsers ignore when the page is served over HTTP with a charset in the Content-Type header. For debugging purposes, you can see the real HTTP headers using, for example, Fiddler2 or Firebug.
That should be sufficient.

What is the character set of your JSP pages? Make sure that it is UTF-8.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

adarshr gave you the probable solution to your problem. You could also read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Converting a String to UTF-8 format

In my Java code I have a string name = "örebro"; // it contains a Swedish character.
But when I use this name in a web application, a strange character is printed in place of the 'ö' character.
Is there any way I can display the character as it is in "örebro"?
I tried something like this, but it did not work:
String name = "örebro";
byte[] utf8s = name.getBytes("UTF-8");
name = new String(utf8s, "UTF-8");
But the name still prints the same way at the end, something like this: �rebo
Please guide me.

The Java code you've provided is pointless; it does nothing. Java Strings are already perfectly capable of representing any character (though you have to be careful with literals in the source code, as they depend on the encoding the compiler uses, which is platform-dependent).
Most likely your problem is that your web page does not declare the encoding correctly in the HTTP header or the HTML meta tags.
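To see why the round trip changes nothing, and how the garbled output can arise instead, consider this small sketch. The class and method names are invented for the example, and the literal is written with a Unicode escape so the demo does not depend on the source file encoding. It prints a boolean and a length rather than the text itself, because console output would be subject to the same encoding issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class RoundTripDemo {
    /** Encodes to UTF-8 and decodes with UTF-8 again: yields an equal string. */
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    /** Encodes to UTF-8 but decodes as ISO-8859-1: this is how mojibake appears. */
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "\u00f6rebro"; // "örebro"
        System.out.println(roundTrip(name).equals(name)); // true
        // The two UTF-8 bytes of 'ö' become two separate characters.
        System.out.println(misdecode(name).length());     // 7
    }
}
```

The damage only happens at a boundary where bytes are decoded with a charset different from the one used to encode them, for example between the server's response writer and the browser.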

You need to set the encoding of your output to UTF-8.

It is likely that the browser reading the page does not know the encoding.
Send the header before any other output, in Java something like: ServletResponse resource; (...) resource.setContentType("text/html;charset=utf-8");
In your HTML page, declare the encoding by printing <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If the page used to generate the output is a JSP, it is useful to specify
<%@ page contentType="text/html; charset=utf-8" %>

Java String Encoding to UTF-8

I have some HTML code that I store in a java.lang.String variable. I write that variable to a file and set the encoding to UTF-8 when writing the contents of the string variable to the file on the filesystem. I open up that file and everything looks great, e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a JSP page to render content in a browser, characters such as → show up as a question mark (?).
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes, charset);
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.

but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%@ page pageEncoding="UTF-8" %>
This sets the charset in the HTTP Content-Type response header. The corresponding declaration inside the HTML <head> is the following meta tag, which browsers use only when the header carries no charset:
<meta http-equiv="content-type" content="text/html; charset=utf-8">

Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.

The lazy developer (=me) uses Apache Commons Lang StringEscapeUtils.escapeHtml http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String) which will help you handle all 'odd' characters. Let the browser do the final translation of the HTML entities.
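The idea behind the entity approach can be shown with plain Java. The sketch below is not the Commons Lang implementation: EntityEscaper is a made-up name, and unlike escapeHtml it leaves ASCII markup characters such as < and & untouched and only turns non-ASCII code points into numeric character references, so the result is pure ASCII regardless of the response encoding.

```java
// Hypothetical helper for illustration; not the Commons Lang implementation.
final class EntityEscaper {
    /** Replaces every non-ASCII code point with a decimal numeric character reference. */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2192 is the right arrow from the question.
        System.out.println(escapeNonAscii("<body>\u2192</body>")); // <body>&#8594;</body>
    }
}
```

This sidesteps the encoding problem rather than fixing it, so declaring the response charset correctly is still the more robust choice.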
