UTF-8 encoding in JLabel on Windows

I have a problem with encoding in a JLabel on Windows (on *nix OSes everything is okay).
Here's an image: http://i.imgur.com/DEkj3.png (the problematic character is the L with ` on top; it should be ł), and here is the code:
public void run()
{
    URL url;
    HttpURLConnection conn;
    BufferedReader rd;
    String line;
    String result = "";
    try {
        url = new URL(URL);
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        while ((line = rd.readLine()) != null) {
            result += line;
        }
        rd.close();
    } catch (Exception e) {
        try
        {
            throw e;
        }
        catch (Exception e1)
        {
            Window.news.setText("");
        }
    }
    Window.news.setText(result);
}
I've tried Window.news.setText(new String(result.getBytes(), "UTF-8"));, but it hasn't helped. Maybe I need to run my application with specific JVM flags?

You are breaking the data before it gets to the window when you use new InputStreamReader with no explicit charset. This uses the platform default charset, which is probably cp1252 on Windows, hence your broken characters.
If you know the charset of the data you are reading, you should specify it explicitly, e.g.:
new InputStreamReader(conn.getInputStream(), "UTF-8")
When downloading data from an arbitrary URL, however, you should probably prefer the charset given in the 'Content-Type' header, if present.
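One way to follow the Content-Type advice is sketched below. The class ContentTypeCharset and the helper charsetOf are hypothetical names that do not appear in the question or the answer, and the parsing is deliberately simple: it only looks for a charset parameter and falls back to a caller-supplied default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=ISO-8859-1". Returns the fallback when the
    // header is missing, has no charset parameter, or names a charset that
    // this JVM does not know.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String key = "charset=";
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                String name = p.substring(key.length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the header value would come from conn.getContentType(), and the result would be passed as the second argument of new InputStreamReader(conn.getInputStream(), charset).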

Related

Http response code 429 while reading HTML

In Java I want to read and save all the HTML from a URL (Instagram), but I am getting error 429 (Too Many Requests). I think it is because I am trying to read more lines than the request limit allows.
StringBuilder contentBuilder = new StringBuilder();
try {
    URL url = new URL("https://www.instagram.com/username");
    URLConnection con = url.openConnection();
    InputStream is = con.getInputStream();
    BufferedReader in = new BufferedReader(new InputStreamReader(is));
    String str;
    while ((str = in.readLine()) != null) {
        contentBuilder.append(str);
    }
    in.close();
} catch (IOException e) {
    log.warn("Could not connect", e);
}
String html = contentBuilder.toString();
The error is:
Could not connect
java.io.IOException: Server returned HTTP response code: 429 for URL: https://www.instagram.com/username/
It also shows that the error occurs on this line:
InputStream is =con.getInputStream();
Does anybody have an idea why I get this error and/or what to do to solve it?

The problem might have been caused by the connection not being closed/disconnected.
For the input, try-with-resources is useful too: it closes the reader automatically, even on an exception or return. Also, you constructed an InputStreamReader that uses the default encoding of the machine where the application runs, but you need the charset of the URL's content.
readLine returns the line without its line ending (which in general is very useful), so add one.
StringBuilder contentBuilder = new StringBuilder();
try {
    URL url = new URL("https://www.instagram.com/username");
    // disconnect() is declared on HttpURLConnection, not on URLConnection.
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
            contentBuilder.append(line).append("\r\n");
        }
    } finally { // The reader "in" is already closed when this runs.
        con.disconnect();
    }
} catch (IOException e) {
    log.warn("Could not connect", e);
}
String html = contentBuilder.toString();

Cannot get URL content as UTF-8

I'm trying to read content from a URL, but it returns strange symbols instead of "è", "à", etc.
This is the code I'm using:
public static String getPageContent(String _url) {
    URL url;
    InputStream is = null;
    BufferedReader dis;
    String line;
    String text = "";
    try {
        url = new URL(_url);
        is = url.openStream();
        //This line should open the stream as UTF-8
        dis = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        while ((line = dis.readLine()) != null) {
            text += line + "\n";
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
    return text;
}
I saw other questions like this, and all of them were answered along the lines of:
Declare your input stream as
new InputStreamReader(is, "UTF-8")
But I can't get it to work.
For example, if my URL content contains
è uno dei più
I get
Ã¨ uno dei piÃ¹
What am I missing?

Judging by your example, you do receive a multibyte UTF-8 byte stream, but your text editor reads it as ISO-8859-1. Tell your editor to read the bytes as UTF-8!
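The symptom in the question can be reproduced in isolation. The sketch below (MojibakeDemo and misread are illustrative names, not from the thread) encodes the sample text as UTF-8 and then decodes those bytes as ISO-8859-1, which is what a viewer with the wrong encoding setting effectively does. Unicode escapes are used so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes the same bytes as ISO-8859-1,
    // imitating a program or editor that assumes the wrong charset.
    public static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e8" is è and "\u00f9" is ù; "\u00c3\u00a8" is Ã¨ and "\u00c3\u00b9" is Ã¹.
        String original = "\u00e8 uno dei pi\u00f9";
        String garbled = "\u00c3\u00a8 uno dei pi\u00c3\u00b9";
        System.out.println(misread(original).equals(garbled)); // prints true
    }
}
```

Each accented letter takes two bytes in UTF-8, and ISO-8859-1 maps every byte to one character, which is why one letter turns into two symbols.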

I don't really know why this does not work. However, the Java 7 way would be to use StandardCharsets.UTF_8, see
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
together with the constructor InputStreamReader(InputStream in, Charset cs), see
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html

Encoding in exported JAR

I have a problem with encoding in Java. I have set the encoding in Eclipse to UTF-8. When I run my app from Eclipse everything is OK, but when I export it to a JAR and run it by double-clicking, I get ???? characters. When I run it from the command line (java -jar app.jar) everything is OK. The problem is with data downloaded from another site (the site is UTF-8 encoded). What's the solution?
EDIT:
On all platforms, when I run from the command line the defaultEncoding() is UTF-8. But when I run by double-clicking:
Mac: US-ASCII
Windows: windows-1250
I have written an encoding method, but it is still not working:
public String getPageContent(String url) throws MalformedURLException, IOException
{
    URL urlReader;
    InputStream response = null;
    BufferedReader reader;
    String pageContent = "";
    urlReader = new URL(url);
    response = urlReader.openStream();
    reader = new BufferedReader(new InputStreamReader(response));
    for (String line; (line = reader.readLine()) != null;) {
        pageContent += this.encode(line, "UTF-8");
    }
    reader.close();
    return pageContent;
}
public String encode(String s, String charset)
{
    try {
        byte[] b = s.getBytes(charset);
        s = new String(b, charset);
        return s;
    } catch (UnsupportedEncodingException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    return s;
}

You need to specify the UTF-8 character set when you construct the InputStreamReader.
reader = new BufferedReader(new InputStreamReader(response, "UTF-8"));
You shouldn't be trying to re-encode strings after having received them at all.
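To illustrate why re-encoding afterwards cannot help, here is a small sketch under stated assumptions: ReencodeDemo, roundTrip and decodeAs are illustrative names, roundTrip mirrors what the question's encode method does, and US-ASCII is used as the wrong default because the question reports it for the double-clicked JAR on Mac.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {

    // Same idea as the question's encode(): string -> UTF-8 bytes -> string.
    public static String roundTrip(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Decodes raw bytes with the given charset, as an InputStreamReader would.
    public static String decodeAs(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // "\u0142" is ł; these are the bytes a UTF-8 site would send for it.
        byte[] wire = "\u0142".getBytes(StandardCharsets.UTF_8);

        String wrong = decodeAs(wire, StandardCharsets.US_ASCII); // wrong default
        String right = decodeAs(wire, StandardCharsets.UTF_8);    // explicit charset

        System.out.println(right.equals("\u0142"));         // true
        System.out.println(wrong.equals("\u0142"));         // false: already damaged
        System.out.println(roundTrip(wrong).equals(wrong)); // true: nothing repaired
    }
}
```

The damage happens at the moment the bytes are decoded with the wrong charset, so the only place to fix it is the InputStreamReader constructor.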

See "Setting the default Java character encoding?", an existing thread that discusses this in more detail. Hope this helps.

How to read compressed HTML page with Content-Encoding : gzip

I request a web page that sends a Content-Encoding: gzip header, but I got stuck on how to read it.
My code:
try {
    URLConnection connection = new URL("http://jquery.org").openConnection();
    String html = "";
    BufferedReader in = null;
    connection.setReadTimeout(10000);
    in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null){
        html+=inputLine+"\n";
    }
    in.close();
    System.out.println(html);
    System.exit(0);
} catch (IOException ex) {
    Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
}
The output looks very messy (I was unable to paste it here; it is a jumble of symbols).
I believe this is compressed content. How do I parse it?
Note:
If I change jquery.org to jquery.com (which doesn't send that header), my code works well.

Actually, this is pb2q's answer, but I post the full code for future readers
try {
    URLConnection connection = new URL("http://jquery.org").openConnection();
    String html = "";
    BufferedReader in = null;
    connection.setReadTimeout(10000);
    //The changed part
    if (connection.getHeaderField("Content-Encoding")!=null && connection.getHeaderField("Content-Encoding").equals("gzip")){
        in = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
    } else {
        in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    }
    //End
    String inputLine;
    while ((inputLine = in.readLine()) != null){
        html+=inputLine+"\n";
    }
    in.close();
    System.out.println(html);
    System.exit(0);
} catch (IOException ex) {
    Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
}

There is a class for this: GZIPInputStream. It is an InputStream and so is very transparent to use.
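As a minimal in-memory sketch of how the class wraps another stream (GzipDemo, gzip and gunzip are illustrative names, UTF-8 text is an assumption, and a ByteArrayInputStream stands in for connection.getInputStream()):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {

    // Compresses text the way a server would before sending it with
    // Content-Encoding: gzip (the text is assumed to be UTF-8).
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Wraps the compressed bytes in a GZIPInputStream, then decodes the
    // decompressed bytes as UTF-8 text.
    public static String gunzip(byte[] compressed) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body>hello</body></html>";
        System.out.println(gunzip(gzip(html)).equals(html)); // prints true
    }
}
```

The order of the wrappers matters: decompression works on bytes, so GZIPInputStream goes inside the InputStreamReader, and the charset is applied only to the decompressed bytes.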

There are two cases with the Content-Encoding: gzip header:
If the data is already compressed (by the application), the Content-Encoding: gzip header will cause the data to be compressed again, so it ends up double compressed. This is because of HTTP compression.
If the data is not compressed by the application, Content-Encoding: gzip will cause the data to be compressed (mostly with gzip), and it will be automatically uncompressed (unzipped) before it reaches the client. Unzipping is a default feature in most web browsers: the browser unzips the response if it finds the Content-Encoding: gzip header in it.

Java UTF-8 encoding not set to URLConnection

I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains signs like this: /ælˈdʒɪəriə/.
When I try to get the source of the page, I get text with signs like &#250; etc.
So far I've tried the following code:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
    url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) { e.printStackTrace(); }
try {
    urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
    input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine())))
    {
        strB.append(str);
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }

The HTML page is in UTF-8, and could use Arabic characters and such. But characters above Unicode code point 127 are still encoded as numeric entities like &#250;. An Accept-Charset header will not help, and loading as UTF-8 is entirely right.
You have to decode the entities yourself. Something like:
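// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    // Matches decimal entities such as &#250; and captures the digits.
    Matcher m = Pattern.compile("\\&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        // Copy the text before the entity and drop the entity itself...
        m.appendReplacement(sb, "");
        // ...then append the character for that code point.
        sb.appendCodePoint(uc);
    }
    m.appendTail(sb);
    return sb.toString();
}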
By the way, those entities could stem from processed HTML forms, i.e. from the editing side of the web app.
After seeing the code in the question:
I have replaced the DataInputStream with a (Buffered)Reader for text. InputStreams read binary data (bytes); Readers read text (Strings). An InputStreamReader takes an InputStream and an encoding as parameters, and is itself a Reader.
try {
    BufferedReader input = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), "UTF-8"));
    StringBuilder strB = new StringBuilder();
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n");
    }
    input.close();
} catch (IOException e) {
    e.printStackTrace();
}

Try adding also the user agent to your URLConnection:
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");
This solved my decoding problem like a charm.

Well, I think the problem is in how you read from the stream. DataInputStream.readLine is deprecated because it does not properly convert bytes to characters (readUTF would not help either, since it expects the length-prefixed modified UTF-8 format written by DataOutputStream.writeUTF rather than plain text). What I would do is create an InputStreamReader with the encoding set and wrap it in a BufferedReader, which you can then read line by line (this would go inside your existing try/catch):
Charset charset = Charset.forName("UTF8");
InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
BufferedReader reader = new BufferedReader(stream);
StringBuffer responseBuffer = new StringBuffer();
String read = "";
while ((read = reader.readLine()) != null) {
    responseBuffer.append(read);
}
