Java UTF-8 encoding not set to URLConnection

I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains symbols like this: /ælˈdʒɪəriə/.
When I fetch the page source, I get text with sequences like &#250; instead.
So far I've tried the following code:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}
try {
urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
while (null != ((str = input.readLine())))
{
strB.append(str);
}
input.close();
} catch (IOException e) { e.printStackTrace(); }

The HTML page is in UTF-8 and could contain Arabic characters and the like directly. However, the characters above code point 127 are still encoded in the page as numeric entities like &#250;. Setting an Accept-Charset header will not help, and loading the page as UTF-8 is entirely right.
You have to decode the entities yourself. Something like:
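String decodeNumericEntities(String s) {
    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, ""); // copy the text before the entity, drop the entity itself
        sb.appendCodePoint(uc);      // append the character the entity stands for
    }
    m.appendTail(sb);
    return sb.toString();
}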
By the way, those entities may stem from processed HTML form input, that is, from the editing side of the web app.
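If the page also uses hexadecimal references such as &#xFA;, the same idea extends a little. The following is only a sketch under that assumption: the class name NumericEntities and the method name decode are made up for illustration, and references that do not map to a valid code point are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntities {
    // Matches decimal (&#250;) and hexadecimal (&#xFA;) numeric character references.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    static String decode(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = ENTITY.matcher(s);
        while (m.find()) {
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // too many digits to be a code point
            }
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // keep invalid references as they are
            // quoteReplacement guards against '$' and '\' in the decoded text.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decodes the entity form of the pronunciation string from the question.
        System.out.println(decode("/&#230;l&#712;d&#658;&#618;&#601;ri&#601;/"));
    }
}
```

Named entities like &amp;eacute; are not covered here; for those a lookup table or an HTML library would be needed.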
Regarding the code in the question:
I have replaced the DataInputStream with a (Buffered)Reader, which is meant for text. InputStreams read binary data (bytes); Readers read text (chars and Strings). An InputStreamReader takes an InputStream and an encoding as parameters and is itself a Reader.
try {
BufferedReader input = new BufferedReader(
new InputStreamReader(urlConn.getInputStream(), "UTF-8"));
StringBuilder strB = new StringBuilder();
String str;
while (null != (str = input.readLine())) {
strB.append(str).append("\r\n");
}
input.close();
} catch (IOException e) {
e.printStackTrace();
}

Also try adding a User-Agent header to your URLConnection:
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");
This solved my decoding problem like a charm.

Well, I think the problem is in how you read from the stream. DataInputStream.readLine is deprecated and turns each byte into one character, so multibyte UTF-8 sequences get mangled. (readUTF is not an alternative here: it expects Java's length-prefixed "modified UTF-8" format, not plain text from a web server.) What I would do is create an InputStreamReader with the encoding set, then read from a BufferedReader line by line (this would go inside your existing try/catch):
Charset charset = Charset.forName("UTF-8");
InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
BufferedReader reader = new BufferedReader(stream);
StringBuffer responseBuffer = new StringBuffer();
String read = "";
while ((read = reader.readLine()) != null) {
responseBuffer.append(read);
}

Related

Http response code 429 while reading HTML

In Java I want to read and save all the HTML from a URL (Instagram), but I am getting error 429 (Too Many Requests). I think it is because I am trying to read more lines than the request limit allows.
StringBuilder contentBuilder = new StringBuilder();
try {
URL url = new URL("https://www.instagram.com/username");
URLConnection con = url.openConnection();
InputStream is =con.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(is));
String str;
while ((str = in.readLine()) != null) {
contentBuilder.append(str);
}
in.close();
} catch (IOException e) {
log.warn("Could not connect", e);
}
String html = contentBuilder.toString();
And the error is:
Could not connect
java.io.IOException: Server returned HTTP response code: 429 for URL: https://www.instagram.com/username/
It also shows that the error occurs at this line:
InputStream is =con.getInputStream();
Does anybody have an idea why I get this error and/or what to do to solve it?

The problem might have been caused by the connection not being closed/disconnected.
For the input, try-with-resources is useful too, as it closes the reader automatically, even on an exception or return. Also, you constructed an InputStreamReader that uses the default encoding of the machine the application runs on, but you need the charset of the URL's content.
readLine returns the line without its line ending (which in general is very useful), so add one yourself.
StringBuilder contentBuilder = new StringBuilder();
try {
    URL url = new URL("https://www.instagram.com/username");
    // disconnect() is defined on HttpURLConnection, not on URLConnection.
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = in.readLine()) != null) {
            contentBuilder.append(line).append("\r\n");
        }
    } finally { // The reader "in" has already been closed at this point.
        con.disconnect();
    }
} catch (IOException e) {
    log.warn("Could not connect", e);
}
String html = contentBuilder.toString();

Download part of a URL's content to save bandwidth

I have a huge text file online. I know how to fetch the data from the URL; an example would be something like this:
URL url = new URL(address);
urlConnection = (HttpURLConnection) url.openConnection();
int responseCode = urlConnection.getResponseCode();
if(responseCode == HttpURLConnection.HTTP_OK) {
InputStream stream = urlConnection.getInputStream();
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream));
StringBuilder builder = new StringBuilder();
try {
for (String line; (line = bufferedReader.readLine()) != null;)
builder.append(line);
response = builder.toString();
} catch (IOException e) {
e.printStackTrace();
}
}
This file gets updated every X minutes and a new line is added at the bottom, so the real info is only in the last line or lines. I was wondering if it would be possible to download only that part, to save bandwidth.
Edit: I am looking for a client-side solution, without modifying the server.
Thank you very much.

Android HTTP Request Encoding

I want to do an HTTP request in my Android app, using the following code:
BufferedReader in = null;
try {
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet();
request.setURI(new URI("http://www.example.de/example.php"));
HttpResponse response = client.execute(request);
in = new BufferedReader
(new InputStreamReader(response.getEntity().getContent()));
StringBuffer sb = new StringBuffer("");
String line = "";
String NL = System.getProperty("line.separator");
while ((line = in.readLine()) != null) {
sb.append(line + NL);
}
in.close();
String page = sb.toString();
System.out.println(page);
return page;
} finally {
if (in != null) {
try {
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
The web page I'm calling is a PHP script which returns a string. My problem is that the special characters (ä, ü, ö, € etc.) are shown as a question mark in a box. How can I get these characters?
I think it's a problem with the encoding (German app -> UTF-8?).

Maybe you could try setting the encoding when printing to the console. Sometimes characters are returned correctly from the server but fail to display in the console.
String page = sb.toString();
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(page);

I have played around with your code, against http://www.google.de.
I was able to "hack" something, not sure it's the most elegant solution though.
After the line:
HttpResponse response = client.execute(request);
... I've added:
String charset = null; // declared outside the if so it is still in scope further down
HttpEntity e = response.getEntity();
Header ct = e.getContentType(); // may be null if the server sent no Content-Type
HeaderElement[] he = ct != null ? ct.getElements() : new HeaderElement[0];
if (
    he.length > 0
    && he[0].getParameters().length > 0
    && he[0].getParameter(0) != null
    && he[0].getParameter(0).getName().equals("charset")
) {
    charset = he[0].getParameter(0).getValue();
    // with google.de, will print ISO latin ("ISO-8859-1")
    Log.d("com.example.test", charset);
}
... then you can add the charset name, or its Java equivalent, as the second argument of your InputStreamReader constructor call:
in = new BufferedReader(
    new InputStreamReader(
        response.getEntity().getContent(),
        charset != null ? charset : "UTF-8"
    )
);
Let me know if that works out for you.
Also note that in order to check Java charset equivalences, you could use Charset.forName(String charsetName) and catch the relevant Exceptions (and then revert to Charset.defaultCharset() or UTF-8, etc. in your catch statement).
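The lookup-with-fallback idea could be sketched roughly like this. The class CharsetLookup and the method charsetOrDefault are hypothetical names chosen for the example; only Charset.forName and the two exceptions it documents come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetLookup {
    // Resolves a charset name sent by a server, falling back when the name
    // is missing, syntactically illegal, or unknown to this JVM.
    static Charset charsetOrDefault(String name, Charset fallback) {
        if (name == null) {
            return fallback; // Charset.forName(null) would throw IllegalArgumentException
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrDefault("ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOrDefault("no-such-charset", StandardCharsets.UTF_8));
    }
}
```

The resulting Charset can be passed straight to the InputStreamReader(InputStream, Charset) constructor, which avoids dealing with UnsupportedEncodingException at the call site.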

Cannot get URL content as UTF-8

I'm trying to read content from a URL, but it returns strange symbols instead of "è", "à", etc.
This is the code I'm using:
public static String getPageContent(String _url) {
URL url;
InputStream is = null;
BufferedReader dis;
String line;
String text = "";
try {
url = new URL(_url);
is = url.openStream();
//This line should open the stream as UTF-8
dis = new BufferedReader(new InputStreamReader(is, "UTF-8"));
while ((line = dis.readLine()) != null) {
text += line + "\n";
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe) {
// nothing to see here
}
}
return text;
}
I saw other questions like this, and all of them were answered with something like:
Declare your InputStreamReader as
new InputStreamReader(is, "UTF-8")
But I can't get it to work.
For example, if my url content contains
è uno dei più
I get
Ã¨ uno dei piÃ¹
What am I missing?

Judging by your example, you do receive a multibyte UTF-8 byte stream, but your text editor interprets it as ISO-8859-1. Tell your editor to read the bytes as UTF-8!
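To see that this is what happens, here is a small sketch that reproduces the garbling from the question by encoding the text as UTF-8 and then decoding those bytes as ISO-8859-1. The class and method names are invented for the demonstration, and the strings use Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Simulates a viewer that shows UTF-8 bytes as if they were ISO-8859-1.
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00e8 uno dei pi\u00f9"; // "è uno dei più"
        String garbled = misreadUtf8AsLatin1(text);
        // Each accented letter became two characters: U+00C3 followed by U+00A8 or U+00B9.
        System.out.println(garbled.length() - text.length()); // prints 2
        System.out.println(repair(garbled).equals(text));     // prints true
    }
}
```

Two garbled characters per accented letter is the typical sign that the bytes are fine and only the viewer's encoding is wrong.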

I don't really know why this does not work. However, the Java 7 way would be to use StandardCharsets.UTF_8, see
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
together with the constructor InputStreamReader(InputStream in, Charset cs), see
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html.
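A minimal sketch of that variant follows. It reads from a ByteArrayInputStream instead of a real URL so it runs without a network; Utf8Reading and readAll are names made up for the example, and in the question's code the stream would come from url.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8Reading {
    // Reads the whole stream as UTF-8 text, ending every line with '\n'.
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        // The Charset overload cannot throw UnsupportedEncodingException.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a server would send for "è uno dei più".
        byte[] body = "\u00e8 uno dei pi\u00f9".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(body)));
    }
}
```

The try-with-resources block also closes the reader, and with it the underlying stream, even when reading fails.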

UTF-8 encoding in JLabel on Windows

I have a problem with encoding in a JLabel on Windows (on *nix OSes everything is okay).
Here's an image: http://i.imgur.com/DEkj3.png (the problematic character is the L with ` on the top, it should be ł) and here the code:
public void run()
{
URL url;
HttpURLConnection conn;
BufferedReader rd;
String line;
String result = "";
try {
url = new URL(URL);
conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = rd.readLine()) != null) {
result += line;
}
rd.close();
} catch (Exception e) {
try
{
throw e;
}
catch (Exception e1)
{
Window.news.setText("");
}
}
Window.news.setText(result);
}
I've tried Window.news.setText(new String(result.getBytes(), "UTF-8"));, but it hasn't helped. Maybe I need to run my application with specified JVM flags?

You are breaking the data before it gets to the window when you use new InputStreamReader with no explicit charset. This will use the platform default charset, which is probably cp1252 on Windows, hence your broken characters.
If you know the charset of the data you are reading, you should specify it explicitly, e.g.:
new InputStreamReader(conn.getInputStream(), "UTF-8")
When downloading data from an arbitrary URL, however, you should probably prefer the charset in the 'Content-Type' header, if present.
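Picking the charset out of a Content-Type value could look roughly like the sketch below. The helper fromContentType and its class are hypothetical, and the parsing is deliberately simple (split on ';', look for a charset= parameter), so it is not a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    // Extracts the charset parameter from a header value such as
    // "text/html; charset=ISO-8859-1"; returns the fallback if there is none.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1); // drop optional quotes
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/plain", StandardCharsets.UTF_8));
    }
}
```

With the code from the question, the header value would come from conn.getContentType(), and the result would be passed as the second argument to the InputStreamReader constructor.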
