Reading UTF-8 encoded XML from URL in java

I'm trying to read XML data from the Google weather web service. The response contains some Spanish characters, and the problem is that these characters are not displayed properly. I've tried converting everything to UTF-8, but that does not seem to help. The code is given below:
public static void main(String[] args) {
    try {
        URL url = new URL("http://www.google.com/ig/api?weather=Noja&hl=es");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                con.getInputStream(), "UTF-8"));
        String str = in.readLine();
        // this does not work either
        // String str = new String(in.readLine().getBytes("UTF-8"), "UTF-8");
        System.out.println(str);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The output is given below (trimmed to stay within the post's character limit). Notice "mi�" and "s�b":
<day_of_week data="mi�"/><day_of_week data="s�b"/><low data="11"/><high data="16"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Posibilidad de lluvia"/></forecast_conditions></weather></xml_api_reply>

If that page is XML, you should usually pass the InputStream directly to the XML parser and let it detect the encoding automatically. Otherwise, look at the charset parameter of the Content-Type response header to determine the correct encoding and create the appropriate InputStreamReader.
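As a rough illustration of letting the parser pick the encoding, here is a minimal sketch using the JDK's built-in DOM parser. The class name, the helper method and the sample document are made up for the example; the sample stands in for the server response and is deliberately encoded as ISO-8859-1, with an XML declaration that says so.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

class XmlEncodingDemo {
    /** Parses the raw bytes as XML and returns the data attribute of the first day_of_week element. */
    static String firstDayOfWeek(InputStream xml) {
        try {
            // No Reader and no charset here: the parser reads the XML declaration itself.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            Element day = (Element) doc.getElementsByTagName("day_of_week").item(0);
            return day.getAttribute("data");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample response, encoded as ISO-8859-1 (not UTF-8).
        String sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<forecast><day_of_week data=\"mi\u00e9\"/></forecast>";
        byte[] bytes = sample.getBytes(StandardCharsets.ISO_8859_1);
        String value = firstDayOfWeek(new ByteArrayInputStream(bytes));
        // Compare against the expected string instead of trusting the console.
        System.out.println("decoded correctly: " + value.equals("mi\u00e9"));
    }
}
```

With a real connection you would pass con.getInputStream() instead of the ByteArrayInputStream.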
Edit: That server is indeed responding with different encodings to the browser and the Java client, probably depending on the Accept-Charset request header. For Firefox this header has the value:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
This means both charsets are accepted. Strictly, ISO-8859-1 carries the default weight of q=1 and so ranks above utf-8 at q=0.7, yet the server responds to the browser with a Content-Type header of text/xml; charset=UTF-8. The Java client does not send this header, and the server responds with text/xml; charset=ISO-8859-1.
To use the charset supplied by the server you can use code like the following:
String contentType = con.getContentType(); // e.g. "text/xml; charset=ISO-8859-1", may be null
System.out.println(contentType);
String charset = "utf-8"; // default when the header names no charset
if (contentType != null) {
    Matcher matcher = Pattern.compile("charset\\s*=\\s*([^ ;]+)").matcher(contentType);
    if (matcher.find()) {
        charset = matcher.group(1);
    }
}
BufferedReader in = new BufferedReader(new InputStreamReader(
        con.getInputStream(), charset));
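If you want to try the header parsing without a network connection, the idea can be pulled into a small helper. This is only a sketch: the class and method names are invented for the example, and the handling of quoted values and of unknown charset names is my own addition rather than something the server requires.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Matches charset=value, with or without double quotes around the value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/xml", StandardCharsets.UTF_8));
    }
}
```

The result can be passed straight to the InputStreamReader constructor that takes a Charset.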
Edit 2: It turns out the server chooses the charset based on the User-Agent request header. If you add the following line, it responds with a charset of utf-8.
con.setRequestProperty("User-Agent", "Mozilla/5.0");
Anyway, the Content-Type response header contains the correct charset to use.

Your input may be correct, although I would use an XML parser to read the XML rather than trying to interpret it as a line-by-line feed. However, your output may be incorrect.
What's the default character encoding of your JVM? Check (and set) the confusingly named system property, e.g. -Dfile.encoding=UTF-8.
Do the requisite fonts etc. exist on your system? Can you check the actual character codes you're outputting rather than relying on your terminal settings? I suspect this is the case, since the encoding/decoding appears to work and you're just missing those individual characters.
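One way to check the character codes independently of the terminal is to print each code point in hexadecimal. The helper below is a small sketch with an invented class name. If the dump shows U+FFFD (the replacement character), the text was already damaged when it was decoded; if it shows the expected code points, only the console output is at fault.

```java
class CodePointDump {
    /** Formats each code point of the text as U+XXXX, separated by single spaces. */
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Only ASCII is printed, so the result does not depend on the console encoding.
        System.out.println(codePoints("mi\u00e9"));  // correctly decoded text
        System.out.println(codePoints("mi\uFFFD"));  // text that was damaged during decoding
    }
}
```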

Related

How to ensure that the JSON string is UTF-8 encoded in Java

I am working on a legacy web service client code where the JSON data is being sent to the web service. Recently it was found that for some requests in the JSON body, the service is giving HTTP 400 response due to invalid characters (non-UTF8) in the JSON Body.
Below is one example of the data which is causing the issue.
String value = "zu3z5eq tô‰U\f‹Á‹€z";
I am using the org.json.JSONObject.toString() method to generate the JSON string. How can I ensure that the JSON string is UTF-8 encoded?
I have already tried a few solutions available online, such as converting to a byte array and back or using the Java charset methods, but they did not work. Either they also alter valid values such as Chinese/Japanese characters, or they don't work at all.
Can you please provide some input on this?

You need to set the character encoding for OutputStreamWriter when you create it:
httpConn.connect();
wr = new OutputStreamWriter(httpConn.getOutputStream(), StandardCharsets.UTF_8);
wr.write(jsonObject.toString());
wr.flush();
Otherwise it defaults to the "platform default encoding," which is some encoding that has been used historically for text files on whatever system you are running.
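To see what the charset argument changes, you can write the same text through an OutputStreamWriter into a byte array and compare the results. This is just a sketch with an invented class name and a made-up JSON value; the ByteArrayOutputStream stands in for httpConn.getOutputStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    /** Returns the bytes an OutputStreamWriter produces for the text in the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"t\u00f4\"}"; // hypothetical body containing one non-ASCII char
        // The byte counts differ because the non-ASCII char takes two bytes in UTF-8.
        System.out.println("UTF-8 bytes:      " + encode(json, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(json, StandardCharsets.ISO_8859_1).length);
    }
}
```

A service that expects UTF-8 will reject or misread the second variant, which is what can happen when the platform default is not UTF-8.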

Use Base64 encoding to convert the value to a byte[] that contains only ASCII characters.
String value = "zu3z5eq tô‰U\f‹Á‹€z";
// WHILE SENDING, ENCODE THE VALUE
// StandardCharsets avoids the checked UnsupportedEncodingException of getBytes("UTF-8")
byte[] encodedBytes = Base64.getEncoder().encode(value.getBytes(StandardCharsets.UTF_8));
String encodedValue = new String(encodedBytes, StandardCharsets.UTF_8);
// TRANSPORT....
// ON THE RECEIVING END, DECODE THE VALUE
byte[] decodedBytes = Base64.getDecoder().decode(encodedValue.getBytes(StandardCharsets.UTF_8));
System.out.println(new String(decodedBytes, StandardCharsets.UTF_8));

Code is not translating german characters from Google Books API correctly

I have written a little app that searches Google Books and displays the data it retrieves in a neat but simple fashion. Everything works so far, but there is an issue directly at the source: although Google correctly returns German-language search results, for some reason all special German characters (probably Ä, Ö, Ü and ß) are displayed as the "�" placeholder or sometimes just "?".
I was able to confirm that the JSONObject built from the InputStream already contains those mistakes. It seems the original InputStream from Google is not being read correctly. Oddly, I pass the "UTF-8" encoding (which should cover German characters) to my InputStreamReader, but apparently to no avail.
Here is the http-request procedure I am using:
public class HttpRequest {
    public static String request(String urlString) throws IOException {
        URL url = new URL(urlString);
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(10000);
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        StringBuilder builder = new StringBuilder();
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            builder.append(inputLine);
        in.close();
        return builder.toString();
    }
}
What else could be going wrong? I checked the StringBuilder already, but the mistakes are already present in the lines read from the BufferedReader.
Also, I was unable to find any language- or encoding-specific settings in the official Google Books API guide, so I assume the responses come in a universal encoding. But then the "UTF-8" setting should handle them, shouldn't it?

The easiest approach is to check the raw data another way, such as in a browser. Looking at a Google Books API response in the browser is simple: just open the URL and the response comes back as JSON. A JSON viewer plugin is optional and not needed for this.
For example, use this URL:
https://www.googleapis.com/books/v1/volumes?q=Latein+key=NO
Checking the HTTP headers (in the browser developer tools, for example), you can see that the response declares the expected encoding:
content-type: application/json; charset=UTF-8
Looking at the content of some German results, the German special characters are correct for some books but not for all, depending on the book in question.
Conclusion: UTF-8 is indeed correct, and for some texts the source data itself has missing or wrong German characters.

Default character encoding in java for inputStream of HTTPUrlConnection

I am using the InputStream of Java's HttpURLConnection to get the body of a URL and write it to a file.
Things work fine on my laptop (Ubuntu/CentOS desktop), but on the server (CentOS 6.5 server edition) special characters in the body get garbled into question marks.
I compared Java's Charset.defaultCharset() and System.getProperty("file.encoding"); both are the same on the laptop and the server.
Can anyone help me find out what differs between the laptop and server OS with regard to character encoding?
StringBuilder response = new StringBuilder();
URL obj = new URL("http://www.Some URL That Has spl Char (eg. EN Dash)");
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}

The response headers often state the encoding, as the charset parameter of Content-Type (connection.getContentType(), which may be null). Note that connection.getContentEncoding() returns the Content-Encoding header (compression such as gzip), not the character set. The charset is what you need for text, to convert an InputStream to a Reader (InputStreamReader) and such.
If you are using InputStream/OutputStream, you are working with binary data as-is, hence no corruption will occur. But you'll lose the header info that might have said something about the encoding. You might want to store any data with a given encoding as UTF-8 for consistency. However, in HTML the encoding may be given in the content.
On the given code:
The input is decoded using the platform default charset, which varies by platform and even by user settings.
Better to use an explicit encoding.
// Nice if the connection names a charset in its headers,
// as in Content-Type: ...; charset=...
String encoding = null;
String contentType = con.getContentType();
if (contentType != null) {
    Matcher m = Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE).matcher(contentType);
    if (m.find()) {
        encoding = m.group(1);
    }
}
if (encoding == null) {
    // Otherwise fall back: ISO-8859-1 was historically the HTTP default
    // for text, and browsers extend ISO-8859-1 to Windows-1252.
    encoding = "Windows-1252";
}
Charset charset = Charset.forName(encoding);
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), charset));
Of course, also write the contents of the StringBuilder to the output medium with the right encoding.
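For the writing side, here is a minimal sketch of saving text with an explicit encoding so the result no longer depends on the machine's default. The class name, the method names and the temp-file round trip are made up for the demonstration; in real code the target path would be your output file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitFileEncoding {
    /** Writes the text as UTF-8 regardless of the platform default charset. */
    static void writeUtf8(Path target, String text) {
        try (BufferedWriter w = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the text to a temp file and returns the raw bytes that ended up on disk. */
    static byte[] roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("body", ".txt");
            try {
                writeUtf8(tmp, text);
                return Files.readAllBytes(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "2010", an EN DASH (U+2013), then "2012": the dash takes three bytes in UTF-8.
        System.out.println(roundTrip("2010\u20132012").length + " bytes written");
    }
}
```

If the bytes on disk are correct but the file still looks wrong, the tool used to view it on the server is probably decoding it with a different charset.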

How to parse an XML file containing BOM?

I want to parse an XML file from URL using JDOM. But when trying this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I thought this might be the BOM issue. So I checked the source and saw the BOM in the beginning of the file. I tried reading from URL using aUrl.openStream() and removing the BOM with Commons IO BOMInputStream. But to my surprise it didn't detect any BOM.
I tried reading from the stream and writing to a local file and parse the local file. I set all the encodings for InputStreamReader and OutputStreamWriter to UTF8 but when I opened the file it had crazy characters.
I thought the problem is with the source URL encoding. But when I open the URL in browser and save the XML in a file and read that file through the process I described above, everything works fine.
I appreciate any help on the possible cause of this issue.

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:
builder.build(new GZIPInputStream(aUrl.openStream()));
Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:
private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    final String contentEncoding = cxn.getContentEncoding();
    if (contentEncoding == null)
        return cxn.getInputStream();
    else if (contentEncoding.equalsIgnoreCase("gzip")
            || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}
(warning: not tested) and then use:
builder.build(openStream(aUrl));
This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.
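The same decision can be tried offline by compressing some bytes in memory and feeding them back through the check. The following is only a sketch: the class and helper names are invented, and the byte arrays stand in for the server response and its Content-Encoding header value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipBodyDemo {
    /** Wraps the body in a GZIPInputStream when the Content-Encoding value is gzip or x-gzip. */
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return body;
        }
        if (contentEncoding.equalsIgnoreCase("gzip") || contentEncoding.equalsIgnoreCase("x-gzip")) {
            return new GZIPInputStream(body);
        }
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
    }

    /** Demo helper: compresses bytes the way a gzip-enabled server would. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decodes the body according to contentEncoding and returns it as UTF-8 text. */
    static String readUtf8(byte[] body, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1;) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<day_of_week data=\"mi\u00e9\"/>".getBytes(StandardCharsets.UTF_8);
        // The gzipped and the plain body should decode to the same text.
        System.out.println(readUtf8(gzip(xml), "gzip").equals(readUtf8(xml, null)));
    }
}
```

Feeding gzipped bytes to a parser without the wrapper is what produces errors like the "Invalid byte 1 of 1-byte UTF-8 sequence" in the question, since the compressed data is not valid UTF-8 text.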

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding.". That seems to be occurring here.

URL encoding for latin characters in Java

I'm trying to read in an image URL. As mentioned in the java documentation, I tried converting the URL to URI by
String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
I get a java.io.FileNotFoundException for the file
http://www.shefinds.com/files/Christian-Louboutin-DÃ©colletÃ©-100-pumps.jpg
What am I doing wrong and what is the right way to encode this URL?
Update:
I'm using ROME to read in RSS feeds. Taking suggestions from BalusC, I have printed out the raw input at different stages, and it seems that the ROME RSS parser is using ISO-8859-1 instead of UTF-8.

Works fine here (returns a 403, it's at least not a 404):
URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();
When I fix it so that it doesn't return a 403, the picture is correctly retrieved:
URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
// try-with-resources closes both streams even if copying fails
try (InputStream input = connection.getInputStream();
        OutputStream output = new FileOutputStream("/pic.jpg")) {
    for (int data = 0; (data = input.read()) != -1;) {
        output.write(data);
    }
}
So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.
Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The transition of é to Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it using ISO-8859-1 instead of UTF-8.
Update: or maybe you've actually hardcoded it in the Java source code and are saving the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files as UTF-8, and -Dfile.encoding also defaults to UTF-8 here, which would explain why it works on my machine ;)
Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid such unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII chars with Unicode escapes.
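The é to Ã© transition is easy to reproduce, which makes it a handy fingerprint for this kind of bug. Below is a small sketch (the class and method names are invented) that encodes a string as UTF-8 and then deliberately decodes it as ISO-8859-1. The string literal uses Unicode escapes, so the example itself does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Decodes the UTF-8 bytes of the text as ISO-8859-1, reproducing the wrong-charset mistake. */
    static String misreadUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "D\u00e9collet\u00e9"; // Unicode escapes instead of literal non-ASCII chars
        String garbled = misreadUtf8AsLatin1(name);
        // Each accented char became two chars: U+00C3 followed by U+00A9.
        System.out.println(name.length() + " chars became " + garbled.length());
        System.out.println(garbled.contains("\u00c3\u00a9"));
    }
}
```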

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the hexadecimal value of the character's byte.
If anything, you can escape 'é' as '%E9', but this relies on the server interpreting it as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.
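For comparison, java.net.URI can produce a pure-ASCII form by itself: its toASCIIString() method percent-encodes non-ASCII characters using their UTF-8 bytes, so 'é' becomes %C3%A9 rather than %E9. The sketch below uses an invented class and helper name. Whether a given server expects UTF-8 or ISO-8859-1 escapes is something you would still have to check.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PercentEncodeDemo {
    /** Builds a URI from its parts and returns it with non-ASCII and illegal chars percent-encoded. */
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            // The multi-argument constructor quotes illegal characters such as spaces;
            // toASCIIString() additionally escapes non-ASCII characters as UTF-8 bytes.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAsciiUrl("http", "www.shefinds.com",
                "/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg"));
    }
}
```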

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then repaste the URL.
