Http GET of source containing non-UTF-8 characters

I solved an issue I had with retrieving and displaying non-UTF-8 characters but I don't understand why my solution works.
The following code:
final HttpClient client = new HttpClient();
final HttpMethod method = new GetMethod(urlString);
client.executeMethod(method);
final String responseBodyAsString = method.getResponseBodyAsString();
System.out.println(responseBodyAsString);
was messing up some characters on the display, such as YÃ¡Ã±ez
I changed:
final String responseBodyAsString = method.getResponseBodyAsString();
to
final ByteBuffer inputBuffer = ByteBuffer.wrap(method.getResponseBody());
final String responseBodyAsString = new String(inputBuffer.array());
and the same string as before is represented correctly as Yáñez
Why is that?

getResponseBodyAsString() uses the HTTP response's Content-Type header to know what the response body's charset is so the data can be converted to a String as needed. getResponseBody() simply returns the body's raw bytes as-is, which you are then converting to a String using the platform's default charset. Since you are able to get the desired String output by converting the raw bytes manually, that suggests to me that the HTTP server is not specifying a charset in the response's Content-Type header at all, or is specifying the wrong charset.
YÃ¡Ã±ez is what the UTF-8 encoded bytes of Yáñez look like when they are decoded as ISO-8859-1, so it is odd that the String(byte[]) constructor would be able to decode them correctly, unless the platform's default charset is actually UTF-8. It does make sense for getResponseBodyAsString() to return YÃ¡Ã±ez if the response charset used is ISO-8859-1, which is the default charset for text/... media types sent over HTTP when no charset is explicitly specified, per RFC 2616 Section 3.7.1.
I would suggest looking for a bug in the server script that is sending the data (or filing a bug report with the server admin) before suspecting a bug in getResponseBodyAsString(). You can use a packet sniffer like Wireshark, or a debugging proxy like Fiddler, to confirm the missing/invalid charset in the response Content-Type header.
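To make the effect concrete, here is a minimal sketch that decodes the same UTF-8 bytes once as ISO-8859-1 and once as UTF-8. The class and method names are made up for illustration, and the name is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes the body as ISO-8859-1, the HTTP/1.1 default when no charset is declared.
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    // Decodes the body as UTF-8, which is what the server actually sent in this scenario.
    static String decodeAsUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Yáñez" encoded as UTF-8, standing in for the raw response body.
        byte[] body = "Y\u00e1\u00f1ez".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAsLatin1(body)); // garbled: each accented letter becomes two characters
        System.out.println(decodeAsUtf8(body));   // the intended name
    }
}
```

Passing the charset explicitly, instead of relying on the platform default, is what makes the second call behave the same on every machine.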

Try the following:
private static final String UNICODE = "ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII = "AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";
public static String convertNonAscii(String str) {
    if (str == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        int pos = UNICODE.indexOf(c);
        if (pos > -1) {
            sb.append(PLAIN_ASCII.charAt(pos));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
public static void main(String[] args) {
    Pattern p = Pattern.compile("[^\\x00-\\x7E]", Pattern.CASE_INSENSITIVE);
    System.out.println(p.matcher(UNICODE).find());
    System.out.println(p.matcher(PLAIN_ASCII).find());
    System.out.println(convertNonAscii("ú or ñ"));
}
Output:
true
false
u or n

Related

inputStream and utf 8 sometimes shows "?" characters

I've been dealing with this problem for over a month now. I've checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text shows "?" characters, most likely because the site is in Hebrew.
Here's my code:
public static InputStream openHttpGetConnection(String url)
throws Exception {
InputStream inputStream = null;
HttpClient httpClient = new DefaultHttpClient();
HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
inputStream = httpResponse.getEntity().getContent();
return inputStream;
}
public static String downloadSource(String url) {
int BUFFER_SIZE = 1024;
InputStream inputStream = null;
try {
inputStream = openHttpGetConnection(url);
} catch (Exception e) {
// TODO: handle exception
}
int bytesRead;
String str = "";
byte[] inpputBuffer = new byte[BUFFER_SIZE];
try {
while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
str +=read;
}
} catch (Exception e) {
// TODO: handle exception
}
return str;
}
Thanks.

To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
    String read = new String(inputBuffer, 0, charsRead);
    str += read;
}
You can see that the bytes are read in directly as characters; it's the reader's job to know whether it needs one byte or several to create each character in the buffer. It's basically your approach, but decoding as the bytes are being read in instead of after.
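For reference, a self-contained version of the Reader approach might look like the sketch below. The class name StreamToString and the method readAll are made up for this example, and it collects into a StringBuilder rather than concatenating Strings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamToString {
    // Reads the whole stream, letting the Reader decode bytes into chars.
    // The Reader keeps incomplete multi-byte sequences until the rest arrives.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int charsRead;
            while ((charsRead = reader.read(buffer, 0, buffer.length)) != -1) {
                sb.append(buffer, 0, charsRead);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A Hebrew word encoded as UTF-8, standing in for the downloaded page.
        byte[] utf8 = "\u05e9\u05dc\u05d5\u05dd".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars read");
    }
}
```

If the text still shows "?" after this, the loss is probably happening where the String is displayed or written, not where it is read.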

Converting an InputStream to a String entails specifying an encoding, just as you do with new String(inpputBuffer, 0, bytesRead,"UTF-8").
But your approach has several drawbacks.
How do you know you have to use UTF-8?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding will be used in the HTTP response. But HTTP provides a mechanism for specifying that, using the Content-Type header.
More specifically, your response object should have a Content-Type header with a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after the charset= part to transform your bytes to chars.
Seeing you seem to use Apache HTTPClient, their documentation states :
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
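If you ever have only the raw header value in hand, pulling the charset out of it is not hard. The sketch below is a simplified, hand-rolled parser (the class and method names are hypothetical, and it does not handle every corner of the header grammar); note that the parameter is named charset in the header, and that the fallback is whatever the caller passes in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback
    // when the parameter is missing, malformed, or unknown to this JVM.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) { // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```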
Alternate way
If there is no Content-Type header, and you know your content is HTML, you can try to decode it as a String using some encoding (UTF-8 or ISO Latin preferably), look for content matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert them to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say, for example, that the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
your conversion to a String using new String(byte[], offset, length, charsetName) cannot decode the last byte, because on its own it is not a valid UTF-8 sequence; Java substitutes the replacement character U+FFFD for it.
Your following read will get what's left to read:
[a9]
which, on its own, is not an "é" character either.
Bottom line: you cannot reliably convert even a valid UTF-8 sequence to a String using your pattern.
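The two-"é" scenario is easy to reproduce. This sketch (hypothetical class and method names) decodes the same bytes once chunk by chunk and once after joining them, so the difference is visible.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkBoundaryDemo {
    // Decodes each chunk on its own, as the read loop in the question does.
    static String decodePerChunk(byte[][] chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(new String(chunk, 0, chunk.length, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Joins all chunks first and decodes once, so no character is split.
    static String decodeWhole(byte[][] chunks) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            all.write(chunk, 0, chunk.length);
        }
        return new String(all.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two "é" characters (0xC3 0xA9 each), split in the middle of the second one.
        byte[][] chunks = {
            {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3},
            {(byte) 0xA9}
        };
        System.out.println(decodePerChunk(chunks)); // contains U+FFFD replacement characters
        System.out.println(decodeWhole(chunks));    // both characters intact
    }
}
```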
Going forward: since you use HTTPClient, use its method for converting an HTTP response to a String.
If you wish to do it yourself, the easy way is to copy your input to a byte array, and then convert the byte array. Something along the lines of:
ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
byte[] chunk = new byte[1024];
int n;
while ((n = responseInputStream.read(chunk)) != -1) {
    responseContent.write(chunk, 0, n); // copy raw bytes, no decoding yet
}
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding); // decode once, at the end
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the response fully in memory) or, as @jas answers, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster if a large number of concatenations occur).

Encoding issue while reading the content of an email with JavaMail

I'm reading the messages from an email account by using JavaMail 1.4.1 (I've upgraded to 1.4.5 version but with the same problem), but I'm having issues with the encoding of the content:
POP3Message pop3message;
...
Object contentObject = pop3message.getContent();
...
String contentType = pop3message.getContentType();
String content = contentObject.toString();
Some messages are read properly, but others contain strange characters because of an unsuitable encoding. I have realized that it fails for one specific content type.
It works well if the contentType is any of these:
text/plain; charset=ISO-8859-1
text/plain;
charset="iso-8859-1"
text/plain;
charset="ISO-8859-1";
format="flowed"
text/plain; charset=windows-1252
but it doesn't if it is:
text/plain;
charset="utf-8"
For this contentType (the UTF-8 one), if I try to get the encoding (pop3message.getEncoding()), I get
quoted-printable
For this encoding, the debugger shows the following String value (the same thing I see in the database after persisting the object):
UbicaciÃ³n (instead of Ubicación)
But if I open the email with the email client in a browser it can be read without any problem, and it's a normal message (no attachments, just text), so the message seems to be OK.
Any idea about how to solve this issue?
Thanks.
UPDATE
This is the piece of code I've added to try the function getUTF8Content() given by jlordo
POP3Message pop3message = (POP3Message) message;
String uid = pop3folder.getUID(message);
//START JUST FOR TESTING PURPOSES
if(uid.trim().equals("1401")){
Object utfContent = pop3message.getContent();
System.out.println(utfContent.getClass().getName()); // it is of type String
//System.out.println(utfContent); // if not commented out, it prints the content of one of the emails I'm having problems with.
System.out.println(pop3message.getEncoding()); //prints: quoted-printable
System.out.println(pop3message.getContentType()); //prints: text/plain; charset="utf-8"
String utfContentString = getUTF8Content(utfContent); // throws java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.util.SharedByteArrayInputStream
System.out.println(utfContentString);
}
//END TEST CODE

How are you detecting that these messages have "strange characters"? Are you displaying the data somewhere? It's possible that whatever method you're using to display the data isn't handling Unicode characters properly.
The first step is to determine whether the problem is that you're getting the wrong characters, or that the correct characters are being displayed incorrectly. You can examine the Unicode values of each character in the data (e.g., in the String returned from the getContent method) to make sure each character has the correct Unicode value. If it does, the problem is with the method you're using to display the characters.
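One way to examine the values, sketched here with a made-up helper class, is to print each code point in U+XXXX form. A correctly decoded "ó" shows up as the single value U+00F3, while the garbled form shows up as U+00C3 followed by U+00B3.

```java
class CodePointDump {
    // Lists every code point of the string as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("Ubicaci\u00f3n"));       // correctly decoded
        System.out.println(describe("Ubicaci\u00c3\u00b3n")); // UTF-8 bytes decoded as ISO-8859-1
    }
}
```

If the dump already shows the two-character form, the damage happened while decoding the message, not while displaying it.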

Try this and let me know if it works:
if ( *check if utf 8 here* ) {
    content = getUTF8Content(contentObject);
}
// TODO take care of IOException and ClassCastException
public static String getUTF8Content(Object contentObject) throws IOException {
    // possible ClassCastException
    SharedByteArrayInputStream sbais = (SharedByteArrayInputStream) contentObject;
    // StandardCharsets.UTF_8 always exists, so no charset lookup can fail here
    InputStreamReader isr = new InputStreamReader(sbais, StandardCharsets.UTF_8);
    int charsRead = 0;
    StringBuilder content = new StringBuilder();
    int bufferSize = 1024;
    char[] buffer = new char[bufferSize];
    // possible IOException
    while ((charsRead = isr.read(buffer)) != -1) {
        // append only the chars actually read, without copying the buffer
        content.append(buffer, 0, charsRead);
    }
    return content.toString();
}
BTW, is JavaMail 1.4.1 a requirement? Up to date version is 1.4.5.

What worked for me was that I called getContentType() and I would check if the String contains a "utf" in it (defining the charset used as one of UTF).
If yes, I would treat the content differently in this case.
private String encodeCorrectly(InputStream is) {
    // "\\A" matches the start of input, so the Scanner returns the whole stream as one token
    java.util.Scanner s = new java.util.Scanner(is, StandardCharsets.UTF_8.toString()).useDelimiter("\\A");
    return s.hasNext() ? s.next() : "";
}
(a modification of an InputStream-to-String converter from this answer on SO)
The important part here is using the correct Charset. This solved the issue for me.

First of all you must add headers according to UTF-8 encoding this way:
...
MimeMessage msg = new MimeMessage(session);
msg.setHeader("Content-Type", "text/html; charset=UTF-8");
msg.setHeader("Content-Transfer-Encoding", "8bit");
msg.setFrom(new InternetAddress(doConversion(from)));
msg.setRecipients(javax.mail.Message.RecipientType.TO, address);
msg.setSubject(asunto, "UTF-8");
MimeBodyPart mbp1 = new MimeBodyPart();
mbp1.setContent(text, "text/html; charset=UTF-8");
Multipart mp = new MimeMultipart();
mp.addBodyPart(mbp1);
...
But for the 'from' header, I use the following method to convert characters:
public String doConversion(String original) {
if(original == null) return null;
String converted = original.replaceAll("á", "\u00c3\u00a1");
converted = converted.replaceAll("Á", "\u00c3\u0081");
converted = converted.replaceAll("é", "\u00c3\u00a9");
converted = converted.replaceAll("É", "\u00c3\u0089");
converted = converted.replaceAll("í", "\u00c3\u00ad");
converted = converted.replaceAll("Í", "\u00c3\u008d");
converted = converted.replaceAll("ó", "\u00c3\u00b3");
converted = converted.replaceAll("Ó", "\u00c3\u0093");
converted = converted.replaceAll("ú", "\u00c3\u00ba");
converted = converted.replaceAll("Ú", "\u00c3\u009a");
converted = converted.replaceAll("ñ", "\u00c3\u00b1");
converted = converted.replaceAll("Ñ", "\u00c3\u0091");
converted = converted.replaceAll("€", "\u00e2\u0082\u00ac");
converted = converted.replaceAll("¿", "\u00c2\u00bf");
converted = converted.replaceAll("ª", "\u00c2\u00aa");
converted = converted.replaceAll("º", "\u00c2\u00ba");
return converted;
}
If you need to include other characters, you can look up their UTF-8 hex encoding at http://www.fileformat.info/info/charset/UTF-8/list.htm.

Get Multilingual Data from ByteBuffer

I am receiving ByteBuffers in a UDP Java application.
The data in a ByteBuffer can be any string in any language, or any special chars, separated by zero bytes.
I use the following code to get Strings from it.
public String getString() {
byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
this.byteBuffer.slice().get(remainingBytes);
String dataString = new String(remainingBytes);
int stringEnd = dataString.indexOf(0);
if(stringEnd == -1) {
return null;
} else {
dataString = dataString.substring(0, stringEnd);
this.byteBuffer.position(this.byteBuffer.position() + dataString.getBytes().length + 1);
return dataString;
}
}
These strings are stored in a MySQL DB with everything set to UTF-8.
If I run the application on Windows, special chars like ® are displayed but Chinese characters are not.
After adding the VM argument -Dfile.encoding=UTF8, Chinese characters are displayed but chars like ® are shown as ?? etc.
Please help.
Edit:
The input strings in the UDP packet are variable-length byte fields, encoded in UTF-8 and terminated by 0x00.
For JDBC I also use useUnicode=true&characterEncoding=UTF-8

String dataString = new String(remainingBytes); is wrong. You should almost never do that. You should find out what encoding was used to put the bytes into the UDP packet, and use the same encoding on that line:
String dataString = new String(remainingBytes, encoding); // e.g. "UTF-8"
Edit: based on your updated question, encoding should be "UTF-8"
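Putting that together with the 0x00 terminator, one possible shape for the method is sketched below. It is a standalone static variant that takes the buffer as a parameter (the class name is made up). It looks for the terminator in the raw bytes before decoding, which is safe because UTF-8 never uses a 0x00 byte inside a multi-byte sequence, and it avoids re-encoding the String just to compute how far to advance.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ZeroTerminatedStrings {
    // Reads one UTF-8 string terminated by 0x00 and moves the position past the terminator.
    // Returns null and leaves the position unchanged if no terminator is found.
    static String getString(ByteBuffer buffer) {
        int start = buffer.position();
        for (int i = start; i < buffer.limit(); i++) {
            if (buffer.get(i) == 0) {          // absolute get: does not move the position
                byte[] raw = new byte[i - start];
                buffer.get(raw);               // relative get: position is now at the terminator
                buffer.get();                  // skip the terminator itself
                return new String(raw, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two strings, "®" and a Chinese word, each followed by a zero byte.
        byte[] packet = "\u00ae\u0000\u4e2d\u6587\u0000".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.wrap(packet);
        System.out.println(getString(buffer));
        System.out.println(getString(buffer));
    }
}
```

With the charset given explicitly, the -Dfile.encoding setting no longer affects how the packet is decoded.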

I'm not sure, but dataString contains only the data up to this zero, because stringEnd points at the first zero position, not past it.
dataString = dataString.substring(0, stringEnd+1);
or
char specChar = dataString.charAt(stringEnd); which should return only the special character. But as I said at the beginning, I'm not sure...

Check if a String contains encoded characters

Hello, I am looking for a way to detect whether a string has been encoded.
For example:
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
The output of this encoded variable is:
HellÃ¤ world
As you can see, there is an A with a tilde and another symbol. Is there a way to check if the output contains encoded characters?

Sounds like you want to check if a string that was decoded from bytes in latin1 could have been decoded in UTF-8, too. That's easy because illegal byte sequences are replaced by the character \ufffd:
String recoded = new String(encoded.getBytes("iso-8859-1"), "UTF-8");
return recoded.indexOf('\uFFFD') == -1; // No replacement character found
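Wrapped into a complete method, the check could look like this sketch (the class and method names are made up). One caveat worth knowing: a plain ASCII string also passes, because ASCII is valid UTF-8, so the method only says that the string could have been UTF-8 read as Latin-1, not that it was.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeCheck {
    // True if the Latin-1 bytes of s also form valid UTF-8, i.e. s may be
    // UTF-8 text that was decoded as ISO-8859-1 by mistake.
    static boolean decodesCleanlyAsUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        String recoded = new String(latin1, StandardCharsets.UTF_8);
        return recoded.indexOf('\uFFFD') == -1; // no replacement character found
    }

    public static void main(String[] args) {
        System.out.println(decodesCleanlyAsUtf8("Hell\u00c3\u00a4 world")); // the garbled form
        System.out.println(decodesCleanlyAsUtf8("Hell\u00e4 world"));       // the correct form
    }
}
```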

Your question doesn't make sense. A java String is a list of characters. They don't have an encoding until you convert them into bytes, at which point you need to specify one (although you will see a lot of code that uses the platform default, which is what e.g. String.getBytes() with no argument does).
I suggest you read this http://kunststube.net/encoding/.

String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
This code is just a character corruption bug. You take a UTF-16 string, transcode it to UTF-8, pretend it is ISO-8859-1 and transcode it back to UTF-16, resulting in incorrectly encoded characters.

If I understood your question correctly, this code may help you. The function isEncoded checks whether its parameter can be encoded as ASCII or whether it contains non-ASCII chars.
public boolean isEncoded(String text){
    Charset charset = Charset.forName("US-ASCII");
    // chars that US-ASCII cannot represent come back as '?', so the round trip changes the text
    String checked=new String(text.getBytes(charset),charset);
    return !checked.equals(text);
}
@Test
public void testAscii() throws Exception{
    Assert.assertFalse(isEncoded("Hello world"));
}
@Test
public void testNonAscii() throws Exception{
    Assert.assertTrue(isEncoded("Hellä world"));
}
You can also check against another charset by changing the charset variable or turning it into a parameter.

I'm not really sure what you are trying to do or what your problem is.
This line doesn't make any sense:
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
You are encoding your name as "UTF-8" and then trying to decode it as "iso8859-1".
If you want to encode your name as "iso8859-1", just do name.getBytes("iso8859-1").
Please tell us what problem you encountered so that we can help more.

You can check whether your string is encoded with this code:
public boolean isEncoded(String input) {
    char[] charArray = input.toCharArray();
    for (int i = 0, charArrayLength = charArray.length; i < charArrayLength; i++) {
        char c = charArray[i];
        if (Character.getType(c) == Character.OTHER_LETTER) {
            return true;
        }
    }
    return false;
}

Regex and ISO-8859-1 charset in java

I have some text encoded in ISO-8859-1, from which I then extract some data using regex.
The problem is that the strings I get from the matcher object are garbled; chars like "ÅÄÖ" come out scrambled.
How do I stop the regex library from scrambling my chars?
Edit: Here's some code:
private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
HttpGet get = new HttpGet(url);
return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
InputStream input = response.getEntity().getContent();
StringBuilder builder = new StringBuilder();
int read;
byte[] tmp = new byte[1024];
while ((read = input.read(tmp))!=-1)
{
builder.append(new String(tmp), 0,read-1);
}
return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff

This is probably the immediate cause of your problem, and it's definitely an error:
builder.append(new String(tmp), 0, read-1);
When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
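To illustrate that the regex library is not the culprit, here is a small sketch with a made-up class name and a made-up pattern. It runs the same regex over the same ISO-8859-1 bytes decoded with two different charsets; only the decoding step changes what the matcher returns.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexAfterDecoding {
    // Decodes the body with the given charset, then extracts the text of every <b> element.
    static List<String> findBold(byte[] body, Charset charset) {
        String html = new String(body, charset);
        Matcher matcher = Pattern.compile("<b>(.*?)</b>").matcher(html);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // "ÅÄÖ" inside a tag, encoded as ISO-8859-1 like the page in the question.
        byte[] body = "<b>\u00c5\u00c4\u00d6</b>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findBold(body, StandardCharsets.ISO_8859_1)); // letters intact
        System.out.println(findBold(body, StandardCharsets.UTF_8));      // wrong charset: letters lost
    }
}
```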

It's html from a website.
Use a HTML parser and this problem and all future potential problems will disappear.
I can recommend picking Jsoup for the job.
See also:
Regular Expressions - Now you have two problems
Parsing HTML - The Cthulhu way
Pros and cons of HTML parsers in Java
