Encoding issue while reading the content of an email with JavaMail - java

I'm reading messages from an email account using JavaMail 1.4.1 (upgrading to 1.4.5 made no difference), but I'm having issues with the encoding of the content:
POP3Message pop3message;
...
Object contentObject = pop3message.getContent();
...
String contentType = pop3message.getContentType();
String content = contentObject.toString();
Some messages are read properly, but others contain strange characters because an unsuitable encoding is applied. I have realized that it fails only for one specific content type.
It works well if the contentType is any of these:
text/plain; charset=ISO-8859-1
text/plain;
charset="iso-8859-1"
text/plain;
charset="ISO-8859-1";
format="flowed"
text/plain; charset=windows-1252
but it doesn't if it is:
text/plain;
charset="utf-8"
For this content type (the UTF-8 one), pop3message.getEncoding() returns
quoted-printable
For such messages, the String value I see in the debugger (and in the database after persisting the object) is, for example:
UbicaciÃ³n (instead of Ubicación)
But if I open the email with the email client in a browser it can be read without any problem, and it's a normal message (no attachments, just text), so the message seems to be OK.
Any idea about how to solve this issue?
Thanks.
UPDATE
This is the piece of code I've added to try the function getUTF8Content() given by jlordo
POP3Message pop3message = (POP3Message) message;
String uid = pop3folder.getUID(message);
//START JUST FOR TESTING PURPOSES
if(uid.trim().equals("1401")){
Object utfContent = pop3message.getContent();
System.out.println(utfContent.getClass().getName()); // it is of type String
//System.out.println(utfContent); // if not commented out, it prints the content of one of the emails I'm having problems with.
System.out.println(pop3message.getEncoding()); //prints: quoted-printable
System.out.println(pop3message.getContentType()); //prints: text/plain; charset="utf-8"
String utfContentString = getUTF8Content(utfContent); // throws java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.util.SharedByteArrayInputStream
System.out.println(utfContentString);
}
//END TEST CODE

How are you detecting that these messages have "strange characters"? Are you displaying the data somewhere? It's possible that whatever method you're using to display the data isn't handling Unicode characters properly.
The first step is to determine whether the problem is that you're getting the wrong characters, or that the correct characters are being displayed incorrectly. You can examine the Unicode values of each character in the data (e.g., in the String returned from the getContent method) to make sure each character has the correct Unicode value. If it does, the problem is with the method you're using to display the characters.
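A minimal sketch of that check, assuming nothing beyond the JDK: it prints the Unicode code point of every character so the values can be compared with the expected ones, independent of how a console, debugger, or database tool renders them. The class and method names are made up for illustration.

```java
import java.util.StringJoiner;

final class CodePointDump {
    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringJoiner joiner = new StringJoiner(" ");
        text.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // The expected word, written with an escape so the source file encoding does not matter.
        System.out.println(dump("Ubicaci\u00f3n"));
        // The same word after its UTF-8 bytes were wrongly decoded as ISO-8859-1.
        System.out.println(dump("Ubicaci\u00c3\u00b3n"));
    }
}
```

If the String returned by getContent() shows U+00F3 for the accented letter, the data is correct and the display is at fault; if it shows U+00C3 followed by U+00B3, the bytes were decoded with the wrong charset.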

try this and let me know if it works:
if ( *check if utf 8 here* ) {
content = getUTF8Content(contentObject);
}
// Imports needed by the method below
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.mail.util.SharedByteArrayInputStream;

// TODO take care of IOException and ClassCastException in the caller
public static String getUTF8Content(Object contentObject) throws IOException {
    // possible ClassCastException if the content is not a stream (e.g. it is already a String)
    SharedByteArrayInputStream sbais = (SharedByteArrayInputStream) contentObject;
    // UTF-8 is always supported, so no UnsupportedEncodingException can occur here
    InputStreamReader isr = new InputStreamReader(sbais, Charset.forName("UTF-8"));
    int charsRead;
    StringBuilder content = new StringBuilder();
    int bufferSize = 1024;
    char[] buffer = new char[bufferSize];
    // possible IOException
    while ((charsRead = isr.read(buffer)) != -1) {
        // append only the chars that were actually read
        content.append(buffer, 0, charsRead);
    }
    return content.toString();
}
BTW, is JavaMail 1.4.1 a requirement? Up to date version is 1.4.5.

What worked for me was that I called getContentType() and I would check if the String contains a "utf" in it (defining the charset used as one of UTF).
If yes, I would treat the content differently in this case.
private String encodeCorrectly(InputStream is) {
java.util.Scanner s = new java.util.Scanner(is, StandardCharsets.UTF_8.toString()).useDelimiter("\\A");
return s.hasNext() ? s.next() : "";
}
(a modification of an InputStream-to-String converter from this answer on SO)
The important part here is using the correct Charset. This solved the issue for me.
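Rather than only checking for "utf" in the content type, the charset name can be pulled out of the header value and used directly. The following is a rough sketch under that assumption, using only the JDK; the class name, the method name, and the regular expression are illustrative and not part of JavaMail.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches charset=utf-8 as well as charset="utf-8", ignoring case.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the content type, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "text/plain;\r\n charset=\"utf-8\"";
        System.out.println(charsetOf(header, StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset can then be passed to the Scanner or an InputStreamReader instead of a hard-coded UTF-8.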

First of all you must add headers according to UTF-8 encoding this way:
...
MimeMessage msg = new MimeMessage(session);
msg.setHeader("Content-Type", "text/html; charset=UTF-8");
msg.setHeader("Content-Transfer-Encoding", "8bit");
msg.setFrom(new InternetAddress(doConversion(from)));
msg.setRecipients(javax.mail.Message.RecipientType.TO, address);
msg.setSubject(asunto, "UTF-8");
MimeBodyPart mbp1 = new MimeBodyPart();
mbp1.setContent(text, "text/html; charset=UTF-8");
Multipart mp = new MimeMultipart();
mp.addBodyPart(mbp1);
...
But for the 'from' header, I use the following method to convert characters:
public String doConversion(String original) {
if(original == null) return null;
String converted = original.replaceAll("á", "\u00c3\u00a1");
converted = converted.replaceAll("Á", "\u00c3\u0081");
converted = converted.replaceAll("é", "\u00c3\u00a9");
converted = converted.replaceAll("É", "\u00c3\u0089");
converted = converted.replaceAll("í", "\u00c3\u00ad");
converted = converted.replaceAll("Í", "\u00c3\u008d");
converted = converted.replaceAll("ó", "\u00c3\u00b3");
converted = converted.replaceAll("Ó", "\u00c3\u0093");
converted = converted.replaceAll("ú", "\u00c3\u00ba");
converted = converted.replaceAll("Ú", "\u00c3\u009a");
converted = converted.replaceAll("ñ", "\u00c3\u00b1");
converted = converted.replaceAll("Ñ", "\u00c3\u0091");
converted = converted.replaceAll("€", "\u00e2\u0082\u00ac");
converted = converted.replaceAll("¿", "\u00c2\u00bf");
converted = converted.replaceAll("ª", "\u00c2\u00aa");
converted = converted.replaceAll("º", "\u00c2\u00ba");
return converted;
}
If you need to include other characters, you can look up their UTF-8 hex encoding at http://www.fileformat.info/info/charset/UTF-8/list.htm.
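Each entry in that table maps a character to its UTF-8 bytes written out as individual ISO-8859-1 characters. Assuming that is the intended effect, the same result can be produced for any character without a lookup table. This is only a sketch with made-up names; whether this kind of double encoding is needed at all depends on the mail setup.

```java
import java.nio.charset.StandardCharsets;

final class Utf8AsLatin1 {
    // Encodes the text as UTF-8, then reinterprets each byte as one ISO-8859-1 char.
    static String utf8BytesAsLatin1(String original) {
        if (original == null) {
            return null;
        }
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // a-acute (U+00E1) becomes the two chars U+00C3 U+00A1
        String converted = utf8BytesAsLatin1("\u00e1");
        System.out.println(converted.length());
    }
}
```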

Related

Why does my code return unicode characters?

String encodedInputText = URLEncoder.encode("input=" + question, "UTF-8");
urlStr = Parameters.getWebserviceURL();
URL url = new URL(urlStr + encodedInputText + "&sku=" + sku);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
jsonOutput = in.readLine();
in.close();
The problem is that the returned JSON string contains Unicode escape sequences such as
"question":"\u51e0\u5339\u7684",
rather than the actual Chinese characters. Specifying "UTF-8" should solve the problem. Why doesn't it?
EDIT:
ObjectMapper mapper = new ObjectMapper();
ResponseList responseList = mapper.readValue(jsonOutput, ResponseList.class);
This is not an encoding problem; it is a problem with your data source. Encoding comes into play when you convert bytes into a String. You expect the encoding to turn a string of the form \uxxxx into another string, which is not going to happen.
The point is that the data source serializes the data this way, so the raw characters are gone and have been replaced with \uxxxx sequences.
You would now have to capture the \uxxxx sequences manually and convert them to the actual characters.
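A minimal sketch of that manual conversion, using only java.util.regex. The names are invented for illustration, and the approach is deliberately naive: it does not handle an escaped backslash placed directly in front of the sequence. A JSON parser normally decodes these escapes itself when it parses a string value, so this is only relevant when working with the raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescape {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every escape sequence with the UTF-16 code unit it denotes.
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes produce literal backslash-u text, as in the raw response.
        String raw = "\"question\":\"\\u51e0\\u5339\\u7684\"";
        System.out.println(unescape(raw).length());
    }
}
```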

inputStream and utf 8 sometimes shows "?" characters

I've been dealing with this problem for over a month now. I have checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text shows "?" characters, most likely because the site is in Hebrew.
Here's my code,
public static InputStream openHttpGetConnection(String url)
throws Exception {
InputStream inputStream = null;
HttpClient httpClient = new DefaultHttpClient();
HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
inputStream = httpResponse.getEntity().getContent();
return inputStream;
}
public static String downloadSource(String url) {
int BUFFER_SIZE = 1024;
InputStream inputStream = null;
try {
inputStream = openHttpGetConnection(url);
} catch (Exception e) {
// TODO: handle exception
}
int bytesRead;
String str = "";
byte[] inpputBuffer = new byte[BUFFER_SIZE];
try {
while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
str +=read;
}
} catch (Exception e) {
// TODO: handle exception
}
return str;
}
Thanks.
To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
    String read = new String(inputBuffer, 0, charsRead);
    str += read;
}
The bytes are decoded into characters as they are read; it is the reader's job to know whether a character needs one, two, or more bytes. It's basically your approach, but decoding while the bytes are being read in, instead of after.
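Put together as a self-contained method, the Reader approach might look like the sketch below. The class and method names are illustrative, and a StringBuilder replaces the repeated String concatenation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderDecode {
    // Decodes the whole stream with the given charset; the reader keeps track of
    // multi-byte characters that straddle two underlying reads.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                text.append(buffer, 0, charsRead);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Four Hebrew letters, written as escapes; each takes two bytes in UTF-8.
        byte[] utf8 = "\u05e9\u05dc\u05d5\u05dd".getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(decoded.length());
    }
}
```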
Converting an InputStream to a String entails specifying an encoding, just as you do at new String(inpputBuffer, 0, bytesRead,"UTF-8");.
But your approach has several drawbacks.
How do you know you have to use UTF-8?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding will be used in the HTTP response. But HTTP provides a mechanism for specifying that, using the Content-Type header.
More specifically, your response object should have a Content-Type header with a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever follows charset= to transform your bytes to chars.
Seeing you seem to use Apache HTTPClient, their documentation states :
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
Alternate way
If there is no Content-Type header, and if you know your content is HTML, then you can try to convert it to a String using some encoding (UTF-8 or ISO Latin preferably), look for something matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
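A rough sketch of that fallback, using only the JDK. The class name, the 4096-byte probe size, and the regular expression are assumptions made for illustration; real pages can declare their charset in ways this pattern does not catch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Catches both <meta charset="..."> and the older http-equiv form with charset=... in content.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static String decodeHtml(byte[] body, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup survives this pass.
        int probeLength = Math.min(body.length, 4096);
        String probe = new String(body, 0, probeLength, StandardCharsets.ISO_8859_1);
        Charset charset = fallback;
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: keep the fallback.
            }
        }
        // Second pass: decode the full body with the charset that was found.
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"UTF-8\"></head><body>\u05e9</body></html>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeHtml(body, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```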
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert each chunk to a String, which may not be possible.
UTF-8 encodes some characters as several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
then new String(bytes, offset, length, charsetName) cannot decode the trailing byte, because on its own it is not a complete UTF-8 sequence; in practice it is replaced with a substitution character.
Your following read will get what's left to read:
[a9]
which on its own is not an "é" character either.
Bottom line: with your pattern, even a valid UTF-8 byte stream may not be converted to the correct String.
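The effect can be reproduced with the four bytes from the example. In this sketch (class and method names are illustrative), decoding the chunks one by one yields replacement characters, while decoding all bytes at once yields the two expected characters.

```java
import java.nio.charset.StandardCharsets;

final class SplitChunkDemo {
    // Decodes each chunk on its own, the way the read loop in the question does.
    static String decodeChunks(byte[][] chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] whole = {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3, (byte) 0xA9};
        byte[][] chunks = {
            {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3}, // first read ends mid-character
            {(byte) 0xA9}                            // second read starts mid-character
        };
        String correct = new String(whole, StandardCharsets.UTF_8);
        String broken = decodeChunks(chunks);
        System.out.println(correct.equals(broken));        // the two results differ
        System.out.println(broken.indexOf('\uFFFD') >= 0); // replacement character present
    }
}
```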
Going forward: since you use HttpClient, use its own method for converting the HTTP response to a String.
If you wish to do it yourself, the easy way is to copy your input to a byte array, and then convert the byte array. Something along the lines of (pseudo code):
ByteArrayOutputStream responseContent = new ByteArrayOutputStream()
copyAllBytes(responseInputStream, responseContent)
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding);
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the whole response in memory) or, as @jas answers, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster when many concatenations occur).
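A runnable version of the buffer-then-decode pseudo code could look like the following sketch. The copyAllBytes helper from the pseudo code is replaced by a plain read loop, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BufferThenDecode {
    // Collects every byte first, then decodes once, so no character is ever split.
    static String readFully(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int bytesRead;
        while ((bytesRead = in.read(chunk)) != -1) {
            raw.write(chunk, 0, bytesRead);
        }
        return new String(raw.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] response = "\u00e9\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = readFully(new ByteArrayInputStream(response), StandardCharsets.UTF_8);
        System.out.println(text.length());
    }
}
```

The trade-off is memory: the whole response is held as bytes and again as a String, which is fine for typical HTML pages but not for very large bodies.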

Convert encoded string to readable string in java

I am trying to send a POST request from a C# program to my Java server.
I send the request together with a JSON object.
I receive the request on the server and can read what is sent using the following Java code:
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
OutputStream out = conn.getOutputStream();
String line = reader.readLine();
String contentLengthString = "Content-Length: ";
int contentLength = 0;
while(line.length() > 0){
if(line.startsWith(contentLengthString))
contentLength = Integer.parseInt(line.substring(contentLengthString.length()));
line = reader.readLine();
}
char[] temp = new char[contentLength];
reader.read(temp);
String s = new String(temp);
The string s is now the representation of the JSON object that I sent from the C# client. However, some characters are now messed up.
Original json object:
{"key1":"value1","key2":"value2","key3":"value3"}
Received string:
%7b%22key1%22%3a%22value1%22%2c%22key2%22%3a%22value2%22%2c%22key3%22%3a%22value3%22%%7d
So my question is: how do I convert the received string so it looks like the original one?
Seems like URL Encoded so why not use java.net.URLDecoder
String s = java.net.URLDecoder.decode(new String(temp), StandardCharsets.UTF_8);
This is assuming the Charset is in fact UTF-8
Those appear to be URL encoded, so I'd use URLDecoder, like so:
String in = "%7b%22key1%22%3a%22value1%22%2c%22key2"
+ "%22%3a%22value2%22%2c%22key3%22%3a%22value3%22%7d";
try {
String out = URLDecoder.decode(in, "UTF-8");
System.out.println(out);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Note you seemed to have an extra percent in your example, because the above prints
{"key1":"value1","key2":"value2","key3":"value3"}
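The stray percent sign matters because the URLDecoder documentation leaves the handling of malformed escapes to the implementation, and it may throw an IllegalArgumentException. A defensive sketch, with a made-up helper name, that reports malformed input as null instead of letting the exception propagate:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class SafeUrlDecode {
    // Returns the decoded text, or null if the input contains a malformed escape.
    static String decodeOrNull(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (IllegalArgumentException e) {
            // Raised for input such as a percent sign not followed by two hex digits.
            return null;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrNull("%7b%22key1%22%3a%22value1%22%7d"));
        System.out.println(decodeOrNull("%22value3%22%%7d"));
    }
}
```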

How to convert UTF-8 to GBK string in java

I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="Óû§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see it from the FF browser display. But I do not know how to convert that name string into a readable GBK string (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but the GBK text is not printed correctly:
???¡ì??:????12
I have no clue how to proceed.
If you have already read the name with the wrong encoding and got the wrong value "Óû§Ãû:ôâÈ»12", you can try this, as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you read a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "GBK"));
name = r.readLine();
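The repair works because ISO-8859-1 maps every byte to exactly one character and back, so the original bytes can be recovered without loss. A small round-trip sketch with illustrative names; it assumes the wrong decoding really was ISO-8859-1 (with windows-1252 some bytes have no mapping and the trick can lose data) and that the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GbkRepair {
    // Recovers the original bytes from the mis-decoded text and decodes them again.
    static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GBK")) {
            System.out.println("GBK is not available on this JVM");
            return;
        }
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e0a\u6d77"; // the two characters of "Shanghai"
        // Simulate the mistake: GBK bytes decoded as ISO-8859-1.
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled, gbk).equals(original));
    }
}
```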
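Assuming that name.getBytes() returns the GBK-encoded bytes (getBytes() without arguments uses the platform default charset, so this depends on your environment), it is enough to create the String by specifying the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.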
Sample code:
String gbkString = "Óû§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12

Http GET of source containing non-UTF-8 characters

I solved an issue I had with retrieving and displaying non-UTF-8 characters but I don't understand why my solution works.
The following code:
final HttpClient client = new HttpClient();
final HttpMethod method = new GetMethod(urlString);
client.executeMethod(method);
final String responseBodyAsString = method.getResponseBodyAsString();
System.out.println(responseBodyAsString);
was messing up some characters on the display, such as YÃ¡Ã±ez
I changed:
final String responseBodyAsString = method.getResponseBodyAsString();
to
final ByteBuffer inputBuffer = ByteBuffer.wrap(method.getResponseBody());
final String responseBodyAsString = new String(inputBuffer.array());
and the same string as before is represented correctly as Yáñez
Why is that?
getResponseBodyAsString() uses the HTTP response's Content-Type header to know what the response body's charset is so the data can be converted to a String as needed. getResponseBody() simply returns the body's raw bytes as-is, which you are then converting to a String using the platform's default charset. Since you are able to get the desired String output by converting the raw bytes manually, that suggests to me that the HTTP server is not specifying a charset in the response's Content-Type header at all, or is specifying the wrong charset.
YÃ¡Ã±ez is what the UTF-8 encoding of Yáñez looks like when its bytes are decoded as ISO-8859-1, so it is odd that the String(byte[]) constructor would be able to decode it correctly, unless the platform's default charset is actually UTF-8. It does make sense for getResponseBodyAsString() to return YÃ¡Ã±ez if the response charset used is ISO-8859-1, which is the default charset for text/... media types sent over HTTP when no charset is explicitly specified, per RFC 2616 Section 3.7.1.
I would suggest looking for a bug in the server script that is sending the data (or filing a bug report with the server admin), before suspecting a bug in getResponseBodyAsString(). You can use a packet sniffer like Wireshark, or a debugging proxy like Fiddler, to confirm the missing/invalid charset in the response Content-Type header.
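The two outcomes can be reproduced without any HTTP traffic. In this sketch (names are illustrative) the same UTF-8 bytes are decoded once as ISO-8859-1, which is what happens when that default is applied, and once as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decodes the raw body bytes with an explicitly chosen charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The name from the question, with escapes for a-acute and n-tilde.
        byte[] body = "Y\u00e1\u00f1ez".getBytes(StandardCharsets.UTF_8);
        String asLatin1 = decode(body, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(body, StandardCharsets.UTF_8);
        System.out.println(asLatin1.length()); // 7: each accented letter became two chars
        System.out.println(asUtf8.length());   // 5
    }
}
```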
Try the following:
private static final String UNICODE = "ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII = "AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";
public static String convertNonAscii(String str) {
if (str == null) {
return null;
}
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
int pos = UNICODE.indexOf(c);
if (pos > -1)
sb.append(PLAIN_ASCII.charAt(pos));
else {
sb.append(c);
}
}
return sb.toString();
}
public static void main(String[] args) {
Pattern p = Pattern.compile("[^\\x00-\\x7E]", Pattern.CASE_INSENSITIVE);
System.out.println(p.matcher(UNICODE).find());
System.out.println(p.matcher(PLAIN_ASCII).find());
System.out.println(convertNonAscii("ú or ñ"));
}
Output:
true
false
u or n
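As an alternative to the hand-written lookup table (a different technique, not the one above), java.text.Normalizer can decompose accented letters so that the combining marks can be stripped. A sketch with an illustrative class name; note that letters without a canonical decomposition, such as "ø", are left unchanged.

```java
import java.text.Normalizer;

final class StripAccents {
    // Decomposes each accented letter into base letter plus combining marks,
    // then removes the marks (Unicode general category M).
    static String strip(String text) {
        if (text == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // u-acute and n-tilde, written as escapes
        System.out.println(strip("\u00fa or \u00f1"));
    }
}
```

Like the answer above, this throws away information, so it does not fix a charset problem; it only produces an ASCII approximation.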
