I have a problem with my Java code that sends email to users: the encoding of the email is wrong. When the email arrives in the recipient's account, the subject line ($subject) has encoding problems and shows strange characters (?) in place of parts of my subject text.
The email body itself is fine, only the subject line is affected. I have searched all over but can't find a fix. After setting the content type to text/html with a Unicode charset, the mail body has no problem with special characters such as ó, but the same fix does not work for the subject line.
I have a class that sends an email with JavaMail, with a subject text like this one:
"Estimado Iván Escobedo:"
The problem is that when the mail arrives at its destination, it arrives this way:
"Estimado Iv?n Escobedo:"
All the á, é, í, ó, ú, etc. special characters are replaced with ?.
What could be the problem and how can I solve it?
You should use something like this to read the message properly:
TextMessage txtMessage = (TextMessage) message;
ByteArrayInputStream bais = new ByteArrayInputStream(txtMessage.getText().getBytes("ISO-8859-15"));
Edit:
Sanjay found the solution.
In order to set the subject properly before sending, use:
MimeUtility.encodeText(SubjectText, "ISO-8859-15", "Q")
encodeText :
Encode a RFC 822 "text" token into mail-safe form as per RFC 2047.
The given Unicode string is examined for non US-ASCII characters. If the string contains only US-ASCII characters, it is returned as-is. If the string contains non US-ASCII characters, it is first character-encoded using the specified charset, then transfer-encoded using either the B or Q encoding. The resulting bytes are then returned as a Unicode string containing only ASCII characters.
Note that this method should be used to encode only "unstructured" RFC 822 headers.
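A minimal sketch of applying that fix when building the message (the Session and the subject text are assumed to exist under illustrative names; alternatively, MimeMessage.setSubject(String, String) performs the same RFC 2047 encoding for you):
import java.io.UnsupportedEncodingException;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeUtility;

public class SubjectEncodingSketch {
    // Builds a message whose subject keeps characters such as the í in "Iván".
    static MimeMessage buildMessage(Session session, String subjectText)
            throws MessagingException, UnsupportedEncodingException {
        MimeMessage message = new MimeMessage(session);
        // Explicitly RFC 2047 encode the subject header value...
        message.setSubject(MimeUtility.encodeText(subjectText, "ISO-8859-15", "Q"));
        // ...or let JavaMail do it: message.setSubject(subjectText, "ISO-8859-15");
        return message;
    }
}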
Related
We are using JavaMail to send mail with PDF attachments. When Unicode characters are present in the filename, the attachment ends up named with the raw MIME-encoded string. On closer inspection of the mail headers, we found that the ? characters in the MIME-encoded filename are dropped. For example:
Expected:
Content-Disposition: attachment;
filename="=?utf8?Q?hinzugef=C3=BCgte.pdf?="
Obtained:
Content-Disposition: attachment;
filename="=utf8Qhinzugef=C3=BCgte.pdf="
Because of this, the filename of the attachment is =utf8Qhinzugef=C3=BCgte.pdf= and we are unable to open it.
If I manually edit the .eml file, add the ? characters in the right places, and open it in Outlook, the file is displayed as a PDF as expected.
This issue has been reported against Exchange Server; we are unable to reproduce it with Gmail or with FakeSMTP (running on my machine, used to test mail).
Sample code:
MimeBodyPart mbp2 = new MimeBodyPart();
String attFileName = file.getName();
// Encode the name with the platform default charset, then decode those bytes as UTF-8
String i18nFileName = new String(attFileName.getBytes(), "UTF-8");
String mimeType = mimeMap.getContentType(attFileName);
attStream = new FileInputStream(att);
ByteArrayDataSource bas = new ByteArrayDataSource(attStream, mimeType);
mbp2.setDataHandler(new DataHandler(bas));
// RFC 2047 encode the filename with the default MIME charset
mbp2.setFileName(MimeUtility.encodeText(i18nFileName));
mp.addBodyPart(mbp2);
if (attStream != null) {
    attStream.close();
}
Why does this happen? Any leads would be very helpful.
This is wrongly encoded to begin with.
What you implemented is RFC 2047, but that does not apply to HTTP at all.
RFC 6266 § 4.3 explains how to deal with the filename= parameter of that HTTP header and then refers to
RFC 5987, obsoleted by RFC 8187 § 3.2.3, for how to incorporate non-ASCII characters.
The generic form is filename*=UTF-8''Na%C3%AFve%20file.txt (a short sketch of producing this form follows the list below), and it differs in several aspects from the RFC 2047 form you implemented:
filename*= should be used - note the trailing asterisk on the parameter name. It signals the extended notation - without it, neither a charset nor percent encoding is expected.
Enclosing the value in "quotation marks" is neither needed nor allowed when using the extended notation.
Likewise, the prefix =?, the suffix ?=, and the ?Q? encoding parameter are never expected. Logically they also make no sense, since only percent encoding is available and it covers the entire value, not just some part of it.
The '' part is for the optional language tag - it could be 'en' for English, but effectively nobody cares about that.
The rest is trivial: each byte of a UTF-8 character sequence is percent-encoded. A space must be percent-encoded, too (that is, %20).
The correct charset name is UTF-8, while utf8 is wrong - don't rely on that unofficial alias being accepted, even though it is tolerated every now and then.
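As a rough Java sketch of producing that extended form (URLEncoder only approximates the attr-char rules of RFC 8187, e.g. it leaves * unescaped, so treat this as illustrative rather than a complete implementation):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public final class ContentDispositionSketch {
    // Builds the extended parameter value, e.g. "UTF-8''Na%C3%AFve%20file.txt".
    static String extendedFilename(String filename) throws UnsupportedEncodingException {
        String percentEncoded = URLEncoder.encode(filename, "UTF-8")
                .replace("+", "%20");   // URLEncoder uses '+' for spaces; the header wants %20
        return "UTF-8''" + percentEncoded;
    }

    public static void main(String[] args) throws Exception {
        // Prints: filename*=UTF-8''Na%C3%AFve%20file.txt
        System.out.println("filename*=" + extendedFilename("Naïve file.txt"));
    }
}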
In other words: the client acted correctly. If I use Thunderbird 68 and either hit Ctrl+U to see an e-mail's source, or save an e-mail as an .EML file and then look into that file, I see a multipart message in which each attachment has the headers
Content-Disposition: inline;
filename*=utf-8''L%20%2D%20qualita%CC%88t.pdf
Content-Type: application/pdf;
x-unix-mode=0644;
name="=?utf-8?Q?L_-_qualita=CC=88t=2Epdf?="
Don't get confused because you now see both variants - they still have different purposes and different contexts. What you primarily want is filename (although it can't hurt to also provide a name). If you look closely, the values also differ (the former has spaces, the latter uses underscores - but that was the sender's free decision). The UTF-8 byte sequence %CC%88 or =CC=88 is the codepoint U+0308 COMBINING DIAERESIS (turning the preceding a into an ä).
This answer explains how differently HTTP browsers treated RFC 5987 back in 2011.
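On the JavaMail side of the original question, the filename*= (RFC 2231) form can be produced by JavaMail itself rather than by hand-rolling the header. A hedged sketch, assuming a reasonably recent JavaMail; the property names come from the javax.mail.internet package documentation, so verify them against the version in use:
import javax.mail.MessagingException;
import javax.mail.internet.MimeBodyPart;

public class AttachmentNamingSketch {
    static {
        // JavaMail reads these once, so set them before the MIME classes are used.
        // Emits the RFC 2231 "filename*=utf-8''..." parameter form.
        System.setProperty("mail.mime.encodeparameters", "true");
        // Makes setFileName() also apply RFC 2047 encoding to the plain filename value.
        System.setProperty("mail.mime.encodefilename", "true");
    }

    static void nameAttachment(MimeBodyPart part, String rawUnicodeName) throws MessagingException {
        // Pass the raw Unicode name; JavaMail encodes the header itself,
        // so there is no need to call MimeUtility.encodeText() by hand.
        part.setFileName(rawUnicodeName);
    }
}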
I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:
String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");
This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.
From the method description in the Javadoc, getBytes() "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array", i.e. I am encoding the string as 8859_1 (ISO Latin-1). And from the constructor description, "Constructs a new String by decoding the specified array of bytes using the specified charset", i.e. it decodes that byte array as UTF-8.
Can someone explain to me why this is necessary?
My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8, the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String ('inputString' in my line of code) by the HttpServletRequest.getParameter() method. This is not the case.
HTTP requests are sent as ISO-8859-1 (POST) or ASCII (GET), which are generally the same. This is part of the URI Syntax specification — thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.
I had also forgotten that the encoding of Greek letters such as α for the request is URL encoding, which produces %CE%B1. The getParameter() method handles this by decoding it as two ISO-8859-1 characters, %CE and %B1, i.e. Î and ± (I checked this).
I now understand why this needs to be turned into a byte array and the bytes interpreted as UTF-8. 0xCE does not represent a one-byte character in UTF-8, so it is combined with the next byte, 0xB1, and interpreted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)
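As a minimal, standalone demonstration of that round trip (not the servlet code itself), decoding the same two bytes first as ISO-8859-1 and then as UTF-8 looks like this:
import java.io.UnsupportedEncodingException;

public class RecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String misread = "\u00CE\u00B1";                  // "Î±", what getParameter() returned
        byte[] rawBytes = misread.getBytes("8859_1");     // back to the original bytes 0xCE 0xB1
        String recoded = new String(rawBytes, "UTF-8");   // decode the same bytes as UTF-8
        System.out.println(recoded);                      // prints α
    }
}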
When decoding, could you not create a class with a decoder method that takes the byte[] as a parameter and returns it as a String? Here is an example that I have used before.
public class Decoder
{
    public String decode(byte[] bytes)
    {
        // Turns the byte array into a String using the platform's default charset
        String decodedString = new String(bytes);
        return decodedString;
    }
}
Try using this instead of .getBytes(). Hope this works.
I am using encodeURIComponent in JavaScript (assuming this does UTF-8 encoding) to encode a variable which could contain characters like =, +, etc. This is sent as a POST to my servlet, where I decode it.
This works well with English, but when used with a Japanese string - "バスケット" - it converts to a special character sequence like this - "ãÂÂã¹ã±ãÂÂãÂÂ"
I am using the following Java 1.6 code to decode it, but it doesn't work -
String ID = java.net.URLDecoder.decode(assignedID,"UTF-8");
where assignedID contains the special character sequence. The above code returns "ãÂÂã¹ã±ãÂÂãÂÂ" again.
In your post, is the string you're sending sent as part of the URL or as part of the POST body? If it is part of the POST body, try adding this to the JSP:
<% request.setCharacterEncoding("UTF-8"); %>
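On the servlet side, the equivalent is a minimal sketch like the following (the class name is hypothetical; the parameter name assignedID is taken from the question). The encoding must be set before the first call to getParameter(), otherwise the container has already parsed the body with its default charset, typically ISO-8859-1:
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DecodeServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Tell the container how to decode the request body before reading any parameter.
        request.setCharacterEncoding("UTF-8");
        String assignedID = request.getParameter("assignedID");
        // assignedID now holds the Japanese text exactly as sent by encodeURIComponent.
    }
}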
I'm trying to get a URL parameter in Java EE.
So I have this kind of URL:
http://MySite/MySite.jsp?page=recherche&msg=toto
First I tried: request.getParameter("msg").toString();
It works well, but if I try to search for "c++", the getParameter() method returns "c" and not "c++", and I understand why.
So I tried another approach. I get the current URL and parse it to extract the value of the message:
String msg[]= request.getQueryString().split("msg=");
message=msg[1].toString();
It now works for the search "c++", but I can't search for accented characters. What can I do?
EDIT 1
I encode the message in the URL:
String urlString=Utils.encodeUrl(request.getParameter("msg"));
So for the URL http://MySite/MySite.jsp?page=recherche&msg=c++
I get this encoded URL: http://MySite/MySite.jsp?page=recherche&msg=c%2B%2B
And when I need it, I decode the message from the URL:
String decodedUrl = URLDecoder.decode(url, "ISO-8859-1");
Thanks, everybody.
Anything you send via the "get" method goes as part of the URL, which needs to be URL-encoded to be valid if it contains any reserved characters. So those characters need to be encoded before sending.
In order to send c++, you would have to send c%2B%2B. That will be interpreted properly on the server side.
Here some reference you can check:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
Now the question is: how and where do you generate your URL? Depending on the language, you will need to use the proper method to encode your strings.
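In Java, a minimal sketch of that encoding step (assuming the link is built server-side; in the browser, encodeURIComponent plays the same role):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryStringSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Reserved characters are percent-encoded, so + survives the round trip.
        // Prints: page=recherche&msg=c%2B%2B
        System.out.println("page=recherche&msg=" + URLEncoder.encode("c++", "UTF-8"));

        // Accented characters become the percent-encoded bytes of their UTF-8 form.
        // Prints: msg=%C3%A9t%C3%A9
        System.out.println("msg=" + URLEncoder.encode("été", "UTF-8"));
    }
}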
if I try to search "c++" , the method "getParameter()" returns "c" and not "c++"
Query parameters are treated as application/x-www-form-urlencoded, so a + character in the URL means a space character in the parameter value. If you want to send a + character then it needs to be encoded in the URL as %2B:
http://MySite/MySite.jsp?page=recherche&msg=c%2B%2B
The same applies to accented characters: they need to be escaped as the bytes of their UTF-8 representation, so été would need to be:
msg=%C3%A9t%C3%A9
(é being Unicode character U+00E9, which is C3 A9 in UTF-8).
In short, it's not the fault of this code, it's the fault of whatever component is responsible for constructing the URL on the client side.
Call your URL with
msg=c%2B%2B
+ in a URL means 'space'. It needs to be escaped.
You need to escape special characters when passing them as URL parameters. Since + means a space and & separates parameters, these characters cannot be used literally in parameter values.
See this other S.O. question.
You may want to use the Apache HTTP client library to help you with the URL encoding/decoding. The URIUtil class has what you need.
Something like this should work:
// URIUtil is org.apache.commons.httpclient.util.URIUtil (Commons HttpClient 3.x); decode() can throw URIException
String rawParam = request.getParameter("msg");
String msgParam = URIUtil.decode(rawParam);
Your example indicates that the data is not being properly encoded on the client side. See this JavaScript question.
I have a JSON response which I want to store in a DB and display in a text view or edit text. This JSON response is encoded as UTF-8.
The response is something like
"currencies": [[0,"RUR"," ",1,0],[1,"EUR","â¬",1.44,100],[2,"GBP","£",1.6,100],[3,"JPY","Â¥",0.0125,100],[4,"AUD","$",1.1,100]]}
where â¬, £, Â¥ are currency symbols. I have to decode these and then display them. These symbols are Unicode characters (transferred as UTF-8). How can I convert these encoded symbols? Please help.
I tried this, but it didn't work:
byte[] b = stringSymbol.getBytes("UTF-8"); // â¬,£,Â¥
final String str = new String(b);
You're showing the text with non-currency symbols... it's as if you're taking the original text, then encoding that as UTF-8, then decoding it as ISO-8859-1.
It's just text - you shouldn't need to do anything to it afterwards, and you should never see it in this broken format. If you have to convert the text back to bytes and then to a string again, that means you've already lost, basically.
Check the headers on the HTTP response which returns the JSON - I suspect you'll find that it's claiming the data is ISO-8859-1 rather than UTF-8. The actual encoding has to match the encoding that's specified in the headers, otherwise you end up with this sort of effect.
Another possibility is that whatever's returning the JSON is accurately giving you the data that it knows about, and that the data is broken upstream. You should follow the data step by step (assuming you own all the links in the chain) until you can see where you're first encountering this brokenness.
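If the bytes really are UTF-8 but are being decoded with the wrong charset somewhere in your own code, a minimal sketch of reading the response with an explicit charset looks like this (the URL handling is a placeholder, not the asker's actual code):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchJsonSketch {
    // Reads the response body, explicitly decoding the bytes as UTF-8 instead of
    // relying on a default (or wrongly advertised) charset.
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
            return json.toString();   // currency symbols such as €, £, ¥ arrive intact
        } finally {
            reader.close();
        }
    }
}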