Java difference between two URL Encoded strings

What is the difference between the following two encoded strings?
%D0%9E%D0%BA%D0%B6%D1%8D%D0%B7
and
%26%231055%3B%26%231088%3B%26%231080%3B%26%231074%3B%26%231077%3B%26%231090%3B
I am trying to URL-encode the Russian text "Привет" into the second encoded string above (the W3Schools encoder does it correctly), but the URL encoder that I am using keeps giving me the first encoded string above. I am using URLUTF8Encoder.java from the W3 Consortium. I have to use this one as I am working on a mobile platform requiring J2ME.
Thanks!

The URL encoder at w3schools is doing it utterly wrong. The %D0%9E%D0%BA%D0%B6%D1%8D%D0%B7 is perfectly valid. That's also what I get when I do
String encoded = URLEncoder.encode("Привет", "UTF-8");
When I URL-decode the w3schools' answer as follows
String decoded = URLDecoder.decode("%26%231055%3B%26%231088%3B%26%231080%3B%26%231074%3B%26%231077%3B%26%231090%3B", "UTF-8");
then I get &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;, which are exactly those Russian characters, but converted into XML entities first.
That w3schools site is, by the way, in no way related to the W3 Consortium. See also w3fools.
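The claims above can be cross-checked on a desktop JDK with a small sketch like the following. It assumes java.net.URLEncoder and URLDecoder are available (they are not part of J2ME), and the class and method names are made up for illustration. Running it suggests that "Привет" itself encodes to %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82, while the first string quoted in the question decodes to the letters "Окжэз", so that string was probably copied from a different input. Either way, plain percent-encoded UTF-8 is the expected form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlEncodingCheck {
    // "Привет" written with Unicode escapes so the result does not depend on source-file encoding.
    static final String PRIVET = "\u041f\u0440\u0438\u0432\u0435\u0442";

    // Percent-encodes the UTF-8 bytes of the text (form-style encoding).
    static String encode(String text) throws UnsupportedEncodingException {
        return URLEncoder.encode(text, "UTF-8");
    }

    // Reverses the percent-encoding, interpreting the bytes as UTF-8.
    static String decode(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(encode(PRIVET));
        // The first string from the question.
        System.out.println(decode("%D0%9E%D0%BA%D0%B6%D1%8D%D0%B7"));
        // The second string from the question: decodes to numeric character references.
        System.out.println(decode(
            "%26%231055%3B%26%231088%3B%26%231080%3B%26%231074%3B%26%231077%3B%26%231090%3B"));
    }
}
```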

Your string "Привет" is encoded as:
%D0%9E
%D0%BA
%D0%B6
%D1%8D
%D0%B7
The second string seems to be converted into HTML entities before url-encoding:
%26%231055%3B
%26%231088%3B
%26%231080%3B
%26%231074%3B
%26%231077%3B
%26%231090%3B
%26 is &, %23 is #, %3B is ;:
&#1055;
&#1088;
&#1080;
&#1074;
&#1077;
&#1090;

Related

Encode to UTF-8. Encode character eg. ö to Ã¶

I want to encode a string in Android to UTF-8. For example this string:
Grüne Ähren beißen Flöhe
to
GrÃ¼ne Ãhren beiÃen FlÃ¶he
But no matter what I do, I end up encoding ü to ü or ü to %C3%BC (often called 'raw URL encode' online).
I found solutions that convert to byte[] or use URI.toASCIIString(), but none of them work for me.
UPDATE
I am participating in the eBay partner network and am trying to append a search word to my partner URL.
The people at eBay must be using a wrong character set, as UTF-8 URL-encoded strings don't work.
A searchword with UTF-8 URL encoding
(Grüne Ähren beißen Flöhe
to
Gr%C3%BCne%20%C3%84hren%20bei%C3%9Fen%20Fl%C3%B6he)
comes out to this result in the eBay searchbox:
If I encode my searchword with ISO_8859_1 it works (GrÃ¼ne Ãhren beiÃen FlÃ¶he):
Thank you very much community

What you essentially want is to convert a String to its byte representation according to UTF-8 and then interpret these bytes using a different Charset, such as ISO-8859-1.
This is usually the cause of many problems: you want to do intentionally what most developers do by mistake (or by simply ignoring charsets altogether).
Since you just need this to work, use this piece of code:
byte[] bytes = "Grüne Ähren beißen Flöhe".getBytes("UTF-8");
String result = new String(bytes, "ISO-8859-1");
see it at work here.
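A slightly fuller sketch of the same trick, assuming StandardCharsets is available (Java 7+, or Android API level 19+), avoids the checked UnsupportedEncodingException and also shows how to undo the conversion. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8AsLatin1 {
    // Encodes the text as UTF-8, then deliberately misreads those bytes as ISO-8859-1.
    static String utf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the step above: recovers the bytes and reads them as UTF-8 again.
    static String undo(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Gr\u00fcne \u00c4hren bei\u00dfen Fl\u00f6he";
        String garbled = utf8BytesAsLatin1(original);
        System.out.println(garbled);
        System.out.println(undo(garbled).equals(original)); // true: ISO-8859-1 maps every byte to one char
    }
}
```

Because ISO-8859-1 assigns a character to every byte value, the conversion is lossless in Java, even though some of the resulting characters are invisible control characters.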

String encoding - Shift_JIS / UTF-8

I get a string from a 3rd party library, which is not well encoded.
Unfortunately I'm not allowed to change the library or use another one...
So the actual problem is that the 3rd party library encodes characters like "è ò à ù ì ä ö ü, ..." as SHIFT_JIS (Kanji) inside a UTF-8 string, but only if the character is attached to a word and isn't standalone.
For example:
"Ö Just a simple test"
"ÖJust a simple test"
I tried the following without success:
byte[] b = resultString.getBytes("Shift_JIS");
String value = new String(b, "UTF-8");
UPDATE 1:
That's the content of "resultString".
Note:
The byte array shown is without any modifications (such as getBytes("Shift_JIS")); it's just the resultString as bytes.
Do you have any ideas?
Any help would be greatly appreciated.
Thank you.

Well, very strange:
As
byte[] b = resultString.getBytes("Shift_JIS");
String value = new String(b, "UTF-8");
didn't work for me, I tried the following:
String value = new String(resultString.getBytes("SHIFT-JIS"), "UTF-8");
Works like a charm.
Maybe it was because of the underscore and lower-case characters in "Shift_JIS".

URL decoding of JS-encoded Japanese characters in Java 6

I am using encodeURIComponent in JavaScript (assuming this does UTF-8 encoding) to encode a variable which could contain characters like =, +, etc. This is sent as a POST to my servlet, where I decode it.
This works well with English but when used with Japanese string - "バスケット", this converts to some special character sequence like this - "Ã£ÂÂÃ£ÂÂ¹Ã£ÂÂ±Ã£ÂÂÃ£ÂÂ"
I am using the following Java 1.6 code to decode it, but it doesn't work:
String ID = java.net.URLDecoder.decode(assignedID,"UTF-8");
where assignedID contains special character sequence. The above code returns me - "Ã£ÂÂÃ£ÂÂ¹Ã£ÂÂ±Ã£ÂÂÃ£ÂÂ"

Is the string being sent as part of the URL or as part of the POST body? It is most likely part of the POST body, so try adding (to the JSP):
<% request.setCharacterEncoding("UTF-8"); %>
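Why the charset matters can be illustrated outside a servlet container. The sketch below is an illustration under assumptions, not the container's actual code path: it decodes the percent-encoded form of the first character of the example, バ (U+30D0), once as UTF-8 and once as ISO-8859-1, which is the default a servlet container typically applies to a POST body when no encoding is set. The class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeCharsetDemo {
    // What encodeURIComponent produces for the katakana letter BA (U+30D0): its three UTF-8 bytes.
    static final String ENCODED = "%E3%83%90";

    // Decodes the same percent-encoded bytes with the given charset name.
    static String decodeWith(String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(ENCODED, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String good = decodeWith("UTF-8");      // one character, U+30D0
        String bad = decodeWith("ISO-8859-1");  // three characters, one per byte
        System.out.println(good.length() + " char(s), first code point U+"
            + Integer.toHexString(good.codePointAt(0)).toUpperCase());
        System.out.println(bad.length() + " char(s), starting with U+"
            + Integer.toHexString(bad.codePointAt(0)).toUpperCase());
    }
}
```

Once the parameter has already been decoded with the wrong charset, calling URLDecoder.decode on it again cannot repair it, which matches the behaviour described in the question. Note that request.setCharacterEncoding only takes effect if it is called before any request parameter is read.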

Different behavior when space is encoded as + and %20 in a URL

Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Gives 404. Please note "Press Releases" is encoded as "Press%20Releases".
However, the following two versions work fine, where "Press Releases" is encoded as "Press+Releases".
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article parses fine with plus signs or HEX spaces %20.
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
Also, what could I use in Java to get the correctly encoded URL?

Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.

Officially, + may represent a space only in the query string (after ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
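The more general class URI follows the specification and replaces spaces with %20.
// Note: this four-argument constructor is (scheme, host, path, fragment),
// so the last argument ends up after '#' rather than as a query string.
URI uri = new URI("http", "www.streetinsider.com",
"/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
"?x=ŝi estas ĉarma");
String u = uri.toString();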
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html#?x=ŝi%20estas%20ĉarma
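The "#" in that output appears because the four-argument URI constructor takes (scheme, host, path, fragment). If a real query string is wanted, a sketch using the five-argument constructor (scheme, authority, path, query, fragment) could look like this. The class and helper names are hypothetical, and the query is passed without the leading "?".

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriQueryDemo {
    // Builds an http URI on the host from the question; illegal characters such as spaces are percent-quoted.
    static URI build(String path, String query) throws URISyntaxException {
        return new URI("http", "www.streetinsider.com", path, query, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // "ŝi estas ĉarma" written with Unicode escapes to stay independent of source-file encoding.
        URI uri = build("/Press Releases/9778767.html", "x=\u015di estas \u0109arma");
        System.out.println(uri.toString());      // spaces become %20, non-ASCII letters are kept
        System.out.println(uri.toASCIIString()); // non-ASCII letters are percent-encoded as UTF-8 too
    }
}
```

toASCIIString() is usually the form to put on the wire, since it contains only US-ASCII characters.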
One sometimes encounters URI as a generalisation of File and other resources, and then has to be careful not to introduce %20 into file names.
So streetinsider probably remaps + (and apparently even %20) in part of the path so that both URLs reach the same code.

Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
RFC 1866 (the HTML 2.0 specification), section 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".
Here is an example of such a string in URL where RFC-1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses (in other cases, spaces should be encoded to %20). This way of encoding form data is also given in later HTML specifications, for example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
The URL that you have provided is not form data containing key/value pairs; it's just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So pluses cannot be used to represent spaces here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
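The distinction can be seen directly in the JDK. As a small sketch (class and method names are made up for illustration): URLDecoder implements application/x-www-form-urlencoded decoding and therefore turns "+" into a space, while URI.getPath() only resolves percent-escapes and leaves "+" as a literal plus.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

class PlusVersusPercent20 {
    // Form-style decoding: both "+" and "%20" become a space.
    static String formDecode(String value) throws UnsupportedEncodingException {
        return URLDecoder.decode(value, "UTF-8");
    }

    // Path decoding: only percent-escapes are resolved; "+" stays a plus sign.
    static String decodedPath(String url) throws URISyntaxException {
        return new URI(url).getPath();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(formDecode("foo+bar"));   // foo bar
        System.out.println(formDecode("foo%20bar")); // foo bar
        System.out.println(decodedPath("http://example.com/Press+Releases/a%20b.html"));
        // /Press+Releases/a b.html
    }
}
```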

strange characters(?) added to the end of my subject text

I have a problem with my Java code that sends email to users: the encoding of the subject line is wrong. When the email arrives in the recipient's account, the subject line ($subject) has strange characters (?) in my subject text.
The message content itself is fine; only the subject line is affected. After switching to Unicode and the content type text/html, the mail body has no problem with special characters (ó), but the same fix does not work for the subject line.
I have a class that sends an email with JavaMail, with a subject like this one:
"Estimado Iván Escobedo:
The problem is that when the mail arrives at its destination, it looks like this:
"Estimado Iv?n Escobedo:
All the special characters such as á, é, í, ó, ú are replaced with ?.
What could be the problem, and how can I solve it?

You should use something like this to read the message properly:
TextMessage txtMessage = (TextMessage) message;
ByteArrayInputStream bais = new ByteArrayInputStream(txtMessage.getText().getBytes("ISO-8859-15"));
Edit:
Sanjay found the solution.
To encode the subject properly before sending, use:
MimeUtility.encodeText(SubjectText, "ISO-8859-15", "Q")
encodeText :
Encode a RFC 822 "text" token into mail-safe form as per RFC 2047.
The given Unicode string is examined for non US-ASCII characters. If the string contains only US-ASCII characters, it is returned as-is. If the string contains non US-ASCII characters, it is first character-encoded using the specified charset, then transfer-encoded using either the B or Q encoding. The resulting bytes are then returned as a Unicode string containing only ASCII characters.
Note that this method should be used to encode only "unstructured" RFC 822 headers.
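To make the mechanism concrete without depending on JavaMail, here is a hand-rolled sketch using only the JDK (Java 8+ for java.util.Base64). It is not JavaMail's implementation: it assumes UTF-8 with the "B" transfer encoding rather than ISO-8859-15 with "Q", it ignores the 75-character limit that RFC 2047 places on an encoded word, and the class and method names are hypothetical. It shows both the symptom (non-ASCII characters degrading to "?") and the shape of an RFC 2047 encoded word.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodedWordSketch {
    // Reproduces the symptom: characters that US-ASCII cannot represent are replaced by '?'.
    static String forcedToAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    // Builds a single RFC 2047 encoded word: =?charset?B?base64-of-bytes?=
    // Illustration only; long subjects would have to be split into several encoded words.
    static String encodedWord(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        String name = "Iv\u00e1n"; // "Iván"
        System.out.println(forcedToAscii(name)); // Iv?n
        System.out.println(encodedWord(name));   // =?UTF-8?B?SXbDoW4=?=
    }
}
```

In real code, MimeUtility.encodeText, or the MimeMessage.setSubject overload that takes a charset, does this work and handles line folding as well.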
