URLDecoder giving unexpected value on UTF-8 URL parameter - Java

I'm using java.net.URLDecoder to decode a URL parameter that is supposed to be encoded in UTF-8. A quick test reveals I'm getting ? instead of ∩ in the output. Here's the code:
System.out.println(java.net.URLDecoder.decode("A%E2%88%A9B%0AYour+answer+is%3A+3", "UTF-8"));
And as output I'm getting:
A?B
Your answer is: 3
When I plug the string A%E2%88%A9B%0AYour+answer+is%3A+3 into online web decoders, they get it right:
A∩B
Your answer is: 3
Does anyone know what I'm doing wrong? Is this not actually UTF-8? The string is coming from com.google.gwt.http.client.URL.encodeQueryString(), which claims UTF-8 encoding.

As Siguza and VGe0rge pointed out, the Java code was running correctly; it's Eclipse's console that does not display UTF-8 by default. A solution to that issue can be found here.
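To confirm that the decode itself succeeded, independently of any console settings, you can inspect the code points rather than printing the characters. A minimal sketch (the class name is illustrative):

import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        String s = URLDecoder.decode("A%E2%88%A9B%0AYour+answer+is%3A+3", "UTF-8");
        // Print code points instead of characters so a misconfigured console
        // cannot distort the result; %E2%88%A9 should decode to U+2229 (∩).
        s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
    }
}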

Related

french accents giving "<?>" in http responses with correct charset (Java)

Calling an API that returns French sentences, all the accented characters are displayed as <?> in my Java code, even though the charset is declared correctly (application/json;charset=iso-8859-1).
Using Postman or my web browser, I don't face any problem.
I also tried to call the API with a Content-Type header of application/json;charset=UTF-8 or application/json;charset=iso-8859-1, but the problem remains the same.
Any idea?
response.getBody() gives:
{"sentences":[{"fr_value":"il �tait loin","dz_value":"kaan b3id","additional_information":{"personal_prounoun":"HE","verb":"�tre","adjective":"loin","tense":"pass�"}}],"count":1}
new String(response.getBody().getBytes(StandardCharsets.UTF_8)) gives exactly the same.
I'm using scribejava.
Edit: even saving the response to a file and opening it with Notepad++, the result is similar.
You need to read it as ISO-8859-1. Not sure what comes after that, as I don't know what you're doing downstream. My https://technojeeves.com/index.php/aliasjava1/51-transcode-in-java is helpful. With wget:
wget -O - us-central1-dz-dialect-api.cl… | xcode -ie Latin1
('xcode' is an alias I made to invoke that Java app)
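A minimal sketch of that repair in Java, assuming scribejava's Response exposes the raw stream via getStream() and a Java 9+ runtime for readAllBytes(): once a Latin-1 byte such as 0xE9 ('é') has been mis-decoded as UTF-8 it becomes U+FFFD and the original byte is lost, so the fix has to start from the raw bytes.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import com.github.scribejava.core.model.Response;

static String readBodyAsLatin1(Response response) throws IOException {
    // Decode the raw bytes with the charset the server actually declared
    // (ISO-8859-1) instead of letting a default/wrong decode happen first.
    try (InputStream in = response.getStream()) {
        return new String(in.readAllBytes(), StandardCharsets.ISO_8859_1);
    }
}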
Problem solved using the following code:
httpResponse.setContentType("application/json;charset=UTF-8");
mapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);

Unexpected output from URLEncoder in Java

I am trying to encode a URL parameter.
For example, when I encode
qOddENxeLxL+13drGKYUgA==\n
using a URL encoder tool,
it gives the following output, which works when I call the API:
qOddENxeLxL%2B13drGKYUgA%3D%3D%5Cn
But when I encode the URL from my Java code (Android) using
URLEncoder.encode("qOddENxeLxL+13drGKYUgA==\n", "UTF-8");
it gives me the following result:
qOddENxeLxL%252B13drGKYUgA%253D%253D%250A
I tried other encoding schemes too but could not produce the same result.
The issue is that the \n is being interpreted as a newline character: Java treats \ inside a string literal as the start of an escape sequence.
You have to escape it in order to get the same thing as in the URL you provided.
System.out.println(URLEncoder.encode("qOddENxeLxL+13drGKYUgA==\\n", "UTF-8"));
This will provide the same result:
qOddENxeLxL%2B13drGKYUgA%3D%3D%5Cn
The issue is that you are feeding \n to the URLEncoder tool, which doesn't understand it as an escape sequence and so gives you %5Cn, and to the Java compiler inside a string literal, which does understand it and so gives you 0x0A.
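A minimal, self-contained way to see both behaviors side by side (the class name is illustrative):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class NewlineEncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "\n" in a Java literal is a real newline (0x0A), encoded as %0A
        System.out.println(URLEncoder.encode("a\nb", "UTF-8"));   // prints a%0Ab
        // "\\n" is a backslash followed by 'n', encoded as %5Cn
        System.out.println(URLEncoder.encode("a\\nb", "UTF-8"));  // prints a%5Cnb
    }
}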
Figured out the issue: the string was getting encoded twice.
When passing a parameter to a Retrofit call, Retrofit encodes it automatically, and I was passing an already-encoded parameter, so it got encoded again (see the @Query sketch after this thread).
BTW thanks for the explanations. :)
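For reference, a hedged sketch of how that double encoding is avoided with Retrofit: @Query values are percent-encoded by Retrofit itself, and values that are already encoded can be flagged with encoded = true (the interface and endpoint names are illustrative).

import okhttp3.ResponseBody;
import retrofit2.Call;
import retrofit2.http.GET;
import retrofit2.http.Query;

public interface Api {
    // Pass the raw value; Retrofit percent-encodes it exactly once.
    @GET("verify")
    Call<ResponseBody> verify(@Query("token") String rawToken);

    // If the value is already percent-encoded, say so to avoid double encoding.
    @GET("verify")
    Call<ResponseBody> verifyPreEncoded(@Query(value = "token", encoded = true) String encodedToken);
}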

Gradle/Eclipse: Different behavior of german "Umlaute" when using equality?

I am experiencing weird behavior with German "Umlaute" (ä, ö, ü, ß) when using Java's equality checks (either directly or indirectly).
Everything works as expected when running, debugging or testing from Eclipse and input containing "Umlaute" is treated as equal or not as expected.
However when I build the application using Spring Boot and run it, these equality checks fail for words that contain "Umlaute", i.e. for words like "Nationalität".
Input is retrieved from a webpage via Jsoup and content of a table is extracted for some keywords. The encoding of the page is UTF-8 and I have handling in place for Jsoup to convert it if this is not the case.
The encoding of the source files is UTF-8 as well.
Connection connection = Jsoup.connect(url)
        .header("accept-language", "de-de, de, en")
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .method(Method.GET);
Response response = connection.execute();
if (logger.isDebugEnabled())
    logger.debug("Encoding of response: " + response.charset());
Document doc;
if (response.charset().equalsIgnoreCase("UTF-8"))
{
    logger.debug("Response has expected charset");
    doc = Jsoup.parse(response.body(), baseURL);
}
else
{
    logger.debug("Response doesn't have expected charset and is converted");
    doc = Jsoup.parse(new String(response.bodyAsBytes(), "UTF-8"), baseURL);
}
logger.debug("Encoding of document: " + doc.charset());
if (!doc.charset().equals(Charset.forName("UTF-8")))
{
    logger.debug("Changing encoding of document from " + doc.charset());
    doc.updateMetaCharsetElement(true);
    doc.charset(Charset.forName("UTF-8"));
    logger.debug("Changed encoding of document to: " + doc.charset());
}
return doc;
Example log output (from deployed app) of reading content.
Encoding of response: utf-8
Response has expected charset
Encoding of document: UTF-8
Example input:
<tr><th>Nationalität:</th> <td> [...] </td> </tr>
Example code that fails for words containing ä, ö, ü or ß but works fine for other words:
Element header = row.select("th").first();
String text = header.ownText();
if ("Nationalität:".equals(text))
{
    // goes here in eclipse
}
else
{
    // and here in deployed spring boot app
}
Is there any difference between running from Eclipse and a built & deployed app that I am missing? Where else could this behavior come from, and how can it be resolved?
As far as I can see this is not (directly) an encoding issue since the input shows "Umlaute" correctly...
Since this is not reproducible when debugging, I am having a hard time figuring out what exactly goes wrong.
Edit: While input looks fine in logs (i.e. diacritics show up correctly) I realized that they don't look correct in the console:
<th>Nationalität:</th>
I am currently using a Normalizer as suggested by Mirko like this:
Normalizer.normalize(input, Form.NFC);
(also tried it with NFD).
How do the (Spring Boot) console and the (logback) log output differ?
Diacritics like umlauts can often be represented in two different ways in unicode: As a single-codepoint character or as a composition of two characters. This isn't a problem of the encoding, it can happen in UTF-8, UTF-16, UTF-32 etc.
Java's equals method may not consider composite characters equal to single-codepoint characters, even though they look exactly the same.
Try to have a look at the binary representation of the strings you are comparing, this way you should be able to track down the differences.
You could also use the methods of the "Character" class to iterate through the strings and print out the properties of all the characters. Maybe this helps, too, to figure out differences.
In any case, it could help if you use java.text.Normalizer on both "sides" of the "equals", to normalize the text to, for example, Unicode Normalization Form C. This way, differences like the aforementioned should be straightened out and the strings should compare as expected.
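A minimal sketch of that comparison (the decomposed string is built explicitly with a combining diaeresis):

import java.text.Normalizer;

public class UmlautCompare {
    public static void main(String[] args) {
        String composed = "Nationalit\u00e4t:";    // ä as one code point, U+00E4
        String decomposed = "Nationalita\u0308t:"; // 'a' + combining diaeresis U+0308
        System.out.println(composed.equals(decomposed)); // false: different code points
        // After normalizing both sides to NFC they compare equal:
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
    }
}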
Have you tried printing the keycode to console to see if they actually match when compiled? Maybe Eclipse is handling the charset gracefully but when it's compiled it's down to some Java/System settings?
I think I tracked this down to the build of the standalone app being the culprit.
As described above, when running from Eclipse all is fine, the problem only occurred when I ran the standalone Spring Boot app.
This is being built with Gradle. In my build.gradle I have
compileJava.options.encoding = 'UTF-8'
in order to force UTF-8 for encoding. This should (usually) be enough. However, I also use AspectJ (via the gradle-aspectj plugin), which apparently breaks this behavior (involuntarily?) and results in the default encoding being used instead of the one explicitly defined.
In order to solve this I added
compileAspect {
    additionalAjcArgs = ['encoding' : 'UTF-8']
}
to my build.gradle which passes the encoding option on to the ajc compiler. This seems to have fixed the problem for the regular build.
The problem still occurs, however, when tests are run from Gradle. I was not yet able to find out what needs to be done there and why the above configuration is not enough.
This is now tracked in a separate question.
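As a general Gradle-side guard (not confirmed to fix the AspectJ test path above), one can force UTF-8 on every Java compile task and pass it to the forked test JVM; a hedged sketch for build.gradle:

tasks.withType(JavaCompile) {
    // Force UTF-8 for all compile tasks, including compileTestJava
    options.encoding = 'UTF-8'
}
test {
    // Gradle forks the test JVM, so this sets file.encoding at JVM startup
    systemProperty 'file.encoding', 'UTF-8'
}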

UTF-8 encoding of GET parameters in JSF

I have a search form in JSF that is implemented using a RichFaces 4 autocomplete component and the following JSF 2 page and Java bean. I use Tomcat 6 & 7 to run the application.
...
<h:commandButton value="#{msg.search}" styleClass="search-btn" action="#{autoCompletBean.doSearch}" />
...
In the AutoCompleteBean
public String doSearch() {
    //some logic here
    return "/path/to/page/with/multiple_results?query=" + searchQuery + "&faces-redirect=true";
}
This works well as long as everything within the "searchQuery" String is in Latin-1; it does not work if it is outside of Latin-1.
For instance, a search for "bodø" will be automatically encoded as "bod%F8". However, a search for "Kra Ðong" will not work, since it is unable to encode "Ð".
I have now tried several different approaches to solve this, but none of them works.
I have tried encoding the searchQuery myself using URLEncoder, but this only leads to double encoding, since % is encoded as %25.
I have tried using java.net.URI to get the encoding, but it gives the same result as URLEncoder.
I have tried turning on UTF-8 in Tomcat using URIEncoding="UTF-8" in the Connector, but this only worsens the problem, since then non-ASCII characters do not work at all.
So to my questions:
Can I change the way JSF 2 encodes the GET parameters?
If I cannot change the way JSF 2 encodes the GET parameters, can I turn off the encoding and do it manually?
Am I doing something strange here? This seems like something that should be supported out of the box, but I cannot find anyone else with the same problem.
I think you've hit a corner case bug in JSF. The query string is URL-encoded by ExternalContext#encodeRedirectURL(), which uses the response character encoding as obtained by ExternalContext#getResponseCharacterEncoding(). However, while JSF by default uses UTF-8 as the response character encoding, this is only set if the view is actually to be rendered, not when the response is to be redirected, so the response character encoding still returns the platform default of ISO-8859-1, which causes your characters to be URL-encoded using this wrong encoding.
I've reported this as issue 2440. In the meantime, your best bet is to explicitly set the response character encoding yourself beforehand:
FacesContext.getCurrentInstance().getExternalContext().setResponseCharacterEncoding("UTF-8");
Note that this still requires that the container itself uses the same character encoding to decode the request URL, so you certainly need to set URIEncoding="UTF-8" in Tomcat's configuration. This won't mess up the characters anymore as they will be really UTF-8 now.
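Put together with the bean method from the question, the workaround might look like this (a sketch; method shape and the navigation outcome mirror the question):

public String doSearch() {
    // Force UTF-8 before the redirect so encodeRedirectURL() uses it
    FacesContext.getCurrentInstance().getExternalContext()
                .setResponseCharacterEncoding("UTF-8");
    //some logic here
    return "/path/to/page/with/multiple_results?query=" + searchQuery + "&faces-redirect=true";
}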
The only character encoding accepted for HTTP URLs and headers is US-ASCII, so you need to URL-encode these characters to send them back to the application. The simplest way to do this in Java would be:
public String doSearch() {
    //some logic here
    String encodedSearchQuery = java.net.URLEncoder.encode(searchQuery, "UTF-8");
    return "/path/to/page/with/multiple_results?query=" + encodedSearchQuery + "&faces-redirect=true";
}
And then it should work for any character that you use.

Converting XML to JSON results in unknown characters when running on CentOS instead of Windows

I have a Java servlet which fetches RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic, and it shows unintelligible characters on CentOS. I am using these lines to re-encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on GlassFish and Tomcat. Both have the same problem: it works on Windows but fails on CentOS. What causes this, and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
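A minimal sketch of that last suggestion, letting the XML parser pick up the encoding from the document itself (the feed URL and class name are placeholders):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedFetch {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(String uri) reads the bytes itself and honours the
        // <?xml ... encoding="..."?> declaration, so no manual decoding is needed.
        Document doc = builder.parse("https://example.com/feed.xml"); // placeholder URL
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}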
There are many places where the wrong encoding can be used. Here is a complete list: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
