I am experiencing weird behavior with German "Umlaute" (ä, ö, ü, ß) when using Java's equality checks (either directly or indirectly).
Everything works as expected when running, debugging, or testing from Eclipse: input containing "Umlaute" is treated as equal (or not) exactly as expected.
However, when I build the application using Spring Boot and run it, these equality checks fail for words that contain "Umlaute", e.g. words like "Nationalität".
Input is retrieved from a webpage via Jsoup and content of a table is extracted for some keywords. The encoding of the page is UTF-8 and I have handling in place for Jsoup to convert it if this is not the case.
The encoding of the source files is UTF-8 as well.
Connection connection = Jsoup.connect(url)
        .header("accept-language", "de-de, de, en")
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .method(Method.GET);
Response response = connection.execute();
if (logger.isDebugEnabled())
    logger.debug("Encoding of response: " + response.charset());
Document doc;
if (response.charset().equalsIgnoreCase("UTF-8"))
{
    logger.debug("Response has expected charset");
    doc = Jsoup.parse(response.body(), baseURL);
}
else
{
    logger.debug("Response doesn't have expected charset and is converted");
    doc = Jsoup.parse(new String(response.bodyAsBytes(), "UTF-8"), baseURL);
}
logger.debug("Encoding of document: " + doc.charset());
if (!doc.charset().equals(Charset.forName("UTF-8")))
{
    logger.debug("Changing encoding of document from " + doc.charset());
    doc.updateMetaCharsetElement(true);
    doc.charset(Charset.forName("UTF-8"));
    logger.debug("Changed encoding of document to: " + doc.charset());
}
return doc;
Example log output (from the deployed app) when reading the content:
Encoding of response: utf-8
Response has expected charset
Encoding of document: UTF-8
Example input:
<tr><th>Nationalität:</th> <td> [...] </td> </tr>
Example code that fails for words containing ä, ö, ü or ß but works fine for other words:
Element header = row.select("th").first();
String text = header.ownText();
if ("Nationalität:".equals(text))
{
    // goes here in Eclipse
}
else
{
    // and here in the deployed Spring Boot app
}
Is there any difference between running from Eclipse and a built & deployed app that I am missing? Where else could this behavior come from, and how can it be resolved?
As far as I can see this is not (directly) an encoding issue since the input shows "Umlaute" correctly...
Since this is not reproducible when debugging, I am having a hard time figuring out what exactly goes wrong.
Edit: While the input looks fine in the logs (i.e. diacritics show up correctly), I realized that it doesn't look correct in the console:
<th>Nationalität:</th>
I am currently using a Normalizer as suggested by Mirko like this:
Normalizer.normalize(input, Form.NFC);
(also tried it with NFD).
How do the (Spring Boot) console output and the (Logback) log output differ?
Diacritics like umlauts can often be represented in two different ways in Unicode: as a single-code-point character or as a composition of two characters. This isn't a problem of the encoding; it can happen in UTF-8, UTF-16, UTF-32, etc.
Java's equals method does not consider composed characters equal to their single-code-point forms, even though they look exactly the same.
Try to have a look at the binary representation of the strings you are comparing; this way you should be able to track down the differences.
You could also use the methods of the "Character" class to iterate through the strings and print out the properties of all the characters. Maybe this helps, too, to figure out the differences.
In any case, it could help to use java.text.Normalizer on both "sides" of the "equals" to normalize the text to, for example, Unicode Normalization Form C. This way, differences like the aforementioned should be straightened out and the strings should compare as expected.
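Not part of the original answer, but as a minimal illustration of both suggestions (comparing NFC-normalized strings and dumping code points); the class and variable names are made up:

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class UmlautCompare {
    public static void main(String[] args) {
        String precomposed = "Nationalit\u00e4t:";   // 'ä' as a single code point (U+00E4)
        String decomposed = "Nationalita\u0308t:";   // 'a' followed by combining diaeresis (U+0308)

        // Prints false: the strings look identical but contain different code points.
        System.out.println(precomposed.equals(decomposed));

        // After normalizing both sides to NFC they compare equal.
        String a = Normalizer.normalize(precomposed, Form.NFC);
        String b = Normalizer.normalize(decomposed, Form.NFC);
        System.out.println(a.equals(b));   // true

        // Dump the code points to see exactly what a string contains.
        decomposed.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}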
Have you tried printing the character codes to the console to see if they actually match when compiled? Maybe Eclipse is handling the charset gracefully, but when it's compiled it comes down to some Java/system settings?
I think I tracked this down to the build of the standalone app being the culprit.
As described above, when running from Eclipse all is fine, the problem only occurred when I ran the standalone Spring Boot app.
This is being built with Gradle. In my build.gradle I have
compileJava.options.encoding = 'UTF-8'
in order to force UTF-8 to be used for compilation. This should (usually) be enough. However, I also use AspectJ (via the gradle-aspectj plugin), which apparently breaks this behavior (involuntarily?) and results in the platform default encoding being used instead of the one explicitly defined.
In order to solve this I added
compileAspect {
additionalAjcArgs = ['encoding' : 'UTF-8']
}
to my build.gradle which passes the encoding option on to the ajc compiler. This seems to have fixed the problem for the regular build.
The problem still occurs, however, when tests are run from Gradle. I have not yet been able to find out what needs to be done there and why the above configuration is not enough.
This is now tracked in a separate question.
Related
I'm working with iText5 to parse a pdf written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text appears as gibberish.
I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all possible charsets using Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.
Now I thought I could reverse the text line by line, but Hebrew is right-to-left while numbers and English are left-to-right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.
Example:
PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס
if I reverse this then I get סה"כ ניכויי התחייבות 55.78
The number should be 87.55 and not 55.78
The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL
I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this document; the last paragraph of the document has exactly the same problem I have in my documents.
I can only analyze the data given, so in this case only the linked government paper, from which the Hebrew text of the last paragraph is extracted as
ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø
And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!
Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead, the problem is the document, which lies about its contents.
In more detail
The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:
28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange
I.e. all codes are mapped to Unicode values between U+0020 and U+00F9, a Unicode range in which the Hebrew characters one sees in that paragraph clearly are not located. More exactly: aside from space, some punctuation, and digits (which are extracted correctly), the values are in the range between U+00E0 and U+00F9, a region where Latin letters with accents and their ilk are located.
You mention that in some case you could retrieve the Hebrew text by applying
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))
So probably the PDF creation tool has put mappings to the Windows-1255 codepage into the ToUnicode map, which obviously is wrong; the ToUnicode map must contain mappings to Unicode.
That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This indeed is a limitation of iText 5.x text extraction; it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.
In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
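As a rough illustration only (this is not the method from the linked answer, the class name is made up, and it ignores bidi subtleties such as punctuation placement and bracket mirroring): reverse the whole extracted line, then flip back each run of digits/Latin characters, assuming the line contains only Hebrew plus such simple LTR fragments:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveRtlFixer {
    // Maximal runs of LTR material: digits, Latin letters, and the dot inside numbers.
    private static final Pattern LTR_RUN = Pattern.compile("[0-9A-Za-z.]+");

    public static String fixLine(String extracted) {
        // Reversing the whole line puts the Hebrew into logical order...
        String reversed = new StringBuilder(extracted).reverse().toString();

        // ...but also reverses numbers/Latin text, so flip those runs back.
        Matcher m = LTR_RUN.matcher(reversed);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(reversed, last, m.start());
            out.append(new StringBuilder(m.group()).reverse());
            last = m.end();
        }
        out.append(reversed.substring(last));
        return out.toString();
    }
}

A full bidi implementation such as ICU's Bidi class (used in the last answer of this question) handles the general case far more robustly.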
First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255); try to play with this. I would also try to extract the String using the charset UTF-8. In addition, there is a great diagnostic tool that helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic: the open-source Java library MgntUtils has a utility that converts Strings to Unicode sequences and vice versa:
result = "שלום את";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את
Here is the Javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode code points for Hebrew are U+05**, where the first Hebrew letter (Alef, א) is U+05D0 and the last Hebrew letter (Tav, ת) is U+05EA. The library can be found at Maven Central or on GitHub; it comes as a Maven artifact with sources and Javadoc. So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. In any case, I would strongly recommend this utility, as it has helped me many times.
Using ICU did the job:
import com.ibm.icu.text.Bidi;   // ICU4J

Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);
StringBuffer messageText = new StringBuffer();
messageText.append("<style type=\"text/css\">" +
"#message p {some style }" +
"</style>");
messageText.append("<p>");
messageText.append("abc’s email level…def");
messageText.append("</p>");
message.setContent(messageText.toString(), "text/html;");
Transport.send(message);
When I ran the code, I found two different variations of the output.
I first typed the message abc’s email level…def in Microsoft Word and then copied it into the Eclipse editor. When I run the program, the message that arrives in the email is something different, like abc?s email level?def.
But when I type the message abc’s email level…def directly in the Eclipse editor, then I see the same message in the email.
What should I change in the code to receive the same message in the email even if I copy something from Microsoft Word?
This is almost certainly an encoding problem between your editors (MS Word and Eclipse, in this case) and your program. You'll want to verify that the content you are copying and pasting from MS Word to Eclipse is UTF-8 on both sides; I suspect that it is not.
The commenter is right that this problem is probably Microsoft's smart quotes, which don't generally paste correctly. You can write a regular expression to replace them, but this is a specific workaround for those particular characters and will not handle the generic case.
The root cause is almost certainly an encoding mismatch between what you are pasting from MS Word and what your Java code expects. Check your Eclipse settings to verify you are using UTF-8 as the default, and check your Word settings to verify the source document is also UTF-8.
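For what it's worth, here is a minimal, hypothetical sketch of such a replacement (the class and method names are made up, and the character list is illustrative, not exhaustive):

public class SmartQuoteCleaner {
    // Map the common Windows "smart" punctuation to plain ASCII equivalents.
    public static String toAscii(String s) {
        return s.replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'')   // right single quotation mark
                .replace('\u201C', '"')    // left double quotation mark
                .replace('\u201D', '"')    // right double quotation mark
                .replace("\u2026", "...")  // horizontal ellipsis
                .replace('\u2013', '-')    // en dash
                .replace('\u2014', '-');   // em dash
    }

    public static void main(String[] args) {
        // Prints: abc's email level...def
        System.out.println(toAscii("abc\u2019s email level\u2026def"));
    }
}

The cleaner fix is still to make the source file encoding (Eclipse) and the mail content charset agree, e.g. by declaring "text/html; charset=UTF-8" when calling setContent.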
I have a search form in JSF that is implemented using a RichFaces 4 autocomplete component and the following JSF 2 page and Java bean. I use Tomcat 6 & 7 to run the application.
...
<h:commandButton value="#{msg.search}" styleClass="search-btn" action="#{autoCompletBean.doSearch}" />
...
In the AutoCompleteBean
public String doSearch() {
    // some logic here
    return "/path/to/page/with/multiple_results?query=" + searchQuery + "&faces-redirect=true";
}
This works well as long as everything within the "searchQuery" String is in Latin-1; it does not work if it is outside of Latin-1.
For instance a search for "bodø" will be automatically encoded as "bod%F8". However a search for "Kra Ðong" will not work since it is unable to encode "Ð".
I have now tried several different approaches to solve this, but none of them works.
I have tried encoding the searchQuery myself using URLEncoder, but this only leads to double encoding since % is encoded as %25.
I have tried using java.net.URI to get the encoding, but it gives the same result as URLEncoder.
I have tried turning on UTF-8 in Tomcat using URIEncoding="UTF-8" in the Connector, but this only worsens the problem since then non-ASCII characters do not work at all.
So to my questions:
Can I change the way JSF 2 encodes the GET parameters?
If I cannot change the way JSF 2 encodes the GET parameters, can I turn off the encoding and do it manually?
Am I doing something strange here? This seems like something that should be supported out of the box, but I cannot find any others with the same problem.
I think you've hit a corner-case bug in JSF. The query string is URL-encoded by ExternalContext#encodeRedirectURL(), which uses the response character encoding as obtained by ExternalContext#getResponseCharacterEncoding(). However, while JSF by default uses UTF-8 as the response character encoding, this is only set if the view is actually to be rendered, not when the response is to be redirected, so the response character encoding still returns the platform default of ISO-8859-1, which causes your characters to be URL-encoded using this wrong encoding.
I've reported this as issue 2440. In the meanwhile your best bet is to explicitly set the response character encoding yourself beforehand.
FacesContext.getCurrentInstance().getExternalContext().setResponseCharacterEncoding("UTF-8");
Note that this still requires that the container itself uses the same character encoding to decode the request URL, so you certainly need to set URIEncoding="UTF-8" in Tomcat's configuration. This won't mess up the characters anymore as they will be really UTF-8 now.
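Applied to the bean from the question, the workaround could look roughly like this (a sketch only; doSearch and searchQuery are the names from the question, and FacesContext is javax.faces.context.FacesContext):

public String doSearch() {
    // some logic here

    // Work around the redirect URL being encoded with the platform default (issue 2440):
    // force the response character encoding before JSF builds the redirect URL.
    FacesContext.getCurrentInstance().getExternalContext()
                .setResponseCharacterEncoding("UTF-8");

    return "/path/to/page/with/multiple_results?query=" + searchQuery + "&faces-redirect=true";
}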
The only character encoding accepted for HTTP URLs and headers is US-ASCII; you need to URL-encode these characters to send them back to the application. The simplest way to do this in Java would be:
public String doSearch() {
    // some logic here
    String encodedSearchQuery = java.net.URLEncoder.encode(searchQuery, "UTF-8");
    return "/path/to/page/with/multiple_results?query=" + encodedSearchQuery + "&faces-redirect=true";
}
And then it should work for any character that you use.
I have a Java servlet which gets RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic, and it shows unintelligible characters on CentOS. I am using these lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on GlassFish and Tomcat. Both have the same problem: it works on Windows but fails on CentOS. How is this caused and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
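A minimal sketch of that approach (FeedFetcher and fetchFeed are made-up names; the feed URL is whatever the servlet currently fetches):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedFetcher {
    public static Document fetchFeed(String feedUrl) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // parse(String uri) opens the connection itself and picks up the encoding
        // from the BOM and the <?xml ... encoding="..."?> declaration, so no manual
        // getBytes()/new String() round trip is needed.
        return builder.parse(feedUrl);
    }
}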
There are many places where a wrong encoding can be used. Here is the complete list: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android.
I've implemented this in an external jar file that I import into my Android app.
When I run the unit tests in Eclipse it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.
If I attach the debugger to my Android app, I can see that these letters are incorrect in the exact same places where they were fine when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.
When I copy the text from the debuggers I get these results:
Java Process (Unit Test): «Blårek», «Benny»
Android Process (In emulator): «Bl�rek», «Benny»
I would expect these Strings to be equal, but notice how the "å" is replaced by the inverted question mark in Android.
I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure whether that would have made a difference.
Here is the code I run:
HtmlCleaner htmlCleaner = new HtmlCleaner();
// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );
// navigate through some TagNodes, getting the ContentNode
ContentNode cn = rootNode...
// This String contains the incorrectly decoded characters on Android.
// Good in Oracle JVM though..
String value = cn.toString().trim();
Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.
Thanks,
Geir
HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
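A sketch of the second option (htmlCleaner and url are the variables from the question; the charset parsing is deliberately simplistic, the ISO-8859-1 fallback is an assumption for this particular site, and the snippet uses java.net.URLConnection, java.nio.charset.Charset, and java.io.ByteArrayOutputStream):

// Option 1, as suggested above: let HtmlCleaner manage the connection itself.
// TagNode rootNode = htmlCleaner.clean(new URL(url));

// Option 2: read the charset from the "content-type" header and decode explicitly.
URLConnection conn = new URL(url).openConnection();
String contentType = conn.getContentType();        // e.g. "text/html; charset=ISO-8859-1"
Charset charset = Charset.forName("ISO-8859-1");   // fallback if no charset is declared
if (contentType != null && contentType.toLowerCase().contains("charset=")) {
    String declared = contentType.substring(contentType.toLowerCase().indexOf("charset=") + 8)
                                 .split(";")[0].replace("\"", "").trim();
    charset = Charset.forName(declared);
}
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
try (InputStream in = conn.getInputStream()) {
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
    }
}
// Decode with the declared charset, then hand the already-decoded String to HtmlCleaner.
String html = new String(buffer.toByteArray(), charset);
TagNode rootNode = htmlCleaner.clean(html);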