StringBuffer messageText = new StringBuffer();
messageText.append("<style type=\"text/css\">" +
"#message p {some style }" +
"</style>");
messageText.append("<p>");
messageText.append("abc’s email level…def");
messageText.append("</p>");
message.setContent(messageText.toString(), "text/html;");
Transport.send(message);
When I ran the code, I saw two different variations of the output.
I first typed the message abc’s email level…def in Microsoft Word and then copied it into the Eclipse editor. When I run the program, the message in the email comes out differently, like this: abc?s email level?def
But when I type the message abc’s email level…def directly in the Eclipse editor, I see the same message in the email.
What should I change in the code to receive the same message in the email even if I copy something from Microsoft Word?
This is almost certainly an encoding problem between your editors (MS Word and Eclipse, in this case) and your program. Verify that the content you copy from MS Word and paste into Eclipse is UTF-8 on both sides; I suspect that it is not.
The commenter is right that this is a problem with Microsoft's smart quotes, which generally don't paste correctly. You can write a regular expression to replace them, but that is a specific workaround for those particular characters and will not handle the general case.
The root cause is almost certainly an encoding mismatch between what you are pasting from MS Word and what your Java code expects. Check your Eclipse settings to verify that UTF-8 is the default encoding, and check your Word settings to verify that the source document is also UTF-8.
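As a rough illustration of the mismatch described above, the sketch below shows how the two Word characters (’ is U+2019, … is U+2026) turn into ? once they pass through a charset that cannot represent them, while UTF-8 keeps them intact. The class and method names are made up for this example, and the commented-out setContent call with an explicit charset parameter is only an assumption about what may help on the mail side; the question does not confirm it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteDemo {
    // The message from the question, written with Unicode escapes so the
    // encoding of this source file cannot alter it.
    static final String MESSAGE = "abc\u2019s email level\u2026def";

    // Encode and decode with the same charset; characters the charset
    // cannot represent are replaced with '?' during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // ISO-8859-1 has no smart quote or ellipsis, so both are lost.
        System.out.println(roundTrip(MESSAGE, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(MESSAGE, StandardCharsets.UTF_8).equals(MESSAGE));
        // Assumption (not from the question): declare the charset when building the mail, e.g.
        // message.setContent(html, "text/html; charset=UTF-8");
    }
}
```

Writing the literal with \u escapes is one way to make the source independent of the editor's file encoding.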
I want to download a text file by clicking a button, and everything works as expected. The problem is that the data I want to put in the text file ends up on just one line.
String fileContent = "Simple Solution \nDownload Example 1";
Here, \n is not working. The resulting output is:
Simple Solution Download Example 1
Code snippets:
interface:
interface implementation in my service class:
controller:
Don't use hardcoded \n or \r\n: line separators are platform-specific (Windows differs from all other OSes).
What you can do is:
Use System.lineSeparator()
Build content with String.format() and replace \n with %n
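Both options above can be sketched as follows. The class and method names are hypothetical, and the text is the sample content from the question.

```java
public class LineSeparatorDemo {
    // Option 1: concatenate with the platform's line separator.
    static String withSystemSeparator() {
        return "Simple Solution" + System.lineSeparator() + "Download Example 1";
    }

    // Option 2: let String.format expand %n to the platform's line separator.
    static String withFormat() {
        return String.format("Simple Solution%nDownload Example 1");
    }

    public static void main(String[] args) {
        System.out.println(withSystemSeparator());
        System.out.println(withFormat());
    }
}
```

Both methods produce the same string on a given machine. The separator is the one of the machine running the code, which for a download is the server, not the client.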
The main problem is that the server computer and client computer are basically independent with respect to character set encoding and line separators.
Defaults will not do.
As we are living in a Windows-centric world (I am a Linux user), use "\r\n".
Then Java can mix any Unicode script. A file does not carry information about its encoding.
If it originates on another computer/platform, that raises problems.
String fileContent = "Simple Solution façade, mañana, €\r\n"
+ "Download Обичам ĉĝĥĵŝŭ Example 1";
So the originating computer must explicitly define the encoding. It should not do:
fileContent.getBytes(); // Default platform encoding Charset.defaultCharset().
So the originating computer can do:
fileContent.getBytes(StandardCharsets.UTF_8); // UTF-8, full Unicode.
fileContent.getBytes("Windows-1252"); // MS Windows Latin 1, some ? failures.
The contentType can be set appropriately with "text/plain;charset=UTF-8" or, for Windows-1252, "text/plain;charset=windows-1252".
And from that byte[] you should take the .length for the contentLength.
Writing to the file can use Files.writeString
In that case use Files.size(exportedPath) for the content length.
Files.newInputStream(exportedPath) is the third goodie from Files.
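A minimal sketch of the three Files calls mentioned above, assuming Java 11 or later (for Files.writeString). The class name, the helper methods, and the temporary file are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExportDemo {
    // Writes the content as UTF-8 to a temporary file and returns its path.
    static Path export(String content) {
        try {
            Path exportedPath = Files.createTempFile("export", ".txt");
            Files.writeString(exportedPath, content, StandardCharsets.UTF_8);
            return exportedPath;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The content length is the size in bytes on disk, not the String length.
    static long contentLength(Path exportedPath) {
        try {
            return Files.size(exportedPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String fileContent = "Simple Solution fa\u00e7ade, \u20ac\r\nDownload Example 1";
        Path exportedPath = export(fileContent);
        System.out.println("chars: " + fileContent.length()
                + ", bytes: " + contentLength(exportedPath));
        // Stream the file back, e.g. as the body of the download response.
        try (InputStream in = Files.newInputStream(exportedPath)) {
            String readBack = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(readBack.equals(fileContent));
        }
        Files.delete(exportedPath);
    }
}
```

The byte count differs from the character count as soon as the text contains non-ASCII characters, which is why the content length should come from the encoded bytes or from the file size.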
I am experiencing a weird behavior with German "Umlaute" (ä, ö, ü, ß) when using Java's equality checks (either directly or indirectly).
Everything works as expected when running, debugging or testing from Eclipse and input containing "Umlaute" is treated as equal or not as expected.
However when I build the application using Spring Boot and run it, these equality checks fail for words that contain "Umlaute", i.e. for words like "Nationalität".
Input is retrieved from a webpage via Jsoup and content of a table is extracted for some keywords. The encoding of the page is UTF-8 and I have handling in place for Jsoup to convert it if this is not the case.
The encoding of the source files is UTF-8 as well.
Connection connection = Jsoup.connect(url)
.header("accept-language", "de-de, de, en")
.userAgent("Mozilla/5.0")
.timeout(10000)
.method(Method.GET);
Response response = connection.execute();
if(logger.isDebugEnabled())
logger.debug("Encoding of response: " +response.charset());
Document doc;
if(response.charset().equalsIgnoreCase("UTF-8"))
{
logger.debug("Response has expected charset");
doc = Jsoup.parse(response.body(), baseURL);
}
else
{
logger.debug("Response doesn't have expected charset and is converted");
doc = Jsoup.parse(new String(response.bodyAsBytes(), "UTF-8"), baseURL);
}
logger.debug("Encoding of document: " +doc.charset());
if(!doc.charset().equals(Charset.forName("UTF-8")))
{
logger.debug("Changing encoding of document from " +doc.charset());
doc.updateMetaCharsetElement(true);
doc.charset(Charset.forName("UTF-8"));
logger.debug("Changed encoding of document to: " +doc.charset());
}
return doc;
Example log output (from deployed app) of reading content.
Encoding of response: utf-8
Response has expected charset
Encoding of document: UTF-8
Example input:
<tr><th>Nationalität:</th> <td> [...] </td> </tr>
Example code that fails for words containing ä, ö, ü or ß but works fine for other words:
Element header = row.select("th").first();
String text = header.ownText();
if("Nationalität:".equals(text))
{
// goes here in eclipse
}
else
{
// and here in deployed spring boot app
}
Is there any difference between running from Eclipse and running a built and deployed app that I am missing? Where else could this behavior come from, and how can this be resolved?
As far as I can see this is not (directly) an encoding issue since the input shows "Umlaute" correctly...
Since this is not reproducible when debugging, I am having a hard time figuring out what exactly goes wrong.
Edit: While input looks fine in logs (i.e. diacritics show up correctly) I realized that they don't look correct in the console:
<th>Nationalität:</th>
I am currently using a Normalizer as suggested by Mirko like this:
Normalizer.normalize(input, Form.NFC);
(also tried it with NFD).
How do the (Spring Boot) console output and the (Logback) log output differ?
Diacritics like umlauts can often be represented in two different ways in unicode: As a single-codepoint character or as a composition of two characters. This isn't a problem of the encoding, it can happen in UTF-8, UTF-16, UTF-32 etc.
Java's equals method may not consider composite characters equal to single-codepoint characters, even though they look exactly the same.
Try to have a look at the binary representation of the strings you are comparing, this way you should be able to track down the differences.
You could also use the methods of the "Character" class to iterate through the strings and print out the properties of all the characters. Maybe this helps, too, to figure out differences.
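One way to look at the individual characters, as suggested above, is to print every code point of both strings. This is only a sketch; the class and method names are made up.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Renders each code point of the string as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Precomposed a-umlaut: one code point.
        System.out.println(describe("\u00e4"));
        // 'a' followed by a combining diaeresis: two code points.
        System.out.println(describe("a\u0308"));
    }
}
```

Logging describe(text) next to describe("Nationalität:") in the deployed app should show whether the two sides differ and, if so, at which position.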
In any case, it could help if you use java.text.Normalizer on both "sides" of the "equals", to normalize the text to, for example, Unicode Normalization Form C. This way, differences like the aforementioned should be straightened out and the strings should compare as expected.
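A small sketch of normalizing both sides before comparing, using java.text.Normalizer. The helper name equalsNormalized is hypothetical, and this only helps if the difference really is composed versus decomposed characters.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Compares two strings after bringing both into Normalization Form C.
    static boolean equalsNormalized(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }

    public static void main(String[] args) {
        String composed = "Nationalit\u00e4t:";    // single code point for the umlaut
        String decomposed = "Nationalita\u0308t:"; // 'a' plus combining diaeresis
        System.out.println(composed.equals(decomposed));
        System.out.println(equalsNormalized(composed, decomposed));
    }
}
```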
Have you tried printing the keycode to console to see if they actually match when compiled? Maybe Eclipse is handling the charset gracefully but when it's compiled it's down to some Java/System settings?
I think I tracked this down to the build of the standalone app being the culprit.
As described above, when running from Eclipse all is fine, the problem only occurred when I ran the standalone Spring Boot app.
This is being built with Gradle. In my build.gradle I have
compileJava.options.encoding = 'UTF-8'
in order to force UTF-8 being used for encoding. This should (usually) be enough. I however also use AspectJ (via gradle-aspectj plugin) which apparently breaks this behavior (involuntarily?) and results in a default encoding to be used instead of the one explicitly defined.
In order to solve this I added
compileAspect {
additionalAjcArgs = ['encoding' : 'UTF-8']
}
to my build.gradle which passes the encoding option on to the ajc compiler. This seems to have fixed the problem for the regular build.
The problem still occurs however when tests are run from gradle. I was not yet able to find out what needs to be done there and why the above configuration is not enough.
This is now tracked in a separate question.
Background:
I have 2 machines: one running German Windows 7, and my PC running English Windows 7 (with Hebrew locale).
In my Perl code I'm trying to check if the file that I got from the German machine exists on my machine.
The file name is ßßßzllpoöäüljiznppü.txt
Why does the following code fail?
use Encode;
use Encode::Locale;
sub UTF8ToLocale
{
my $str = decode("utf8",$_[0]);
return encode(locale, $str);
}
if(!-e UTF8ToLocale($read_file))
{
print "failed to open the file";
}
else
{
print $read_file;
}
Same thing goes also when I'm trying to open the file:
open (wtFile, ">", UTF8ToLocale($read_file));
binmode wtFile;
shift @_;
print wtFile @_;
close wtFile;
The file name is converted from German to UTF-8 in my Java application and passed to the Perl script.
The Perl script takes this file name and converts it from UTF-8 to the system locale (see the UTF8ToLocale($read_file) call), and I believe that is the problem.
Questions:
Can you please tell me which charset encoding the OS file system uses?
When I create a German file name on an OS whose locale is Hebrew, in which charset is it saved?
How do I solve this problem?
Update:
Here is another code that I run with hard coded file name on my PC, the script file is utf8 encoded:
use Encode;
use Encode::Locale;
my $string = encode("utf-16",decode("utf8","C:\\TestPerl\\ßßßzllpoöäüljiznppü.txt"));
if (-e $string)
{
print "exists\r\n";
}
else
{
print "not exists\r\n"
}
The output is "not exists".
I also tried different charsets: cp1252, cp850, utf-16le, nothing works.
If I change the file name to English or Hebrew (my default locale), it works.
Any ideas?
Windows 7 uses UTF-16 internally [citation needed] (I don't remember the byte order). You don't need to convert file names because of that. However, if you transport the file via a FAT file system (eg an old USB stick) or other non Unicode aware file systems these benefits will get lost.
The locale setting you are talking about only affects the language of the user interface and the apparent folder names (Programme (x86) vs. Program Files (x86) with the latter being the real name in the file system).
The larger problem I can see is the internal encoding of the file contents that you want to transfer as some applications may default to different encodings depending on the locale. There is no solution to that except being explicit when the file is created. Sticking to UTF-8 is generally a good idea.
And why do you convert the file names with another tool? Any Unicode encoding should be sufficient for transfer.
Your script does not work because you reference an undefined global variable called $read_file. Assuming your second code block is not enclosed in any scope, especially not in a sub, the @_ variable is not available. To get command line arguments you should consider using the @ARGV array. The logic of your script isn't clear anyway: you print error messages to STDOUT, not STDERR; you "decode" the file name and then print out the non-decoded string in your else-branch; you are paranoid about encodings (which is generally good) but you don't specify an encoding for your output stream; and so on.
For development I'm using ResourceBundle to read a UTF-8 encoded properties-file (I set that in Eclipse' file properties on that file) directly from my resources-directory in the IDE (native2ascii is used on the way to production), e.g.:
menu.file.open.label=&Öffnen...
label.btn.add.name=&Hinzufügen
label.btn.remove.name=&Löschen
Since that causes issues with the character encoding when using non-ASCII characters I thought I'd be happy with:
ResourceBundle resourceBundle = ResourceBundle.getBundle("messages", Locale.getDefault());
String value = resourceBundle.getString(key);
value = new String(value.getBytes(), "UTF-8");
Well, it does work nicely for lower-case German umlauts, but not for the upper-case ones, the ß also doesn't work. Here's the value read with getString(key) and the value after the conversion with new String(value.getBytes(), "UTF-8"):
&LÃ¶schen => &Löschen
&HinzufÃ¼gen => &Hinzufügen
&Ã?ber => &??ber
&SchlieÃ?en => &Schlie??en
&Ã?ffnen... => &??ffnen...
The last three should be:
&Ã?ber => &Über
&SchlieÃ?en => &Schließen
&Ã?ffnen... => &Öffnen...
I guess that I'm not too far away from the truth, but what am I missing here?
Google found something similar, but that remained unanswered.
EDIT: a little more code
The problem is you're calling String.getBytes() without specifying an encoding - which will use the default platform encoding. You're then using the binary result of that operation as if it were in UTF-8.
If you use UTF-8 in both directions, it'll be fine:
// Should be a round-trip
value = new String(value.getBytes("UTF-8"), "UTF-8");
... but if you were trying to use this to read a UTF-8-encoded property file without telling the code which is performing the initial read, that won't work.
The code you've presented is basically always the wrong approach. Your "Since that causes issues with the character encoding" suggests that you'd already run across an earlier problem - so I'd go back to that, instead of trying to apply a broken fix. If you've already lost data when constructing the ResourceBundle, it's too late to go back later... you need to make sure the ResourceBundle itself is loaded correctly.
Please tell us exactly what problems you had with the ResourceBundle, and we can see if we can fix the root cause.
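To illustrate why the conversion in the question appears to work for some characters and not for others, here is a sketch that simulates the whole chain: a UTF-8 file read as ISO 8859-1, then "repaired" through a platform default encoding. It assumes the platform default on the asker's machine was windows-1252 and that this charset is available in the JDK being used; neither is stated in the question. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulates a UTF-8 encoded properties value being read as ISO 8859-1.
    static String misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    // The "fix" from the question, with the platform default made explicit.
    static String repair(String misread, Charset platformDefault) {
        return new String(misread.getBytes(platformDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 was the default encoding on the asker's machine.
        Charset cp1252 = Charset.forName("windows-1252");
        // The UTF-8 bytes of o-umlaut map to characters windows-1252 can encode,
        // so the original bytes come back and the repair happens to work.
        System.out.println(repair(misread("L\u00f6schen"), cp1252).equals("L\u00f6schen"));
        // The second UTF-8 byte of U-umlaut becomes a control character that
        // windows-1252 cannot encode, so it is replaced and the data is lost.
        System.out.println(repair(misread("\u00dcber"), cp1252).equals("\u00dcber"));
        // Encoding back with ISO 8859-1, the charset used for the misread, is lossless.
        System.out.println(repair(misread("\u00dcber"), StandardCharsets.ISO_8859_1)
                .equals("\u00dcber"));
    }
}
```

Even the lossless variant only undoes a wrong read after the fact; loading the bundle with the right encoding in the first place avoids the problem.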
EDIT: It's not clear how you're running native2ascii. The fix may be as simple as changing to use:
native2ascii -encoding UTF-8 input.properties output.properties
Some notes:
If it is a String it is UTF-16 and if it isn't it is a corrupt string (and too late to fix.)
new String(value.getBytes(), "UTF-8"); - this code will (at best) do nothing on a system that uses UTF-8 as the default encoding; otherwise it will corrupt the string.
.properties files must be ISO 8859-1 (the Properties type supports other formats and encodings, but I don't know how you would tell ResourceBundle that.)
System.out can introduce its own transcoding bugs (the PrintStream encodes UTF-16 strings to the default encoding; the receiving device must decode the bytes using the same encoding.)
I suspect you are trying to fix your problems in the wrong place.
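On the note above about telling ResourceBundle the encoding: one possible approach during development is to build a PropertyResourceBundle from a Reader that decodes UTF-8 explicitly. This is a sketch rather than the setup used in the question. The class and method names are hypothetical, and the properties content is supplied as a byte array only to keep the example self-contained; in a real project the stream would come from the properties file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Utf8BundleDemo {
    // Builds a bundle from UTF-8 encoded properties content.
    static ResourceBundle loadUtf8(byte[] utf8Properties) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Properties), StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded messages.properties file.
        byte[] file = ("menu.file.open.label=&\u00d6ffnen...\n"
                + "label.btn.remove.name=&L\u00f6schen\n").getBytes(StandardCharsets.UTF_8);
        ResourceBundle bundle = loadUtf8(file);
        System.out.println(bundle.getString("menu.file.open.label").equals("&\u00d6ffnen..."));
    }
}
```

Because the Reader decodes the bytes, no second conversion of the looked-up values is needed.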
You are encoding the text with a different encoding to the one you are decoding with.
Try instead using the same character set for encoding and decoding.
value = new String(value.getBytes("UTF-8"), "UTF-8");
String s = "ßßßßß";
s += s.toUpperCase();
s = new String(s.getBytes("UTF-8"), "UTF-8");
System.out.println(s);
prints
ßßßßßSSSSSSSSSS
Today I was talking to one of my colleagues and he was pretty much on the same path as the other answers have mentioned. So I tried to achieve what Jon Skeet had mentioned, meaning creating the same file as in production. Since rebuilding the project after each change of a resource is out of question and I hadn't done any of what solved this (and I guess this will be new to some) let me line it out (even if it may be just for personal reference ;) ). In short this uses Eclipse' project builders.
Create an Ant-style build.xml
<?xml version="1.0" encoding="UTF-8"?>
<project>
<property name="dir.resources" value="src/main/resources" />
<property name="dir.target" value="bin/main" />
<target name="native-to-ascii">
<delete dir="${dir.target}" includes="**/*.properties" />
<native2ascii src="${dir.resources}" dest="${dir.target}" includes="**/*.properties" />
</target>
</project>
Its intention is to delete the properties-files in the target directory and use native2ascii to recreate them. The delete is necessary as native2ascii won't overwrite existing files.
In Eclipse go to the project properties and select "Builders", click "New...", pick "Ant Builder" (that's the slightly enhanced editor for run configurations)
In "Main" let "Buildfile" point to the Ant-script, set "Base Directory" to ${project_loc}
In "Refresh" tick "Refresh resources upon completion" and pick "The project containing the selected resource"
In "Targets" click "Set Targets" next to the "Auto Build" and pick native-to-ascii there (note that for some reason I had to do this later again)
This might not be necessary for everybody, but in "JRE" pick a proper execution environment
In "Build Options" tick off "Allocate Console" (however, you may want to keep this ticked on until you see that it's all working)
"Apply", "OK"
I was told that the newly created builder should be somewhere underneath the Java Builder (use Up/Down-button)
In the "Java Build Path" select the source folder with the resources (src/main/resources for me) and add an exclusion for **/*.properties
That should have been it. If you edit a properties-file and save it, it should automatically be converted to ASCII in the output folder. You can try with entering ü, which should end up as \u00fc.
Note that if you have a lot of properties-files, this may take some time. Just don't save after every keypress. :)
I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android.
I've implemented this in an external jar file that I import into my Android app.
When I run the unit tests in Eclipse it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.
If I attach the debugger to my Android app I can see that these letters are not correct in the exact same places they were good when running unit test from Eclipse, so it's not a display/render/view issue in the Android app.
When I copy the text from the debuggers I get these results:
Java Process (Unit Test): «Blårek», «Benny»
Android Process (In emulator): «Bl�rek», «Benny»
I would expect these Strings to be equal, but notice how the "å" is replaced by the inverted question mark in Android.
I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure if that would have made a difference.
Here is the code I run:
HtmlCleaner htmlCleaner = new HtmlCleaner();
// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );
// navigate through some TagNodes, getting the ContentNode
ContentNode cn = rootNode...
// This String contains the incorrectly decoded characters on Android.
// Good in Oracle JVM though..
String value = cn.toString().trim();
Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.
Thanks,
Geir
HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
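The header parsing mentioned above could look roughly like this. It is a JDK-only sketch: the class name, the method name, and the fallback parameter are hypothetical, and handing the decoded input to HtmlCleaner is left as a comment because the exact call depends on the HtmlCleaner version in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1"; returns the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1",
                StandardCharsets.UTF_8));
        // With a real connection (not executed here):
        // URLConnection conn = new URL(url).openConnection();
        // Charset cs = fromContentType(conn.getContentType(), StandardCharsets.ISO_8859_1);
        // Reader reader = new InputStreamReader(conn.getInputStream(), cs);
        // ...then hand the reader, or the stream plus charset name, to HtmlCleaner.
    }
}
```

Decoding with the charset the server declares removes the dependency on each platform's default encoding, which is a plausible reason for the difference between the desktop JVM and Android.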