Why do I need to escape unicode in java source files?

Please note that I'm not asking how but why. And I don't know if it's a RCP specific problem or if it's something inherent to java.
My java source files are encoded in UTF-8.
If I define my literal strings like this :
new Language("fr", "Français"),
new Language("zh", "中文")
It works as I expect when I launch the application from Eclipse as an Eclipse application.
But it fails when I launch the .exe built by the "Eclipse Product Export Wizard".
The solution I use is to escape the chars like this :
new Language("fr", "Fran\u00e7ais"), // Français
new Language("zh", "\u4e2d\u6587") // 中文
There is no problem in doing this (all my other strings are in properties files, only the languages names are hardcoded) but I'd like to understand.
I thought the compiler had to convert the Java string literals when building the bytecode. So why is the Unicode escaping necessary? Is it wrong to use high-range Unicode chars in Java source files? What exactly happens to those chars at compilation, and how does that differ from the handling of escaped chars? Is the problem just related to the RCP cache?

It appears that the Eclipse Product Export Wizard is not interpreting your files as UTF-8. Perhaps you need to run Eclipse's JVM with the encoding set to UTF-8 (-Dfile.encoding=UTF8 in eclipse.ini)?
(Copypasta'd at OPs request)

When exporting a plug-in, it gets compiled through a process separate from the normal build process within the IDE. There is a known bug that the build process (PDE.Build) disregards the text encoding used by the IDE.
The export can be made to work properly by specifying the text encoding in the build.properties file of your plugin
javacDefaultEncoding.. =UTF-8
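To illustrate the "what happens at compilation" part of the question: the compiler first decodes the source bytes into characters using some charset, and only then builds the string constants. The sketch below imitates a build that reads UTF-8 source bytes with a single-byte charset. ISO-8859-1 is an assumption chosen for illustration (the charset the export build actually used is not stated above), and the class and method names are hypothetical. All literals are written with escapes so the example itself does not depend on the file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceDecodingDemo {

    // Imitates the compiler's first step: the bytes of a UTF-8 source file
    // are decoded with whatever charset the build tool assumes.
    static String decodeAs(String literal, Charset assumedByCompiler) {
        byte[] bytesOnDisk = literal.getBytes(StandardCharsets.UTF_8);
        return new String(bytesOnDisk, assumedByCompiler);
    }

    public static void main(String[] args) {
        String name = "Fran\u00e7ais"; // escaped form, pure ASCII on disk

        // Correct assumption: the literal survives unchanged.
        System.out.println(decodeAs(name, StandardCharsets.UTF_8).equals(name)); // true

        // Wrong assumption: the two UTF-8 bytes of the c-cedilla (C3 A7)
        // are decoded as two separate characters.
        String garbled = decodeAs(name, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("Fran\u00c3\u00a7ais")); // true
        System.out.println(garbled.length() - name.length());      // 1
    }
}
```

An escaped literal contains only ASCII bytes, which practically every charset decodes the same way, so the escaped version is unaffected by the build's encoding setting.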

Related

Spring Item Writer producing file with first line prefix with non-printable characters

I recently discovered that the file produced by the flat file item writer is prefixed with non-printable characters. I have attached a screenshot below.
Project uses Spring 4.1.7 RELEASE jar with Spring batch version 2.2.5 RELEASE on java 8 platform. Any idea to resolve this?
See screenshot
[Update 03/20] This issue is resolved. Output file extension chosen was .out and for some reason when this file was created on Windows platform it had non-printable characters in the beginning of the line. When extension was changed to .txt, output was as expected. On Linux, output file with .out extension has no issues. To conclude, it was OS platform specific issue, but if someone knows actual reason behind this then please shed some light.

Without any code to look at, I'm guessing you are writing a line (probably a String) out that contains nothing but NULL values in Java...hence you get NULNULNULNUL... in the file.

Java encoding issue with ├ └

If I print
System.out.println("│ ├── └──");
I see only question marks (???). Seems that this is some kind of encoding problem. Any ideas how to fix this?

Use Unicode escapes instead of the actual characters. For example ├ is \u251c.
Here is a link that will help you convert characters to corresponding codes: http://www.cylog.org/online_tools/utf8_converter.jsp
Hope it helps!

Any ideas how to fix this?
There are two possible causes of your problem:
1) It could be happening when you edit or compile the source code. The compiler could be reading the source code using a different file encoding from the one that you are using when you edit it. If you don't specify a source file encoding, the compiler will use a platform-specific default, and that might not be the right one.
The fix for this is to adjust your compiler settings to specify the correct source file encoding. How you do that will depend on how you are compiling. If you are compiling from the command line using javac, use the -encoding option.
Alternatively, a workaround for this problem is to replace the offending characters in your source code with Unicode escapes. For example:
String s = "\u251c";
should give you a one character string consisting of a "├" character. I would recommend the workaround. Source code that includes non-ASCII characters is always going to be sensitive to how you edit and compile ... and that is not a good thing.
2) It could be happening because there is a mismatch between your Java runtime platform's default output encoding and the actual encoding of whatever is displaying the output.
The fix for this is one of:
change the encoding for the display,
override the default encoding for the JVM (e.g. using -Dfile.encoding=UTF-8), or
change your code to output using a specific encoding.
Which is best depends on the circumstances; e.g. why things are "wrong" in the first place.
It is worth running this test application from the command prompt to see if the problem exists there too. If it does, then redirect standard output to a file, and use a hex dump utility (e.g. od on Linux) to see how the characters are encoded. That will help you distinguish causes 1) and 2) above.
(It is also possible that you have both problems ...)
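As a rough sketch of the third fix for cause 2) (output using a specific encoding), the following wraps System.out in a PrintStream that always writes UTF-8. The class and helper names are hypothetical, and whether the characters then display correctly still depends on the terminal or console expecting UTF-8. The byte dump at the end plays the role of the od check suggested above.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {

    // Wraps a stream so that text is always written as UTF-8,
    // whatever the JVM default encoding is.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported by every JVM", e);
        }
    }

    // Returns the bytes the UTF-8 stream emits for the given text.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Escaped literals keep this file independent of the source encoding.
        String tree = "\u2502 \u251c\u2500\u2500 \u2514\u2500\u2500";
        utf8Stream(System.out).println(tree);

        // Hex dump of a single box-drawing character.
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("\u251c")) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim()); // e2 94 9c
    }
}
```

If the redirected output contains e2 94 9c for the first branch character, the program side is fine and the remaining mismatch is in whatever displays the bytes.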

The encoding of the Java file (editor) and the encoding used by the javac compiler had better both be UTF-8. This is generally an IDE or project setting.
One might check that both encodings are equal by u-escaping those chars: \u251C etcetera.
System.out must use the operating system encoding. If that encoding cannot convert those characters, one might see a ?. If the console is a console emulation of the IDE, you might look for the setting of that encoding. Also check that the console font contains those graphic chars. Running the IDE with java -Dfile.encoding=UTF-8 might help.
In your case: Strange. Check the source encoding with gedit, and dump System.getProperty("file.encoding").
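A small diagnostic along these lines might look as follows. It is only a sketch with hypothetical names: it dumps the file.encoding property and the default charset, and asks an encoder whether the box-drawing characters can be represented at all. On newer JDKs System.out does not necessarily use the default charset, so treat the result as a first check only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDiagnostics {

    // True if every character of text can be represented in the charset.
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String tree = "\u2502 \u251c\u2500\u2500 \u2514\u2500\u2500";

        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("default can encode the tree: "
                + canEncode(Charset.defaultCharset(), tree));

        // Reference points that hold on any JVM:
        System.out.println(canEncode(StandardCharsets.UTF_8, tree));      // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, tree)); // false
    }
}
```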

Encoding for project set to UTF-8, default charset returns windows-1252

I've run into an issue with encoding. Not sure if it's IDE related but I'm using NetBeans 7.4. I got this piece of code in my J2EE project:
String test = "kukuřičné";
System.out.println(new String(test.getBytes("UTF-8"))); // should display ok
System.out.println(new String(test.getBytes("ISO-8859-1")));
System.out.println(new String(test.getBytes("UTF-16")));
System.out.println(new String(test.getBytes("US-ASCII")));
System.out.println(new String(test.getBytes("windows-1250")));
System.out.println(test); // should display ok
And when I run it, it never displays properly. UTF-8 should be able to print that out ok but it doesn't. Also when I tried:
System.out.println(Charset.defaultCharset());
it returned windows-1252. The project is set to UTF-8 encoding. I've even tried resaving this specific java file in UTF-8 but it still doesn't display properly.
I've tried to create J2SE project on the other hand and when I run the same code it displays properly. Also the default charset returns UTF-8.
Both projects are set the UTF-8 encoding.
I want my J2EE project to act the same as the J2SE one. I didn't notice this issue until I updated my Java to version 1.7.0_51-b13 but again I'm not sure if that is related.
I'm experiencing the same issue as this guy: http://forums.netbeans.org/ptopic37752.html
I've also tried setting the default encoding for the whole IDE: -J-Dfile.encoding=UTF-8 but it didn't help.
I've noticed an important fact. When I create a new web application it displays ok. When I create new Maven web application it displays incorrectly.
Found the same issue here: https://netbeans.org/bugzilla/show_bug.cgi?id=224526
I still haven't fixed it yet. There's still no solution working.
In my pom.xml the encoding is set properly, but it still shows windows-1252 in the end.
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

I've spent a few hours trying to find the best solution.
First of all, this is an issue with Maven, which picks up the platform encoding and uses it even though you've specified a different encoding to be used. Maven doesn't seem to care (it even prints to the console that it's using UTF-8, but when you run a file with the code above, it won't display properly).
I've managed to tackle this issue by setting a system variable:
JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
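There should be another option instead of setting system variables, and that is to pass it as an additional compiler parameter,
like javac -J-Dfile.encoding=UTF8 (javac itself does not accept -D; the -J prefix forwards the option to the JVM that runs the compiler)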

You are mixing a few concepts here:
the project encoding is the encoding used to save the Java source files (xxxx.java) - it has nothing to do with how your code executes
test.getBytes("UTF-8") returns a series of bytes representing your String in UTF-8 encoding
to recreate the original string, you need to explicitly give the encoding, unless it is the default of your machine: new String(test.getBytes("UTF-8"), StandardCharsets.UTF_8)
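The last point can be sketched as a round trip in which the same charset is named on both sides, so the platform default never enters the picture. The class and method names are hypothetical, and the test string is written with escapes so that the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {

    // Encodes and decodes with the same, explicitly named charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String test = "kuku\u0159i\u010dn\u00e9";

        // Charsets that cover all the characters give the original back.
        System.out.println(roundTrip(test, StandardCharsets.UTF_8).equals(test));  // true
        System.out.println(roundTrip(test, StandardCharsets.UTF_16).equals(test)); // true

        // Charsets that lack a character replace it with '?' while encoding.
        System.out.println(roundTrip(test, StandardCharsets.ISO_8859_1)
                .equals("kuku?i?n\u00e9"));                                        // true
        System.out.println(roundTrip(test, StandardCharsets.US_ASCII)
                .equals("kuku?i?n?"));                                             // true
    }
}
```

In the question's code, new String(bytes) without a charset decodes with the JVM default (windows-1252 there), which is why even the UTF-8 line came out wrong.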

Saving JSP as UTF-8 in NetBeans

I've got some JSP files from other developers and now need to work with them. When I add any UTF-8 char to the document and want to save it, NetBeans automatically offers to save it in ISO-8859-1.
Actually i'm getting this message from NetBeans:
The index.jsp contains characters
which will probably be damaged during
conversion to the ISO-8859-1 character
set. Do you want to save the file
using this character set? (Yes/No)
NB didn't offer me any other option, like saving the file as UTF-8 (which it should already be written in).
I don't know how to save those JSP files in the character set they are already written in.
And don't tell me that changing the content of the file itself (which is ineffective due to including headers etc. from other files) is the only way...
http://forums.netbeans.org/topic8750.html

Firstly; don't forget to consider this line at top:
<%@page contentType="text/html" pageEncoding="UTF-8"%>
Secondly;
In the NetBeans folder there is a config file. There should be a line like that:
netbeans_default_options="-J-Xms32m -J-Xmx128m -J-XX:PermSize=32m -J-XX:MaxPermSize=160m -J-Xverify:none -J-Dapple.laf.useScreenMenuBar=true"
Add this to the end of the line:
-J-Dfile.encoding=UTF-8
Thirdly:
NetBeans implements a project encoding setting.
To change the language encoding for a project:
Right-click a project node in the Projects windows and choose Properties.
Under Sources, select an encoding value from the Encoding drop-down field.
The encoding affects at least:
* how non-ASCII characters are displayed in the editor window when you open files
* Java file compilation of sources containing non-ASCII identifiers, string literals, or comments
* textual search for international characters over the project
Starting from NetBeans IDE 6.8, you can also specify the encoding that will be used at runtime. For example, this can be useful when the encoding for the operating system on which the application will run is different from your project's encoding.
To specify the encoding to be used at runtime:
In the Files window for your project, open nbproject > private > private.properties
Add the following line to the private.properties file and save changes:
runtime.encoding = < encoding >
This encoding will override the encoding setting for your project and will be used when running your application.
In general,
*.properties files always use ISO-8859-1 encoding plus \uXXXX escapes. (International characters will be displayed natively in the editor but stored as an escape on disk.)
*.xml files and some *.html files can specify their own encodings, regardless of the project encoding. For such files, the IDE's editor ignores the project encoding.
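The note about *.properties files can be checked with a short sketch. It assumes the file is read with java.util.Properties.load(InputStream), which decodes ISO-8859-1 and expands the escapes; the class name, method name and property keys are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {

    // Loads properties from text the way a .properties file on disk is read:
    // as ISO-8859-1 bytes, with Unicode escapes expanded by Properties.load.
    static Properties loadFromText(String fileContent) {
        Properties props = new Properties();
        byte[] onDisk = fileContent.getBytes(StandardCharsets.ISO_8859_1);
        try {
            props.load(new ByteArrayInputStream(onDisk));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslashes put literal escape sequences into the "file".
        String fileContent = "lang.fr=Fran\\u00e7ais\nlang.zh=\\u4e2d\\u6587\n";
        Properties props = loadFromText(fileContent);

        System.out.println(props.getProperty("lang.fr").equals("Fran\u00e7ais")); // true
        System.out.println(props.getProperty("lang.zh").length());                // 2
    }
}
```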
These may help you.
Sources for my answer that I used:
Link1: http://forums.netbeans.org/topic33.html
Link2: http://wiki.netbeans.org/FaqI18nProjectEncoding

Illegal Character when trying to compile java code

I have a program that allows a user to type java code into a rich text box and then compile it using the java compiler. Whenever I try to compile the code that I have written I get an error that says that I have an illegal character at the beginning of my code that is not there. This is the error the compiler is giving me:
C:\Users\Travis Michael>"\Program Files\Java\jdk1.6.0_17\bin\javac" Test.java
Test.java:1: illegal character: \187
∩╗┐public class Test
^
Test.java:1: illegal character: \191
∩╗┐public class Test
^
2 errors

The BOM is generated by, say, File.WriteAllText() or StreamWriter when you don't specify an Encoding. The default is to use the UTF8 encoding and generate a BOM. You can tell the java compiler about this with its -encoding command line option.
The path of least resistance is to avoid generating the BOM. Do so by specifying System.Text.Encoding.Default, that will write the file with the characters in the default code page of your operating system and doesn't write a BOM. Use the File.WriteAllText(String, String, Encoding) overload or the StreamWriter(String, Boolean, Encoding) constructor.
Just make sure that the file you create doesn't get compiled by a machine in another corner of the world. It will produce mojibake.

That's a byte order mark, as everyone says.
javac does not understand the BOM, not even when you try something like
javac -encoding UTF8 Test.java
You need to strip the BOM or convert your source file to another encoding. Notepad++ can convert a single file's encoding; I'm not aware of a batch utility on the Windows platform for this.
The java compiler will assume the file is in your platform default encoding, so if you use this, you don't have to specify the encoding.

If using an IDE, specify the java file encoding (via the properties panel)
If NOT using an IDE, use an advanced text editor (I can recommend Notepad++) and set the encoding to "UTF-8 without BOM", or "ANSI", if that suits you.

In this case do the following Steps 1-7
In Android Studio
1. Menu -> Edit -> Select All
2. Menu -> Edit -> Cut
3. Open a new Notepad.exe window
In Notepad
4. Menu -> Edit -> Paste
5. Menu -> Edit -> Select All
6. Menu -> Edit -> Copy
Back In Android Studio
7. Menu -> Edit -> Paste

http://en.wikipedia.org/wiki/Byte_order_mark
The byte order mark (BOM) is a Unicode
character used to signal the
endianness (byte order) of a text file
or stream. Its code point is U+FEFF.
BOM use is optional, and, if used,
should appear at the start of the text
stream. Beyond its specific use as a
byte-order indicator, the BOM
character may also indicate which of
the several Unicode representations
the text is encoded in.
The BOM is a funky-looking character that you sometimes find at the start of Unicode streams, giving a clue what the encoding is. It's usually handled invisibly by the string-handling stuff in Java, so you must have confused it somehow, but without seeing your code, it's hard to see where.
You might be able to fix it trivially by manually stripping the BOM from the string before feeding it to javac. Note that trim() will not do this, because U+FEFF does not count as whitespace; remove the leading character explicitly instead.
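A minimal sketch of stripping the BOM from source text before it is handed to the compiler. The class and method names are hypothetical. The first two checks show why trim() is not sufficient: it only removes characters up to U+0020, and Character.isWhitespace does not regard U+FEFF as whitespace either.

```java
class BomStripper {

    private static final char BOM = '\uFEFF';

    // Removes a single leading byte order mark, if present.
    static String stripBom(String source) {
        if (!source.isEmpty() && source.charAt(0) == BOM) {
            return source.substring(1);
        }
        return source;
    }

    public static void main(String[] args) {
        String withBom = BOM + "public class Test {}";

        // trim() leaves the BOM in place.
        System.out.println(withBom.trim().charAt(0) == BOM); // true
        System.out.println(Character.isWhitespace(BOM));     // false

        // Explicit removal gives clean source text.
        System.out.println(stripBom(withBom));               // public class Test {}
    }
}
```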

That's a problem related to the BOM (Byte Order Mark) character. The byte order mark is a Unicode character used to define a text file's byte order, and it comes at the start of the file. Eclipse doesn't allow this character at the start of your file, so you must delete it. For this purpose, use a text editor like Notepad++ and save the file with encoding "UTF-8 without BOM". That should remove the problem.
I had copy-pasted some content from a website into a Notepad++ editor, and it showed "LS" with a black background. I deleted the "LS" content and copied the same content from Notepad++ into the Java file, and it works fine.

I solved this by right clicking in my textEdit program file and selecting [substitutions] and un-checking smart quotes.

instead of getting Notepad++,
You can simply
Open the file with Wordpad
and then
Save As - Plain Text document

I was facing this issue too, as I am using Notepad++ to code. It is very convenient to type the code in Notepad++. However, after compiling I get an error " error: illegal character: '\u00bb'".
Solution :
Start writing the code in the older Notepad (which will be there by default on your PC) and save it. Later modifications can be done using Notepad++.
It works!!!

I had the same problem with a file I generated using the command echo echo "" > Main.java in Windows PowerShell. I searched for the problem and it seemed to have something to do with encoding. I checked the encoding of the file using file -i Main.java and the result was text/plain; charset=utf-16le.
Later I deleted the file and recreated it in Git Bash using touch Main.java, and with this the file compiled successfully. I checked the file encoding using the file -i command and this time the result was Main.java: text/x-c; charset=us-ascii.
Next I searched the internet and found that to create an empty file using PowerShell we can use the cmdlet New-Item. I created the file using New-Item Main.java and checked its encoding, and this time the result was Main.java: text/x-c; charset=us-ascii and it compiled successfully.
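Where the file command is not available, a similar check can be sketched in Java by looking at the first bytes of the file. This is a simplified, hypothetical helper: it only recognises the UTF-8 and UTF-16 byte order marks and does not try to detect UTF-32 or BOM-less files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class BomSniffer {

    // Names the encoding suggested by a leading byte order mark, or "none".
    static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xff) == 0xEF
                && (head[1] & 0xff) == 0xBB && (head[2] & 0xff) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xff) == 0xFF && (head[1] & 0xff) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xff) == 0xFE && (head[1] & 0xff) == 0xFF) {
            return "UTF-16BE";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java BomSniffer Main.java
        if (args.length == 1) {
            byte[] content = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(detectBom(content));
        }
    }
}
```

A result of UTF-16LE for a freshly created Main.java would match the charset=utf-16le report above.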
