javac does not output Unicode on the command line

Context: Windows 10, cmd.exe, javac 9.0.1.
My source code is Unicode encoded. If I run javac -encoding UTF-8 ... and there is an error, I just can't get javac to display the source correctly.
As you can see in the picture, the cli can print unicode chars just fine.

It would appear that javac is not using your terminal's character encoding.
You can specify the character encoding for a JVM using the flag:
java -Dfile.encoding=UTF-8 ...
(Or whatever encoding)
Javac is just a thin wrapper around a Java program. You can pass arguments directly to its JVM using the -J flag. So:
javac -J-Dfile.encoding=UTF-8 ...

You can find your current (default) encoding by evaluating
System.getProperty("file.encoding")
and you can change the default encoding by setting this property.
On Windows it is usually cp1252.
Long story, quoted from the IBM KB:
Internally, the Java virtual machine (JVM) always operates with data
in Unicode. However, all data transferred into or out of the JVM is in
a format matching the file.encoding property. Data read into the JVM
is converted from file.encoding to Unicode and data sent out of the
JVM is converted from Unicode to file.encoding.
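As a rough illustration of the conversion described in the quote, here is a minimal Java sketch. It prints the file.encoding property and the default charset, then shows that the same character becomes a different number of bytes depending on the charset used on the way out. The class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {
    // Convert the JVM's internal Unicode text into bytes of the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // What this JVM currently assumes for data going in and out.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // e-acute, written as an escape so the source file itself stays ASCII.
        String text = "\u00e9";
        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
    }
}
```

Running it once plainly and once with java -Dfile.encoding=UTF-8 should show whether the flag changes what the JVM reports on your setup.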

Related

Passing arguments containing non-ASCII characters from bash to a Java program

Is it possible to reliably pass non-ASCII characters as command-line arguments from bash on CentOS? I keep getting wrongly encoded chars from the args.
In my case it's a pesky ASCII 85h character which is defined only for Cp1250 but not for UTF-8 or ISO-8859-*.
#!/bin/bash
IFS= read -r -n 10 -d '' ARG < "$INPUT_FILE"
java -jar foo.jar "$ARG"
The shell's LANG/LC_* can't be set to Cp1250. I guess this might be the culprit, right? The shell kinda tries to pass it in a "binary way" but apparently fails.
AFAIK, the -Dfile.encoding can be used to override JVM's detected shell charset in args. Is this relevant? I've tried that but no luck here.
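One possible workaround, sketched below under a strong assumption: if the JVM happened to decode the arguments with a lossless single-byte charset such as ISO-8859-1, the original bytes can be recovered and reinterpreted as Cp1250 inside the program. Under a UTF-8 locale a stray 85h byte has most likely already been replaced before main() sees it, so this would not help there. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ArgRedecode {
    // Undo an assumed ISO-8859-1 decoding and reinterpret the raw bytes
    // in the charset the data was really written in.
    static String redecode(String arg, Charset actual) {
        byte[] raw = arg.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        for (String arg : args) {
            System.out.println(redecode(arg, cp1250));
        }
    }
}
```

A more robust design is usually to avoid the command line for such data and let the Java program read the bytes from the input file itself with an explicit charset.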

how to convert java class encoding to utf-8 [duplicate]

What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.
Best solutions so far:
On Linux/UNIX/OS X/cygwin:
Gnu iconv suggested by Troels Arvin is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
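For comparison, the combined conversion above could be approximated in Java roughly as follows. This is only a sketch of the same idea, not how recode works internally: the class name is made up, and the Base64 output is not line-wrapped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Recode {
    // Base64(UTF-8 text, LF line ends) -> Base64(Latin-1 text, CR-LF line ends)
    static String utf8ToLatin1(String base64Utf8) {
        // The MIME decoder tolerates line breaks inside the Base64 input.
        byte[] utf8 = Base64.getMimeDecoder().decode(base64Utf8);
        String text = new String(utf8, StandardCharsets.UTF_8);
        // Normalize first so existing CR-LF pairs are not doubled.
        String crlf = text.replace("\r\n", "\n").replace("\n", "\r\n");
        byte[] latin1 = crlf.getBytes(StandardCharsets.ISO_8859_1);
        return Base64.getEncoder().encodeToString(latin1);
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(utf8ToLatin1(arg));
        }
    }
}
```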
On Windows with Powershell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit
Do you mean iso-8859-1 support? Using "String" does this e.g. for vice versa
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
CsCvt - Kalytta's Character Set Converter is another great command line based conversion tool for Windows.
Stand-alone utility approach
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING the encoding of the input
-t ENCODING the encoding of the output
You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
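Since the question is about Java sources, the same from/to conversion can also be sketched in plain Java. This is a minimal example, assuming the whole file fits in memory; in.txt and out.txt simply mirror the iconv command above, and the class name is made up.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Transcode {
    // Decode the bytes with the source charset, then encode with the target one.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names, matching the iconv example.
        byte[] in = Files.readAllBytes(Paths.get("in.txt"));
        byte[] out = transcode(in, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        Files.write(Paths.get("out.txt"), out);
    }
}
```

Note that String.getBytes silently substitutes characters the target charset cannot represent, so this sketch will not warn you about lossy conversions.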
Try VIM
If you have vim you can use this:
Not tested for every encoding.
The cool part about this is that you don't have to know the source encoding
vim +"set nobomb | set fenc=utf8 | x" filename.txt
Be aware that this command modifies the file in place.
Explanation part!
+ : Used by vim to run a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
| : Separator of multiple commands (like ; in bash)
set nobomb : no utf-8 BOM
set fenc=utf8 : Set new encoding to utf-8 doc link
x : Save and close file
filename.txt : path to the file
" : quotes are needed because of the pipes (otherwise bash would treat them as shell pipes)
Under Linux you can use the very powerful recode command to try and convert between the different charsets as well as any line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
iconv(1)
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
Try iconv Bash function
I've put this into .bashrc:
utf8()
{
    # Convert into a temporary file; replace the original only if iconv succeeded
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
..to be able to convert files like so:
utf8 MyClass.java
Try Notepad++
On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".
Oneliner using find, with automatic character set detection
The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:
$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f "$(file -bi "$1" | sed -e "s/.*[ ]charset=//")" -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a sub shell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.
Whereby file -bi means:
-b, --brief
Do not prepend filenames to output lines (brief mode).
-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text. The sed command cuts this to only us-ascii as is required by iconv.
The find command is very useful for such file management automation.
Click here for more find galore.
Assuming you don't know the input encoding and still wish to automate most of the conversion, I put together this one-liner by combining previous answers.
iconv -f $(chardetect input.text | awk '{print $2}') -t utf-8 -o output.text input.text
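Java's standard library has no general charset detector like file or chardetect, so the following is a different, much narrower technique: a strict check of whether the bytes are already valid UTF-8, which is often enough to decide whether a conversion is needed at all. The class name is made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // e-acute encoded as UTF-8
        byte[] latin1 = {(byte) 0xE9};            // e-acute encoded as ISO-8859-1
        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(latin1));  // false
    }
}
```

A file that fails this check still has to be identified by other means; pure ASCII files pass it, which is harmless because ASCII is a subset of UTF-8.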
DOS/Windows: use Code page
chcp 65001>NUL
type ascii.txt > unicode.txt
The chcp command changes the code page. Code page 65001 is Microsoft's name for UTF-8. Once the code page is set, the output generated by the following commands uses that code page.
PHP iconv()
iconv("UTF-8", "ISO-8859-15", $input);
Try EncodingChecker
EncodingChecker on github
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.
To write a properties file (Java), I normally use this on Linux (Mint and Ubuntu distributions):
$ native2ascii filename.properties
For example:
$ cat test.properties
first=Execução número um
second=Execução número dois
$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois
PS: I wrote "Execution number one/two" in Portuguese to force special characters.
In my case, on the first run I received this message:
$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>
When I installed the first option (gcj-5-jdk), the problem was resolved.
I hope this helps someone.
With ruby:
ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"
Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
Simply change the encoding of the loaded file in IntelliJ IDEA, on the right of the status bar (bottom), where the current charset is indicated. It prompts you to Reload or Convert; use Convert. Make sure you have backed up the original file in advance.
In powershell:
function Recode($InCharset, $InFile, $OutCharset, $OutFile) {
    # Read input file in the source encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
    $Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
    # Write output file in the destination encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)
    [System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}
Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt"
For a list of supported encoding names:
https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding
There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding
It supports a wide range of encodings, including some rare ones, like IBM code page 37.
Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.
My favorite tool for this is Jedit (a java based text editor) which has two very convenient features :
One which enables the user to reload a text with a different encoding (and, as such, to control visually the result)
Another one which enables the user to explicitly choose the encoding (and end of line char) before saving
If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding-wrangling — its "conversion preview" allows you to see all invalid characters in the output encoding, and fix/remove them.
And it's open-source now, so yay for them 😉.
Visual Studio Code
Open your file in Visual Studio Code
Reopen with Encoding: In the bottom status bar, to the right, you should see your current file encoding (eg "UTF-8"). Click this and select "Reopen with Encoding".
Select the correct encoding of the file (eg: ISO 8859-2).
Confirm that your content is displaying as expected.
Save with Encoding: The bottom status bar should now display your new encoding format (eg: ISO 8859-2). Click this and choose "Save with Encoding" and select UTF-8 (or whatever new encoding you want).
NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.
As described on How do I correct the character encoding of a file? Synalyze It! lets you easily convert on OS X between all encodings supported by the ICU library.
Additionally you can display some bytes of a file translated to Unicode from all the encodings to see quickly which is the right one for your file.

Java difference between running from netbeans and cmd

I have a program that writes text data to files. When I run it from netbeans the files are in a correct encoding and you can read them with a notepad. When I run it from cmd using java -cp ....jar the encoding is different.
What may be the issue??
PS: I've checked that the same JRE version (v1.8.0_31) is used in both cases.
Netbeans startup scripts may specify a different encoding than your system default. You can check in your netbeans.conf.
You can set the file.encoding property when invoking java. For example, java -Dfile.encoding=UTF8 -cp... jar.
If you do not want to be surprised when running your code in different environments, an even better solution is to specify the encoding in your source code.
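Specifying the encoding in code could look roughly like this. It is a minimal sketch using Files.newBufferedWriter with an explicit charset; the class name, method name and out.txt are made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ExplicitCharsetWriter {
    // Write text with a charset chosen in code, independent of file.encoding.
    static void writeText(Path file, String text, Charset charset) {
        try (BufferedWriter out = Files.newBufferedWriter(file, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical output file; the umlauts are escaped to keep the source ASCII.
        writeText(Paths.get("out.txt"), "Gr\u00fc\u00dfe\n", StandardCharsets.UTF_8);
    }
}
```

With this approach the file should come out the same whether the program is started from NetBeans or from cmd.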
Further reading:
file encoding: Character and Byte Streams
netbeans.conf encoding options: How To Display UTF8 In Netbeans 7?

Eclipse wrong Java properties UTF-8 encoding

I have a JavaEE project in which I use message properties files. The encoding of those files is set to UTF-8. In the files I use German umlauts like ä, ö, ü. The problem is that sometimes those characters are replaced with Unicode escapes like \uFFFD\uFFFD, but not every character. Now I have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not for every occurrence of ä and ü.
The Git diff shows me something like this:
mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
mail.title=Einladungs-E-Mail
box.preview=Vorschau
box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
@@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen
browser.selectImage=übernehmen
browser.starImage=merken
browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
browser.clear_drop_box=löschen
Also, there are lines changed, which I have not touched. I don't understand why I get such a behavior. What could be the cause for the above problem?
My system:
Antergos / Arch Linux
System encoding UTF-8
Python 3.5.0 (default, Sep 20 2015, 11:28:25)
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Eclipse Mars 1
Text file encoding UTF-8
Properties file encoding UTF-8
Tomcat 8
Java JDK 8
If I use another editor like Atom to edit those message properties files, I don't run into this problem.
I also realized in a case, if I copy the original value browser.searchForSimilarImages=ähnliche from Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche in Eclipse with that, then I have the correct umlauts in the message properties file.
Root cause:
By default ISO 8859-1 character encoding is used for Eclipse properties file (read here), so if the file contains any character beyond ISO 8859-1 then it will not be processed as expected.
Solution 1
If you use Eclipse then you will notice that it implicitly converts the special character into \uXXXX equivalent. Try copying
会意字 / 會意字
into a properties file opened in Eclipse.
EDIT: As per comment from OP
Update the encoding in Eclipse as shown below. If you set the encoding to UTF-32, you can even see Chinese characters, which you normally cannot.
How to change Encoding of properties file in Eclipse: See this Eclipse Bugzilla bug for more details, which talks about several other possibilities and in the end suggest what I have highlighted below.
Chinese characters can be seen in Eclipse after encoding is set properly:
Solution 2
If the above doesn't work consistently for you (it does work for me and I never see encoding issues), then try an Eclipse plugin that handles the encoding of properties or other files, for example Eclipse ResourceBundle Editor or Extended Resource-Bundle editor.
I would recommend using Eclipse ResourceBundle Editor.
Solution 3
Another way to change the encoding of a file is the Edit --> Set Encoding option. It really matters because it changes the default character set and file encoding. Experiment with different encodings via Edit --> Set Encoding and print the result from Java with System.out.println("Default Charset=" + Charset.defaultCharset()); and System.out.println(System.getProperty("file.encoding"));
As an aside: 1
Process the properties file so that its content is valid ISO 8859-1 by using native2ascii - Native-to-ASCII Converter
What native2ascii does: it converts every non-ASCII character into its \uXXXX equivalent. This is a good tool because you need not look up the \uXXXX equivalent of each special character.
Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
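If native2ascii is not available, the escaping it performs can be approximated in a few lines of Java. This is only a sketch of the idea (every character outside ASCII becomes a backslash-u escape, as in the Execução example earlier), not the tool's actual implementation; the class name is made up.

```java
class UnicodeEscaper {
    // Replace every non-ASCII UTF-16 code unit with a backslash-u escape.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escape each command-line argument and print the result.
        for (String arg : args) {
            System.out.println(escape(arg));
        }
    }
}
```

Because it works on UTF-16 code units, characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is the form properties files expect.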
As an aside: 2
Every computer program, whether an IDE, application server, web server, or browser, understands only bits, so it needs to know how to interpret them: depending on the encoding used, the same bits can represent different characters. That is where an encoding comes into the picture: it gives each character a unique representation so that all programs and operating systems know exactly how to interpret it.
So, if you have written a file using some encoding scheme, say UTF-8, and then read it in any editor that is also set to UTF-8, you can expect it to display correctly.
Please read this answer of mine for more details, from a browser-server perspective.
Add the following arguments to your eclipse.ini file.
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
By default Eclipse uses the encoding format picked up by the Java Virtual Machine (JVM). Also, you can set the file encoding to utf-8.
Resolved by doing the below changes :
Modify the properties below in eclipse.ini, then close and restart Eclipse:
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
Set the encoding to the UTF-8 [Navigation path : Edit -> Set encoding]
Properties Files are expected to be ISO-8859-1 (Latin-1) encoded.
Most likely this is what Eclipse was set to by default as well.
You have to make sure that every tool which is run in the build or whatever disregards the spec and uses UTF-8 instead.
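On the Java side, one way to "use UTF-8 instead" is to load the file through a Reader with an explicit charset, because Properties.load(InputStream) decodes the bytes as ISO-8859-1. A minimal sketch, with made-up class and method names and an in-memory byte array standing in for the real file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8Properties {
    // Decode the content as UTF-8 before Properties parses it.
    static Properties loadUtf8(byte[] content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // The stream-based load always assumes ISO-8859-1.
    static Properties loadLatin1(byte[] content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        byte[] file = "browser.removeImage=L\u00f6schen".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadUtf8(file).getProperty("browser.removeImage"));
        System.out.println(loadLatin1(file).getProperty("browser.removeImage"));
    }
}
```

The second call shows the typical mojibake: each UTF-8 byte of the umlaut is turned into its own Latin-1 character.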
This looks like a mixture of Eclipse and git encoding or rather not-encoding.
Git uses raw bytes and doesn't care about encoding. Using git diff you might get characters like shown here. An example there is R<C3><BC>ckg<C3><A4>ngig # should be "Rückgängig".
As you can see there's two funny bracket things showing per umlaut. And in your editor, there are always two \uFFFD for each umlaut in the lines starting with +.
So I assume that your UTF-8 editor tries to interpret the git notation and fails. This in turn leads to the representation \uFFFD, which basically means that this is a character whose value is unknown or unrepresentable (see here).
As suggested in the first link, you can try setting LESSCHARSET=UTF-8 as an environment variable (Windows). On Linux it should probably go in /etc/profile.
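To see how exactly two \uFFFD per umlaut can come about, here is one possible mechanism reproduced in Java: an umlaut is two bytes in UTF-8, and a decoder that cannot interpret either byte replaces each one separately. US-ASCII is used here purely as an illustrative stand-in for "a decoder that rejects both bytes"; whether that is what happens in Eclipse is an assumption, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encode as UTF-8, then decode with a charset that rejects every byte above 0x7F.
    static String utf8AsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String damaged = utf8AsAscii("\u00e4hnliche");
        // Print the code points so the replacement characters are visible.
        for (char c : damaged.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The output starts with U+FFFD U+FFFD, followed by the unchanged ASCII letters.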
see: a marker such as FFFD (REPLACEMENT CHARACTER) in http://unicode.org/faq/utf_bom.html
and see native2ascii --help
-encoding encoding_name
Specifies the name of the character encoding to be used by the conversion procedure. If this option is not present, then the
default character encoding (as determined by the java.nio.charset.Charset.defaultCharset method) is used. The encoding_name
string must be the name of a character encoding that is supported by the JRE. See Supported Encodings at
http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
a case
$ file yourfile.properties
yourfile.properties : ISO-8859 text, with very long lines
$ native2ascii -encoding ISO-8859-1 yourfile.properties yourfile.properties

Java source file encoding with Chinese character

I import a Java project from Windows platform to Ubuntu.
My Ubuntu is 10.10, Gnome environment: My LANGUAGE is set to en_US:en
My terminal's character encoding is: Unicode (UTF-8)
My IDE is eclipse and text file encoding is: GBK.
In the source files, there are some constants containing Chinese characters.
The project builds successfully on Windows with ant,
but on Ubuntu, I get compile error:
illegal character: \65533
I don't want to use \uxxxx format as the file is already there,
And I've tried the -encoding option for javac, but still can't compile.
I think the problem lies not with Ubuntu, Ubuntu's console, javac or Eclipse but with the way you transferred the file from Windows to Ubuntu. You have to save it as UTF-8 before you copy it to Ubuntu; otherwise the code-point information that depends on your Windows locale is already lost.
Did you specify the encoding option of the <javac> task in your build.xml?
It should look like this:
<javac encoding="GBK" ...>
If you haven't specified it, then on Windows it will use the platform default encoding (which is GBK in your setup) and on Linux it will use the platform default encoding (which is UTF-8 in your setup).
Since you want the build to work on both platforms (preferably without changing the configuration of either platform), you need to specify the encoding when you compile.
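The number in the error message is itself a hint: 65533 is decimal for U+FFFD, the replacement character a decoder emits for bytes it cannot interpret. The sketch below decodes the GBK bytes of one Chinese character as UTF-8, which is roughly what happens when a GBK source file is compiled under a UTF-8 default. The class name is made up, and the GBK charset is assumed to be available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class IllegalCharDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The two bytes that encode U+4E2D in GBK.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0};

        // Wrong charset: the first decoded char is the replacement character.
        String wrong = decode(gbk, StandardCharsets.UTF_8);
        System.out.println((int) wrong.charAt(0)); // 65533

        // Right charset (assumes this JDK ships GBK; forName throws otherwise).
        String right = decode(gbk, Charset.forName("GBK"));
        System.out.printf("U+%04X%n", (int) right.charAt(0));
    }
}
```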
You need to convert your source code from your Windows code page to UTF-8. Use iconv for this.
