Source encoding of files in a Maven Java project

The source encoding of the .java files in our Maven project, which is stored in Subversion, is mostly ASCII, with some files UTF-8.
I think the intention was that these files would all be UTF-8; in the pom file the source encoding is specified as UTF-8.
Now our build fails. Specifically, our SonarQube analysis fails on a .java file which is ISO-8859 encoded and which has a variable with a special character in its name. Using a special character is not a good idea, I think, but that aside: shouldn't the Java files have a consistent (UTF-8) encoding?
Or does it not matter that most are ASCII and only some are UTF-8? Is it the thought that counts?
By the way, I don't understand how these files end up with ASCII encoding. When I use an IDE or an editor like SublimeText, files end up as UTF-8.
ASCII I only get when I use Notepad on MS Windows, and Java developers do not typically use that for programming.
Should we change the source files to use UTF-8? Or maybe it doesn't matter and we can leave things as they are?
As an example: on MS Windows I create one file using SublimeText and one file using Notepad.exe, and I put the text 1234Ï in both. The text ends in a special character, an I with two dots (diaeresis).
When I look at these files on Linux using file:
ostraaten#io:/tmp/iconv$ file sublimtext.txt
sublimtext.txt: UTF-8 Unicode (with BOM) text, with no line terminators
ostraaten#io:/tmp/iconv$ file notepad.txt
notepad.txt: ISO-8859 text, with no line terminators
ostraaten#io:/tmp/iconv$
So this shows that Notepad saved the file as ISO-8859 regardless of the contents. When I check the files using iconv:
ostraaten#io:/tmp/iconv$ iconv -f UTF-8 notepad.txt -o /dev/null
iconv: incomplete character or shift sequence at end of buffer
ostraaten#io:/tmp/iconv$ iconv -f UTF-8 sublimtext.txt -o /dev/null
ostraaten#io:/tmp/iconv$
I can open and save the file notepad.txt using SublimeText; the encoding still shows up as ISO-8859.
The character does display correctly in both files. So this supports the idea that the editor tries to determine the encoding from the contents of the file, yet elsewhere the file is still detected as ISO-8859.
I can change the encoding using iconv
ostraaten#io:/tmp/iconv$ iconv -f ISO-8859-15 -t UTF-8 notepad.txt > notepad-utf8.txt
ostraaten#io:/tmp/iconv$ file notepad-utf8.txt
notepad-utf8.txt: UTF-8 Unicode text, with no line terminators
ostraaten#io:/tmp/iconv$
ostraaten#io:/tmp/iconv$ iconv -f UTF-8 notepad-utf8.txt -o /dev/null
The conversion was successful: the "incomplete character" message is gone.

Seven-bit ASCII is a subset of UTF-8, so a pure-ASCII file is already valid UTF-8. ISO-8859-1 is Latin-1, and its bytes above 127 are exactly the problematic ones in a UTF-8 build.
So someone worked around UTF-8 with their editor or IDE. Some version-control setups substitute text back into the source on check-in (keyword expansion), but that does not seem to be the case here.
UTF-8 is a solid choice, though it needs some care.
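If you want the build itself to flag stragglers, the iconv check above has a direct Java equivalent: decode the bytes strictly and let the decoder throw on the first invalid sequence. A minimal sketch (class and file names are just examples):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        try {
            // REPORT makes decode() throw instead of silently substituting U+FFFD
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            System.out.println(args[0] + ": valid UTF-8 (pure ASCII also passes)");
        } catch (CharacterCodingException e) {
            System.out.println(args[0] + ": not valid UTF-8, e.g. ISO-8859-1 bytes above 127");
        }
    }
}

Run against notepad.txt and sublimtext.txt from above, this reproduces the iconv results.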

Related

Can't compile java-file on Ubuntu with special character å, ä, ö (works on windows)

I've got a Java file containing the Swedish characters å, ä, ö (I have to include these in the file; I'm parsing an ISO-8859-1, Latin-1 text file which includes them). The file compiles and runs on a Windows computer.
When compiling the file on Ubuntu in bash, however, I get the following error:
ConstructDatabase.java:76: error: unmappable character for encoding ASCII
case 'å': return 27;
^
ConstructDatabase.java:77: error: unmappable character for encoding ASCII
case 'ä': return 28;
^
ConstructDatabase.java:78: error: unmappable character for encoding ASCII
case 'ö': return 29;
I tried using the javac -encoding ISO-8859-1 flag, which makes the file compile, but then it does not work as intended. I also tried export LC_ALL=C (I'm not sure why it would affect the Java compilation, but it was suggested in other threads), and it still doesn't work.
Any tips would be greatly appreciated.
The compile error gives most of the information necessary: "unmappable character for encoding ASCII". That means the compiler expects the Java source files to be encoded in ASCII, and ASCII doesn't contain these characters.
Your source files somehow contain the Swedish characters, maybe encoded as ISO-8859-1, maybe as UTF-8 (to mention the two most likely ones). You have to find out which one your source files are using (and that's not a trivial task).
Your Ubuntu environment seems to be using ASCII as the default encoding. Echoing the source file to the terminal might give you a hint: if instead of each Swedish character you see a combination of two cryptic characters, then the file is probably UTF-8 encoded.
You have to tell the compiler on Ubuntu the correct encoding. As javac -encoding ISO-8859-1 doesn't work as expected, I'd guess that your files are encoded in UTF-8, so javac -encoding UTF-8 might be the solution.
Instead of messing around with the encodings, you can also replace the characters in the source file with \uXXXX escapes. That will work in ASCII, UTF-8, ISO-8859-1 and all other ASCII-based encodings.
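For example, assuming the switch maps å, ä and ö in that order (the return values are taken from the compiler output above; class and method names are made up for illustration):

public class SwedishIndex {
    // Pure ASCII source: \uXXXX escapes survive any -encoding setting
    static int index(char c) {
        switch (c) {
            case '\u00E5': return 27; // a with ring above
            case '\u00E4': return 28; // a with diaeresis
            case '\u00F6': return 29; // o with diaeresis
            default: return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(index('\u00E5')); // 27
    }
}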

how to convert java class encoding to utf-8 [duplicate]

What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.
Best solutions so far:
On Linux/UNIX/OS X/cygwin:
Gnu iconv suggested by Troels Arvin is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64-encoded UTF-8 file with Unix line endings to a Base64-encoded Latin-1 file with DOS line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
On Windows with Powershell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit
Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
CsCvt - Kalytta's Character Set Converter is another great command line based conversion tool for Windows.
Stand-alone utility approach
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING the encoding of the input
-t ENCODING the encoding of the output
You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
Try VIM
If you have vim you can use this (not tested for every encoding). The cool part about it is that you don't have to know the source encoding:
vim +"set nobomb | set fenc=utf8 | x" filename.txt
Be aware that this command modifies the file in place.
Explanation part!
+ : Used by vim to enter a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
| : Separator of multiple commands (like ; in bash)
set nobomb : no UTF-8 BOM
set fenc=utf8 : Set the new encoding to UTF-8
x : Save and close the file
filename.txt : path to the file
" : Quotes are here because of the pipes (otherwise bash would use them as bash pipes)
Under Linux you can use the very powerful recode command to try and convert between the different charsets as well as any line ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
iconv(1)
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
Try iconv Bash function
I've put this into .bashrc:
utf8()
{
    # Quote "$1" so filenames with spaces work; only replace the original if iconv succeeds
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}
...to be able to convert files like so:
utf8 MyClass.java
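A comparable sketch in Java, if you'd rather avoid shell dependencies in a build step (like the function above, it assumes the input really is ISO-8859-1):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToUtf8 {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        // ISO-8859-1 maps every byte to a character, so this read never fails
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        // Re-encode as UTF-8, overwriting the original (back it up first)
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }
}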
Try Notepad++
On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".
Oneliner using find, with automatic character set detection
The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:
$ find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag and passing the filename as the positional argument "$1" via -- {}. In between, the UTF-8 output file is temporarily named converted.
Here file -bi means:
-b, --brief
Do not prepend filenames to output lines (brief mode).
-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text. The sed command cuts this to only us-ascii as is required by iconv.
The find command is very useful for such file management automation.
Assuming you don't know the input encoding and still wish to automate most of the conversion, I put together this one-liner by combining the previous answers (chardetect is part of the Python chardet package):
iconv -f $(chardetect input.text | awk '{print $2}') -t utf-8 -o output.text
DOS/Windows: use Code page
chcp 65001>NUL
type ascii.txt > unicode.txt
The chcp command changes the code page. Code page 65001 is Microsoft's name for UTF-8. After the code page is set, the output generated by subsequent commands uses it.
PHP iconv()
iconv("UTF-8", "ISO-8859-15", $input);
Try EncodingChecker
EncodingChecker on github
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.
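The BOM part of such detection is simple enough to sketch by hand; the hard part is guessing for BOM-less files, which is where libraries like UtfUnknown come in. A rough Java sketch of BOM sniffing only:

import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {
    // Returns a guess based only on a leading byte-order mark
    static String sniff(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return "UTF-8 with BOM";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return "UTF-16LE (or UTF-32LE)";
        return "no BOM, heuristics needed";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sniff(Files.readAllBytes(Paths.get(args[0]))));
    }
}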
To write a properties file (Java), I normally use this on Linux (Mint and Ubuntu distributions):
$ native2ascii filename.properties
For example:
$ cat test.properties
first=Execução número um
second=Execução número dois
$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois
PS: I wrote "execution number one/two" in Portuguese to force the special characters.
In my case, on the first run I received this message:
$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>
When I installed the first option (gcj-5-jdk), the problem was solved.
I hope this helps someone.
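If native2ascii isn't available, the escaping itself is easy to replicate; a minimal Java sketch of the same idea (not a full replacement, it ignores native2ascii's -encoding handling):

public class Native2AsciiLite {
    // Escape everything outside 7-bit ASCII as \uXXXX, like native2ascii does
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) sb.append(c);
            else sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("first=Execução número um"));
        // prints: first=Execu\u00e7\u00e3o n\u00famero um
    }
}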
With ruby:
ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"
Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
Simply change the encoding of the loaded file in the IntelliJ IDEA IDE, at the right of the status bar (bottom), where the current charset is indicated. It prompts you to Reload or Convert; use Convert. Make sure you have backed up the original file in advance.
In PowerShell:
function Recode($InCharset, $InFile, $OutCharset, $OutFile) {
# Read input file in the source encoding
$Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
$Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
# Write output file in the destination encoding
$Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)
[System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}
Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt"
For a list of supported encoding names:
https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding
There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding
It supports a wide range of encodings, including some rare ones, like IBM code page 37.
Use this Python script: https://github.com/goerz/convert_encoding.py
Works on any platform. Requires Python 2.7.
My favorite tool for this is jEdit (a Java-based text editor), which has two very convenient features:
One that enables the user to reload a text with a different encoding (and, as such, to check the result visually)
Another that enables the user to explicitly choose the encoding (and the end-of-line character) before saving
If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding-wrangling — its "conversion preview" allows you to see all invalid characters in the output encoding, and fix/remove them.
And it's open-source now, so yay for them 😉.
Visual Studio Code
Open your file in Visual Studio Code
Reopen with Encoding: In the bottom status bar, to the right, you should see your current file encoding (eg "UTF-8"). Click this and select "Reopen with Encoding".
Select the correct encoding of the file (eg: ISO 8859-2).
Confirm that your content is displaying as expected.
Save with Encoding: The bottom status bar should now display your new encoding format (eg: ISO 8859-2). Click this and choose "Save with Encoding" and select UTF-8 (or whatever new encoding you want).
NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.
As described on How do I correct the character encoding of a file? Synalyze It! lets you easily convert on OS X between all encodings supported by the ICU library.
Additionally you can display some bytes of a file translated to Unicode from all the encodings to see quickly which is the right one for your file.

Eclipse wrong Java properties UTF-8 encoding

I have a Java EE project in which I use message properties files. The encoding of those files is set to UTF-8. In the files I use the German umlauts ä, ö, ü. The problem is that sometimes those characters are replaced with Unicode escapes like \uFFFD\uFFFD, but not for every character. Right now I have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not for every occurrence of ä and ü.
The Git diff shows me something like this:
mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
mail.title=Einladungs-E-Mail
box.preview=Vorschau
box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
@@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen
browser.selectImage=übernehmen
browser.starImage=merken
browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
browser.clear_drop_box=löschen
Also, lines that I have not touched show up as changed. I don't understand why I get such behavior. What could be the cause of the above problem?
My system:
Antergos / Arch Linux
System encoding UTF-8
Python 3.5.0 (default, Sep 20 2015, 11:28:25)
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Eclipse Mars 1
Text file encoding UTF-8
Properties file encoding UTF-8
Tomcat 8
Java JDK 8
If I use another editor like Atom to edit those message properties files, I don't run into this problem.
I also noticed in one case that if I copy the original value browser.searchForSimilarImages=ähnliche from the Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche with it in Eclipse, then I get the correct umlauts in the message properties file.
Root cause:
By default the ISO 8859-1 character encoding is used for Eclipse properties files, so if a file contains any character beyond ISO 8859-1 it will not be processed as expected.
Solution 1
If you use Eclipse then you will notice that it implicitly converts special characters into their \uXXXX equivalents. Try copying
会意字 / 會意字
into a properties file opened in Eclipse.
EDIT: As per a comment from the OP
Update the encoding of your Eclipse as shown below. If you set the encoding to UTF-32 then you can even see the Chinese characters, which you generally cannot.
How to change the encoding of a properties file in Eclipse: see this Eclipse Bugzilla bug for more details; it discusses several other possibilities and in the end suggests what I have highlighted below.
Chinese characters can be seen in Eclipse after the encoding is set properly.
Solution 2
If the above doesn't work consistently for you (it does work for me, and I never see encoding issues), then try an Eclipse plugin which handles the encoding of properties and other files, for example the Eclipse ResourceBundle Editor or the Extended Resource-Bundle editor.
I would recommend the Eclipse ResourceBundle Editor.
Solution 3
Another way to change the encoding of a file is the Edit --> Set Encoding option. It really matters because it changes the default character set and file encoding. Play around with Edit --> Set Encoding and run the following Java sysouts: System.out.println("Default Charset=" + Charset.defaultCharset()); and System.out.println(System.getProperty("file.encoding"));
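Put together as a runnable check, those two sysouts look like this:

import java.nio.charset.Charset;

public class EncodingInfo {
    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
    }
}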
As an aside: 1
Process the properties file into pure ASCII content by using native2ascii - the Native-to-ASCII Converter.
What native2ascii does: it converts all non-ASCII characters into their \uXXXX equivalents. This is a good tool because you don't need to look up the \uXXXX equivalent of each special character.
Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
As an aside: 2
Every computer program, whether an IDE, application server, web server or browser, understands only bits, so it needs to know how to interpret those bits, because depending on the encoding used the same bits can represent different characters. That's where "encoding" comes into the picture: it assigns each character a unique identifier so that all computer programs, diverse OSes, etc. know exactly how to interpret it.
So, if you write a file using some encoding scheme, let's say UTF-8, and then read it with an editor that also uses UTF-8 as its encoding scheme, you can expect a correct display.
Please do read this answer of mine to get more details, albeit from a browser-server perspective.
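To see that principle concretely, encode with one charset and decode with another; a small self-contained demo (the sample string is taken from the diff above):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ähnliche".getBytes(StandardCharsets.UTF_8);
        // Right: decode with the charset it was encoded with
        System.out.println(new String(utf8, StandardCharsets.UTF_8));      // ähnliche
        // Wrong: the two UTF-8 bytes of 'ä' appear as two Latin-1 characters
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã¤hnliche
    }
}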
Add the following arguments to your eclipse.ini file.
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
By default Eclipse uses the encoding format picked up by the Java Virtual Machine (JVM). Also, you can set the file encoding to utf-8.
Resolved by making the changes below:
Modify the properties below in eclipse.ini, then close and restart Eclipse:
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
Set the encoding to UTF-8 [navigation path: Edit -> Set Encoding]
Properties files are specified to be ISO-8859-1 (Latin-1) encoded.
Most likely this is what Eclipse was set to by default as well.
If you want UTF-8 instead, you have to make sure that every tool run in the build (or anywhere else) disregards the spec and uses UTF-8.
This looks like a mixture of Eclipse and git encoding, or rather non-encoding.
Git uses raw bytes and doesn't care about encoding. With git diff you might get characters like those shown here; an example there is R<C3><BC>ckg<C3><A4>ngig, which should be "Rückgängig".
As you can see, two funny bracket things show up per umlaut, and in your editor there are always two \uFFFD for each umlaut in the lines starting with +.
So I assume that your UTF-8 editor tries to interpret the git notation and fails. This in turn leads to the representation \uFFFD, which basically means a character whose value is unknown or unrepresentable (see here).
As suggested in the first link, you can try setting LESSCHARSET=UTF-8 in your environment variables (Windows); on Linux it should go in /etc/profile.
See "a marker such as FFFD (REPLACEMENT CHARACTER)" in http://unicode.org/faq/utf_bom.html
and see native2ascii --help
-encoding encoding_name
Specifies the name of the character encoding to be used by the conversion procedure. If this option is not present, then the
default character encoding (as determined by the java.nio.charset.Charset.defaultCharset method) is used. The encoding_name
string must be the name of a character encoding that is supported by the JRE. See Supported Encodings at
http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
An example case:
$ file yourfile.properties
yourfile.properties : ISO-8859 text, with very long lines
$ native2ascii -encoding ISO-8859-1 yourfile.properties yourfile.properties
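The \uFFFD characters from the question above can also be reproduced directly: String's default decoder substitutes U+FFFD (the replacement character) for every byte sequence that is invalid in the charset being decoded. A small demo:

import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // "ähnliche" as ISO-8859-1 bytes: the single byte 0xE4 is not valid UTF-8
        byte[] latin1 = "ähnliche".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded);                 // replacement char, then "hnliche"
        System.out.println((int) decoded.charAt(0)); // 65533 = U+FFFD
    }
}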

How to Download MS 932 Text File Encoding for Eclipse?

I want to run a Java file in Eclipse, but it contains some Japanese characters that are essential for it to run. I went to Window->Preferences->Text File Encoding->Other, but there is no MS 932 listed. How can I get this encoding into Eclipse? I tried checking online, of course, but found nothing. I tried switching to UTF-8, but it still does not work.
The available encodings might be restricted; maybe try SJIS too.
native2ascii -reverse -encoding MS932 My.java temp.txt
... rename My.Java
native2ascii -encoding UTF-8 temp.txt My.java
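Before converting, it may help to check which of these names the local JVM actually knows. MS932 and windows-31j are typically aliases of the same charset, while Shift_JIS is a close but not identical relative; treat that alias claim as something to verify on your JRE. A quick check:

import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        for (String name : new String[] {"MS932", "windows-31j", "Shift_JIS"}) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
        if (Charset.isSupported("MS932")) {
            Charset cs = Charset.forName("MS932");
            System.out.println("Canonical name: " + cs.name() + ", aliases: " + cs.aliases());
        }
    }
}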

Ant execution output encoding? How to use UTF-8?

I have a file encoded in UTF-8 which I want to read in java, change some things in the input and print the result to terminal (standard output) and to another file. I read and write the files and write to stdout with streams constructed to interpret UTF-8 encoding.
Everything is fine when I compile and run everything manually: the output file contains the UTF-8 characters, and stdout also prints them to the terminal.
The problem is when I want to compile and run the program using ant. The output (written to the terminal) produced by ant doesn't seem to use UTF-8, as the Polish diacritics are changed to '?'. Is there any way of forcing ant to use UTF-8? Also, can I somehow check which encoding it is currently using?
I searched for an answer, but all I found was how to make ant interpret UTF-8-encoded .java files.
You could try setting -Dfile.encoding=UTF-8 (for ant, e.g. via the ANT_OPTS environment variable); this sets the JVM's default encoding to UTF-8.
You may also want to check if your console encoding is UTF-8 (depends on the OS).
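To verify what the JVM that ant forks is actually using, print it from the program itself, and consider writing to stdout through an explicit UTF-8 stream instead of relying on the default. A sketch (the Polish test string is arbitrary):

import java.io.PrintStream;

public class AntEncodingProbe {
    public static void main(String[] args) throws Exception {
        // The default this JVM picked up (set via ANT_OPTS=-Dfile.encoding=UTF-8 for ant)
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        // Bypass the default: wrap stdout with an explicit UTF-8 encoder
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("zażółć gęślą jaźń");
    }
}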
