I have a file encoded in UTF-8 which I want to read in Java, change some things in the input, and print the result to the terminal (standard output) and to another file. I read and write the files and write to stdout with streams constructed to interpret UTF-8.
Everything is fine when I compile and run everything manually: the output file contains the UTF-8 characters, and stdout also prints them to the terminal.
The problem is when I want to compile and run the program using Ant. The output (written to the terminal) produced by Ant doesn't seem to use UTF-8, as Polish diacritics are changed to '?'. Is there any way of forcing Ant to use UTF-8? Also, can I somehow check which encoding it is using at present?
I searched for an answer, but all I found was how to make Ant interpret UTF-8 encoded .java files.
You could try setting -Dfile.encoding=UTF-8 (for example in the ANT_OPTS environment variable); this sets the default encoding of the JVM that runs Ant.
You may also want to check whether your console encoding is UTF-8 (that depends on the OS).
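To answer the second question: you can print the encoding a JVM is actually using from Java itself. A minimal sketch (the class name is mine); run it from your Ant build to see what encoding the Ant-spawned JVM gets:
import java.nio.charset.Charset;

public class ShowEncoding {
    public static void main(String[] args) {
        // The default charset is derived from file.encoding at JVM startup
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}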
I'm currently developing a project about chess. The main idea is to use it in a console such as CMD.
It currently works with an array[8][8] in which I store the chess pieces. But the main problem is:
When I want to print an emoji such as ♜, ♞, ♝, and so on, the output displays the emojis as ?.
I have already tried some things, like UTF-8, the Emoji-Java library, changing the fonts of the output console to compatible fonts... I've tried for hours and searched around the internet, but I can't find anything. If you can help me, I'd appreciate it.
[?][?][?][?][?][?][?][?]//♜, ♞, ♝, ♛, ♚, ♝, ♞, ♜.
[null][null][null][null][null][null][null][null]//Null= available space to move
[null][null][null][null][null][null][null][null]
[null][null][null][null][null][null][null][null]
[null][null][null][null][null][null][null][null]
[null][null][null][null][null][null][null][null]
[null][null][null][null][null][null][null][null]
[null][null][null][null][null][null][null][null]
//Please ignore the null values, they are going to be fixed when the problem is solved...
It's complicated, very complicated: it differs by OS, by OS version (Windows 7 vs. 10), and even by patch level (e.g., Windows 10 before and after the 2004 update).
So let me save you hours of further heartache by suggesting that you use a UI instead, where you can control the underlying character set, for example Swing or JavaFX.
However, if you insist on using the console, then you need to take a number of steps.
The first is to use a PrintWriter in your code to write out characters using the correct encoding:
// requires java.io.PrintWriter, java.io.OutputStreamWriter and java.nio.charset.StandardCharsets
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
consoleOut.println("your character here");
The next step is to pre-configure the console to use your character set. For example, on Windows you might use the chcp command before starting your jar file:
chcp 65001
java -jar .....
But not only that: you should also use the -Dfile.encoding flag when you start your jar:
java -Dfile.encoding=UTF-8 -jar yourChessApplication.jar
Now, assuming you got all those steps right, it might work, but it might not. You also need to ensure that all your source files are encoded in UTF-8. I won't go into that here because it differs by IDE, but if you are using something like NetBeans then you can configure the source encoding in the project properties.
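If you compile on the command line instead, javac has an -encoding flag for the same purpose (the source file name here is just a placeholder):
javac -encoding UTF-8 Chess.java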
I would also encourage you to use the Unicode character definition rather than the actual symbol in your code:
//Avoid this, it may fail for a number of reasons (mostly encoding related)
consoleOut.println("♜");
//The better way to write the character using the unicode definition
consoleOut.println("\u265C");
Now, even with all this, you still need to ensure that your chosen console uses the correct character set. Here are the steps to follow for PowerShell: Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10). Or for Windows cmd you can take a look here: How to make Unicode charset in cmd.exe by default
So with all of those steps completed you can compile this code:
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
consoleOut.println("Using UTF_8 output with the character: ♜");
consoleOut.println("Using UTF_8 output with the unicode definition: \u265C");
consoleOut.close();
And then run your compiled jar file in your console (PowerShell in this example) something like this (you won't need chcp 65001 if you configured the PowerShell console correctly):
chcp 65001
java -Dfile.encoding=UTF-8 -jar yourChessApplication.jar
And the output should give the following result:
Using UTF_8 output with the character: ♜
Using UTF_8 output with the unicode definition: ♜
But it might still fail to show correctly, in which case see my opening section about using a UI, or try a different console... It's complicated.
The source encoding of the .java files in our Maven project, which is stored in Subversion, is mostly ASCII, with some files UTF-8.
I think the intention was that these files would be UTF-8. In the POM file the source encoding is specified as UTF-8.
Now our build fails; specifically, our SonarQube analysis fails on a .java file which is ISO-8859 encoded and which has a variable with a special character. Using a special character is not a good idea, I think, but that aside: shouldn't the Java files have a consistent (UTF-8) encoding?
Or does it not matter that most are ASCII and only some are UTF-8? Is it the thought that counts?
By the way, I don't understand how these files end up with ASCII encoding. When I use an IDE or an editor like SublimeText, files end up as UTF-8.
I only get ASCII when I use Notepad on MS Windows, and Java developers do not typically use that for programming.
Should we change the source files to use UTF-8? Or maybe it doesn't matter and we can leave this as it is?
As an example: using MS Windows, I create one file using SublimeText and one file using Notepad.exe. I put the text 1234Ï in both files. The text contains a special character, a capital I with two dots (Ï).
When I look at these files on Linux using file:
ostraaten@io:/tmp/iconv$ file sublimtext.txt
sublimtext.txt: UTF-8 Unicode (with BOM) text, with no line terminators
ostraaten@io:/tmp/iconv$ file notepad.txt
notepad.txt: ISO-8859 text, with no line terminators
ostraaten@io:/tmp/iconv$
So this shows Notepad saved the file as ISO-8859 regardless of the contents. When I check the files using iconv:
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad.txt -o /dev/null
iconv: incomplete character or shift sequence at end of buffer
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 sublimtext.txt -o /dev/null
ostraaten@io:/tmp/iconv$
I can open and save the file notepad.txt using SublimeText, and the encoding still shows up as ISO-8859.
The character does display correctly in both files. So this supports the idea that the editor tries to determine the encoding from the contents of the file, but somewhere else the file is still marked and recognized as ISO-8859.
I can change the encoding using iconv:
ostraaten@io:/tmp/iconv$ iconv -f ISO-8859-15 -t UTF-8 notepad.txt > notepad-utf8.txt
ostraaten@io:/tmp/iconv$ file notepad-utf8.txt
notepad-utf8.txt: UTF-8 Unicode text, with no line terminators
ostraaten@io:/tmp/iconv$
ostraaten@io:/tmp/iconv$ iconv -f UTF-8 notepad-utf8.txt -o /dev/null
The conversion was successful, because the "incomplete character" message is gone.
Seven-bit ASCII is a subset of UTF-8, so a file containing only ASCII characters is also valid UTF-8; that is why tools report most of your files as ASCII. ISO-8859-1 (Latin-1), on the other hand, encodes its special characters as 8-bit bytes that are not valid UTF-8 sequences.
So someone bypassed UTF-8 with their editor or IDE. Some version control systems substitute text back into the source on check-in, but in your case that seems not to be what happened.
UTF-8 is a solid choice, though it needs some care.
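Part of that care can be automated. Here is a minimal sketch (mine, not from the thread) of the same well-formedness check the iconv commands above perform, using a strict CharsetDecoder:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // REPORT makes the decoder throw instead of silently replacing bad bytes
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            System.out.println(args[0] + ": valid UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println(args[0] + ": not valid UTF-8 (" + e + ")");
        }
    }
}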
I have a program that writes text data to files. When I run it from NetBeans, the files are in the correct encoding and you can read them with Notepad. When I run it from cmd using java -cp ....jar, the encoding is different.
What might be the issue?
P.S. I've checked that the JRE version that executes is the same (v1.8.0_31).
NetBeans startup scripts may specify a different encoding than your system default. You can check in your netbeans.conf.
You can set the file.encoding property when invoking java, for example java -Dfile.encoding=UTF8 -cp ....jar.
If you do not want to be surprised when running your code in different environments, an even better solution is to specify the encoding in your source code, as in the sketch below.
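For example, a minimal sketch (the file name and text are placeholders) that writes a file with an explicit charset, so the platform default never matters:
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

try (Writer out = new OutputStreamWriter(
        new FileOutputStream("data.txt"), StandardCharsets.UTF_8)) {
    out.write("text with non-ASCII characters: äöü");
}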
Further reading:
file encoding: Character and Byte Streams
netbeans.conf encoding options: How To Display UTF8 In Netbeans 7?
I have a Java EE project in which I use message properties files. The encoding of those files is set to UTF-8. In the files I use German umlauts like ä, ö, ü. The problem is that sometimes those characters are replaced with \uFFFD\uFFFD, but not for every character. Right now I have a case where ä and ü are both replaced with \uFFFD\uFFFD, but not at every occurrence of ä and ü.
The Git diff shows me something like this:
mail.adresses=E-Mail hinzufügen:
-mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen.
+mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen.
mail.title=Einladungs-E-Mail
box.preview=Vorschau
box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen.
@@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen
browser.selectImage=übernehmen
browser.starImage=merken
browser.removeImage=Löschen
-browser.searchForSimilarImages=ähnliche
+browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
browser.clear_drop_box=löschen
Also, lines which I have not touched show up as changed. I don't understand why I get such behavior. What could be the cause of the above problem?
My system:
Antergos / Arch Linux
System encoding UTF-8
Python 3.5.0 (default, Sep 20 2015, 11:28:25)
[GCC 5.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Eclipse Mars 1
Text file encoding UTF-8
Properties file encoding UTF-8
Tomcat 8
Java JDK 8
If I use another editor like Atom to edit those message properties files, I don't run into this problem.
I also noticed in one case that if I copy the original value browser.searchForSimilarImages=ähnliche from the Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche with it in Eclipse, then I get the correct umlauts in the message properties file.
Root cause:
By default, the ISO 8859-1 character encoding is used for Eclipse properties files, so if a file contains any character beyond ISO 8859-1, it will not be processed as expected.
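In practice this means Properties.load(InputStream) always decodes the file as ISO 8859-1. A minimal runtime workaround (my sketch, not part of the original answer; it assumes the file really is saved as UTF-8, and the file name is hypothetical) is to load it through an explicit reader, because Properties.load(Reader) honors the reader's charset:
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

Properties messages = new Properties();
try (InputStreamReader in = new InputStreamReader(
        new FileInputStream("messages.properties"), StandardCharsets.UTF_8)) {
    messages.load(in);  // decodes with UTF-8 instead of the ISO 8859-1 default
}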
Solution 1
If you use Eclipse then you will notice that it implicitly converts special characters into their \uXXXX equivalents. Try copying
会意字 / 會意字
into a properties file opened in Eclipse.
EDIT, as per a comment from the OP: update the encoding settings of your Eclipse as described below. If you set the encoding to UTF-32, you can even see Chinese characters, which you normally cannot.
How to change the encoding of a properties file in Eclipse: see this Eclipse Bugzilla bug for more details; it discusses several other possibilities and in the end suggests the approach highlighted below.
Chinese characters can be seen in Eclipse after the encoding is set properly.
Solution 2
If the above doesn't work consistently for you (it does work for me, and I never see encoding issues), then try an Eclipse plugin which handles the encoding of properties and other files, for example the Eclipse ResourceBundle Editor or the Extended Resource-Bundle editor.
I would recommend the Eclipse ResourceBundle Editor.
Solution 3
Another way to change the encoding of a file is the Edit --> Set Encoding option. It really matters, because it changes the default character set and file encoding. Play around with it by changing the encoding via Edit --> Set Encoding and checking the result with the following Java statements:
System.out.println("Default Charset=" + Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
As an aside: 1
Process the properties file into ISO 8859-1 safe content by using native2ascii - Native-to-ASCII Converter.
What native2ascii does: it converts all non-ASCII characters into their \uXXXX equivalents. This is a good tool because you need not search for the \uXXXX equivalent of each special character yourself; for example, ähnliche becomes \u00E4hnliche.
Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
As an aside: 2
Every computer program, whether an IDE, application server, web server, or browser, understands only bits, so it needs to know how to interpret those bits to make the expected sense of them; depending on the encoding used, the same bits can represent different characters. That's where "encoding" comes into the picture: it gives a unique identifier for each character, so that all computer programs, diverse OSes, etc. know the exact right way to interpret it.
So, if you have written into a file using some encoding scheme, let's say UTF-8, and then read it with an editor that is also using UTF-8 as its encoding scheme, then you can expect a correct display.
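A minimal sketch (mine) of that round trip, showing what happens when the writer and the reader disagree about the encoding:
import java.nio.charset.StandardCharsets;

byte[] bytes = "Löschen".getBytes(StandardCharsets.UTF_8);           // written as UTF-8
System.out.println(new String(bytes, StandardCharsets.UTF_8));       // Löschen
System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));  // LÃ¶schen (mojibake)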
Please do read this answer of mine for more details, from a browser-server perspective.
Add the following arguments to your eclipse.ini file.
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
By default, Eclipse uses the encoding picked up from the Java Virtual Machine (JVM); the arguments above override it and set the file encoding to UTF-8.
Resolved by making the changes below:
Modify the properties below in eclipse.ini, then close and restart Eclipse:
-Dclient.encoding.override=UTF-8
-Dfile.encoding=UTF-8
Set the encoding to UTF-8 [navigation path: Edit -> Set Encoding]
Properties files are expected to be ISO-8859-1 (Latin-1) encoded.
Most likely this is what Eclipse was set to by default as well.
You have to make sure that every tool run in the build (or wherever) disregards that part of the spec and uses UTF-8 instead.
This looks like a mixture of Eclipse and Git encoding, or rather non-encoding.
Git uses raw bytes and doesn't care about encoding. Using git diff you might get characters like those shown here; an example there is R<C3><BC>ckg<C3><A4>ngig, which should be "Rückgängig".
As you can see, two funny bracket things show up per umlaut (an umlaut is two bytes in UTF-8). And in your editor there are always two \uFFFD for each umlaut in the lines starting with +.
So I assume that your UTF-8 editor tries to interpret the Git notation and fails. This in turn leads to the representation \uFFFD, which basically means that this is a character whose value is unknown or unrepresentable (see here).
As suggested in the first link, you can try setting LESSCHARSET=UTF-8 in your environment variables (Windows); on Linux it should go in /etc/profile.
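A minimal sketch (mine) of where \uFFFD comes from: decoding bytes that are not valid UTF-8 with Java's default, lenient decoder substitutes one U+FFFD REPLACEMENT CHARACTER per malformed byte sequence:
import java.nio.charset.StandardCharsets;

// 0xFC is "ü" in ISO-8859-1, but on its own it is not a valid UTF-8 sequence
byte[] latin1Bytes = {(byte) 0xFC};
String decoded = new String(latin1Bytes, StandardCharsets.UTF_8);
System.out.println((int) decoded.charAt(0) == 0xFFFD);  // true: the replacement character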
See the entry on the marker FFFD (REPLACEMENT CHARACTER) in http://unicode.org/faq/utf_bom.html
and see native2ascii --help:
-encoding encoding_name
Specifies the name of the character encoding to be used by the conversion procedure. If this option is not present, then the
default character encoding (as determined by the java.nio.charset.Charset.defaultCharset method) is used. The encoding_name
string must be the name of a character encoding that is supported by the JRE. See Supported Encodings at
http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
An example:
$ file yourfile.properties
yourfile.properties: ISO-8859 text, with very long lines
$ native2ascii -encoding ISO-8859-1 yourfile.properties yourfile.properties
My Windows default encoding is GBK, and my Eclipse is entirely UTF-8 encoded.
So an application which runs well in my Eclipse crashes because the words become unreadable when it is exported as a jar file.
I have to put the following line in a .bat file to run the application:
start java -Dfile.encoding=utf-8 -jar xxx.jar
Now my question is: can I write something in the source code so that the application (or the JVM it runs in) uses UTF-8 instead of the system's default encoding?
When you open a file for reading, you need to explicitly specify the encoding you want to use for reading the file:
Reader r = new InputStreamReader(new FileInputStream("myfile"), StandardCharsets.UTF_8);
Then the value of the default platform encoding (which you can change using -Dfile.encoding) no longer matters.
Note:
I would normally recommend always specifying the encoding explicitly for any operation that depends on the platform default, such as character I/O. Many Java API methods fall back to the platform encoding, which I consider bad design, because the platform encoding is often not the right one, and it may also suddenly change (e.g., if the user switches the OS locale), breaking your app.
So just always say which encoding you want.
There are some cases where the platform encoding is the right one (such as when opening a file the user just created for you), but they are fairly rare.
Note 2:
java.nio.charset.StandardCharsets was introduced in Java 1.7. For older Java versions, you need to specify the input encoding as a String (ugh). The list of possible encodings depends on the JVM, but every JVM is guaranteed to support at least:
US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
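For example, the pre-Java-7 equivalent of the line above; note that this overload throws the checked UnsupportedEncodingException, which you must handle or declare:
Reader r = new InputStreamReader(new FileInputStream("myfile"), "UTF-8");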
There's another way.
If you are sure how you would like to encode the input and output, you can save the settings before you compile your jar file.
Here is an example for NetBeans.
Go to Project >> Properties >> Run >> VM Options and type -Dfile.encoding=UTF-8
After that, everything is encoded in UTF-8 every time the Java VM is started from NetBeans.
(I think Eclipse offers the same possibility. If not, just google for VM options.)