Visual Studio Code Java compile does not work with UTF-8 - java

I am running into a bug with the Java debug plugin in VS Code on Windows 10 when I try to print some special characters with this code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class App {
    public static void main(String[] args) throws Exception {
        String hello = "こんにちは世界!";
        System.out.println(hello);
        FileOutputStream oStream = new FileOutputStream("output.txt");
        Writer out = new BufferedWriter(new OutputStreamWriter(oStream, "UTF-8"));
        try {
            out.write(hello);
        } finally {
            out.close();
        }
    }
}
If the file encoding is set to UTF-8, the debug console and the output file show broken characters when I debug in VS Code; if I change the file encoding to UTF-8 with BOM, both the debug console and the file show the correct characters. But UTF-8 with BOM is not a solution either, because javac's UTF-8 decoding does not recognize the initial BOM. So changing the file encoding to UTF-8 with BOM breaks my project: every time I compile my code with javac or a build tool like Gradle or Maven, it throws this error:
> gradle build
> Task :compileJava FAILED
D:\Workspace\Code\~SourceCode\Java\TestEncode\src\main\java\App.java:1: error: illegal character: '\ufeff'
?import java.io.BufferedWriter;
^
(I don't know what magic M$ uses to make UTF-8 with BOM work here!)
Does anybody know a fix or workaround for this?
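One possible workaround (not from the thread, just a hedged sketch): keep the source files as UTF-8 with BOM for the editor, but strip the leading BOM before handing them to javac. The class name and path handling below are purely illustrative.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripBom {
    public static void main(String[] args) throws IOException {
        // The source file is passed as the first command-line argument (illustrative).
        Path source = Paths.get(args[0]);
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        // A UTF-8 BOM decodes to the single character U+FEFF at the start of the text.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            Files.write(source, text.substring(1).getBytes(StandardCharsets.UTF_8));
        }
    }
}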

Related

Vscode doesn't recognize Umlaute (äöü) when reading and writing files with Java

I have a Java project which reads from a .txt file and counts the frequency of every word and saves every word along with its frequency in a .stat file. The way I do this is by reading the file with a BufferedReader, using replaceAll to replace all special characters with spaces and then iterating through the words and finally writing into a .stat with a PrintWriter.
This program works fine if I run it in Eclipse.
However, if I run it in VSCode, the Umlaute (äöü) get recognized as special characters and are removed from the words.
If I don't use a replaceAll and leave all the special characters in the text, they will get recognized and displayed normally in the .stat.
If I use replaceAll("[^\\p{IsAlphabetic}+]"), the Umlaute will get replaced by all kinds of weird Unicode characters (for example Ăbermut instead of Übermut).
If I use replaceAll("[^a-zA-ZäöüÄÖÜß]"), the Umlaute will just get replaced by spaces. The same happens if I mention the Umlaute via their Unicode.
This has to be a problem with the encoding in VSCode or perhaps Powershell, as it works fine in other IDEs.
I already checked whether Eclipse and VSCode use the same JDK version, which they do. It's 17.0.5 and the only one installed on my machine.
I also tried out all the different encoding settings in VSCode and I recreated the project from scratch after changing the settings, to no avail.
Here's the code of the minimal reproducible problem:
import java.io.*;

public class App {
    static String s;

    public static void main(String[] args) {
        Reader reader = new Reader();
        reader.readFile();
    }
}
public class Reader {
    public void readFile() {
        String s = null;
        File file = new File("./src/textfile.txt");
        try (FileReader fileReader = new FileReader(file);
             BufferedReader bufferedReader = new BufferedReader(fileReader)) {
            s = bufferedReader.readLine();
        } catch (FileNotFoundException ex) {
            // TODO: handle exception
        } catch (IOException ex) {
            System.out.println("IOException");
        }
        System.out.println(s);
        System.out.println(s.replaceAll("[a-zA-ZäöüÄÖÜß]", " "));
    }
}
My textfile.txt contains the line "abcABCäöüÄÖÜß".
The above program outputs
abcABCäöü����
 äöü����
Which shows that the problem is presumably in the Reader, as the gibberish Unicode symbols don't get picked up by the replaceAll.
I solved it by explicitly switching all Java files and all .txt files to UTF-8 encoding (in the bottom bar in VSCode), setting UTF-8 as the default encoding in the VSCode settings, and modifying both the FileReader and the FileWriter to work with the UTF-8 encoding like this:
FileReader fileReader = new FileReader(file, Charset.forName("UTF-8"));
FileWriter fileWriter = new FileWriter(file, Charset.forName("UTF-8"));
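For completeness, a minimal sketch of the readFile method with the charset made explicit (just the fix above applied to the example, using java.nio.charset.StandardCharsets; requires Java 11 or newer for the FileReader charset constructor, and names follow the question):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Reader {
    public void readFile() {
        File file = new File("./src/textfile.txt");
        String s = null;
        // Read with an explicit charset instead of the platform default,
        // so the result no longer depends on how the JVM was launched.
        try (BufferedReader bufferedReader =
                new BufferedReader(new FileReader(file, StandardCharsets.UTF_8))) {
            s = bufferedReader.readLine();
        } catch (IOException ex) {
            System.out.println("IOException");
        }
        System.out.println(s);
    }
}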

Illegal character error when compiling a program made using geany \u0000

I have been using Geany to create Java programs, and until now I was able to compile them successfully. The simple Java program below was also created using Geany; however, compiling it produced the illegal character error (\u0000).
public class SumOfCubedDigits
{
    public static void main(String[] args)
    {
        for (int i = 1; i <= 9; i++)
        {
            for (int j = 0; j <= 9; j++)
            {
                for (int k = 0; k <= 9; k++)
                {
                    double iCubed = Math.pow(i, 3);
                    double jCubed = Math.pow(j, 3);
                    double kCubed = Math.pow(k, 3);
                    double cubedDigits = iCubed + jCubed + kCubed;
                    int concatenatedDigits = (i * 100 + j * 10 + k);
                    if (cubedDigits == concatenatedDigits)
                    {
                        System.out.println(concatenatedDigits);
                    }
                }
            }
        }
    }
}
I recreated the program in nano and it compiled successfully. I then copied it across to Geany under a different name, SumTest.java, compiled it and got the same illegal character error. Clearly the error lies with the Geany IDE for Raspberry Pi. I'd like to know how I can fix the editor to create and compile programs successfully, as it is not just this program: it is any Java program created using Geany.
This might be a problem with encoding that Geany uses when saving the source file.
If you compile the file with javac without specifying the -encoding parameter, the platform's default encoding is used. On a modern Linux this is likely to be UTF-8; on Windows it is one of the ANSI code pages, such as Windows-1252.
To find out what the default encoding is, you can compile and run a small java program:
import java.nio.charset.Charset;

public class DefaultCharsetPrinter {
    public static void main(String[] argv) {
        System.out.println(Charset.defaultCharset());
    }
}
This should print the name of the default encoding used by java programs.
In Geany you can set the file encoding in menu Document > Set Encoding. You need to set this to the same value used by javac. The Geany manual describes additional options for setting the encoding.
As you are seeing a lot of errors complaining about the null character, it is most likely that Geany stores the file in an encoding with multiple bytes per character (for instance UTF-16) while javac expects an encoding with a single byte per character. If I save your source file as UTF-16 and then try to compile it with javac using UTF-8 encoding, I get the same error messages that you see. After saving the file as UTF-8 in Geany, the file compiles without problems.
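A quick way to confirm what the editor actually wrote (a hedged sketch, not part of the original answer; class name and argument handling are illustrative): dump the first bytes of the source file. A UTF-16LE file typically starts with FF FE and has 00 bytes between the ASCII characters, which javac then reports as the illegal character '\u0000'.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {
    public static void main(String[] args) throws IOException {
        // Pass the source file as the first argument, e.g. SumOfCubedDigits.java
        byte[] head = Files.readAllBytes(Paths.get(args[0]));
        for (int i = 0; i < Math.min(head.length, 16); i++) {
            System.out.printf("%02X ", head[i] & 0xFF);
        }
        System.out.println();
    }
}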
I had the same problem with a file I generated using the command echo "" > Main.java in Windows PowerShell.
I searched for the problem and it seemed to have something to do with encoding. I checked the encoding of the file using file -i Main.java and the result was text/plain; charset=utf-16le.
Later I deleted the file and recreated it in Git Bash using touch Main.java, and with this the file compiled successfully. I checked the file encoding using the file -i command and this time the result was Main.java: text/x-c; charset=us-ascii.
Next I searched the internet and found that to create an empty file using PowerShell we can use the New-Item cmdlet. I created the file using New-Item Main.java and checked its encoding; the result was again Main.java: text/x-c; charset=us-ascii, and this time it compiled successfully.

Different Result in Java Netbeans Program

I am working on a small program to find text in a text file, but I am getting a different result depending on how I run my program.
When running my program from Netbeans I get 866 matches.
When running my program by double clicking on the .jar file in the DIST folder, I get 1209 matches (The correct number)
It seems that when I'm running the program from Netbeans, it doesn't get to the end of the text file. Is that to be expected?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile() {
    // Loading file
    try {
        linelist.clear();
        aclist.clear();
        reader = new Scanner(new File(filepathinput));
        while (reader.hasNext()) {
            linelist.add(reader.nextLine());
        }
        for (int i = 0; i < linelist.size() - 1; i++) {
            if (linelist.get(i).startsWith("AC#")) {
                aclist.add(linelist.get(i));
            }
        }
    } catch (java.io.FileNotFoundException e) {
        System.out.println(e);
    } finally {
        String accountString = String.valueOf(aclist.size());
        account.setText(accountString);
        reader.close();
    }
}
The problem is an incompatibility between the java app's (i.e. JVM) default file encoding and the input file's encoding.
The file's encoding is "ANSI" which commonly maps to Windows-1252 encoding (or its variants) on Windows machines.
When running the app from the command prompt, the JVM (and therefore the Scanner, implicitly) uses the system default file encoding, which is Windows-1252; reading the file, encoded the same way, with this setup does not cause the problem.
However, Netbeans by default sets the project encoding to UTF-8, so when running the app from Netbeans its file encoding is UTF-8. Reading the file with this encoding confuses the scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem: since 0xEF is the first byte of the UTF-8 BOM sequence (0xEF 0xBB 0xBF), it somehow messes up the scanner.
As a solution,
either specify the encoding type of the scanner explicitly
reader = new Scanner(file, "windows-1252");
or convert the input file encoding to utf-8 using notepad or better notepad++, and set encoding type to utf-8 without using system default.
reader = new Scanner(file, "utf-8");
However, when different OSes are considered, working with UTF-8 everywhere is the preferred way of dealing with multi-platform environments. Hence the second way is the one to go with.
It can also depend on the filepathinput input: the jar and Netbeans might be referring to two different files, possibly with the same name in different locations. Can you give more information on the value of the filepathinput variable?

Java text encoding

I read lines from a .txt file into a String list and show the text in a JTextPane. The encoding is fine when running from Eclipse or NetBeans; however, if I create a jar, the encoding is not correct. The encoding of the file is UTF-8. Is there a way to solve this problem?
Your problem is probably that you're opening a reader using the platform encoding.
You should manually specify the encoding whenever you convert between bytes and characters. If you know that the appropriate encoding is UTF-8 you can open a file thus:
FileInputStream inputFile = new FileInputStream(myFile);
try {
    Reader reader = new InputStreamReader(inputFile, "UTF-8");
    // Maybe buffer the reader and do something with it.
} finally {
    inputFile.close();
}
Libraries like Guava can make this whole process easier.
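For instance, a sketch using Guava's Files.asCharSource (assuming Guava is on the classpath; the file name is illustrative):
// Read all lines with an explicit charset; the platform default is never involved.
// (throws IOException; Guava's Files here is com.google.common.io.Files)
List<String> lines = com.google.common.io.Files
        .asCharSource(new File("input.txt"), StandardCharsets.UTF_8)
        .readLines();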
Have you tried to run your jar as
java -Dfile.encoding=utf-8 -jar xxx.jar

java.io.File: accessing files with invalid filename encodings

Because the constructor of java.io.File takes a java.lang.String as its argument, there is seemingly no way to tell it which filename encoding to expect when accessing the filesystem layer. So when you generally use UTF-8 as the filename encoding and there is some filename containing an umlaut encoded as ISO-8859-1, you are basically **. Is this correct?
Update: because seemingly no one gets it, try it yourself: when creating a new file, the environment variable LC_ALL (on Linux) determines the encoding of the filename. It does not matter what you do inside your source code!
If you want to give a correct answer, demonstrate that you can create a file (using regular Java means) with proper ISO-8859-1 encoding while your JVM assumes LC_ALL=en_US.UTF-8. The filename should contain a character like ö, ü, or ä.
BTW: if you put filenames whose encoding is not appropriate to LC_ALL into Maven's resource path, it will just skip them...
Update II.
Fix this: https://github.com/jjYBdx4IL/filenameenc
i.e. make the f.exists() statement return true.
Update III.
The solution is to use java.nio.*; in my case I had to replace File.listFiles() with Files.newDirectoryStream(). I have updated the example on GitHub. BTW: Maven still seems to use the old java.io API... mvn clean fails.
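A minimal sketch of that replacement (not the exact code from the repository; the directory path is illustrative):
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDir {
    public static void main(String[] args) throws IOException {
        // Files.newDirectoryStream() goes through java.nio, which can represent
        // names that File.listFiles() would mangle under a mismatched locale.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("."))) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}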
The solution is to use the new API and file.encoding. Demonstration:
fge@alustriel:~/tmp/filenameenc$ echo $LC_ALL
en_US.UTF-8
fge@alustriel:~/tmp/filenameenc$ cat Test.java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Test
{
    public static void main(String[] args)
    {
        final String testString = "a/üöä";
        final Path path = Paths.get(testString);
        final File file = new File(testString);

        System.out.println("Files.exists(): " + Files.exists(path));
        System.out.println("File exists: " + file.exists());
    }
}
fge@alustriel:~/tmp/filenameenc$ install -D /dev/null a/üöä
fge@alustriel:~/tmp/filenameenc$ java Test
Files.exists(): true
File exists: true
fge@alustriel:~/tmp/filenameenc$ java -Dfile.encoding=iso-8859-1 Test
Files.exists(): false
File exists: true
fge@alustriel:~/tmp/filenameenc$
One less reason to use File!
Currently I am sitting at a Windows machine, but assuming you can fetch the file system encoding:
String encoding = System.getProperty("file.encoding");
String encoding = System.getenv("LC_ALL");
Then you have the means to check whether a filename is valid. Mind: Windows can represent Unicode filenames, and my own Linux of course uses UTF-8.
boolean validEncodingForFileName(String name) {
    try {
        byte[] bytes = name.getBytes(encoding);
        String nameAgain = new String(bytes, encoding);
        return name.equals(nameAgain); // Nothing lost?
    } catch (UnsupportedEncodingException ex) {
        return false; // Maybe true, more a JRE limitation.
    }
}
You might try whether File is clever enough (I cannot test it):
boolean validEncodingForFileName(String name) throws IOException {
    return new File(name).getCanonicalPath().endsWith(name);
}
How I fixed java.io.File (on Solaris 5.11):
set the LC_* environment variable(s) in the shell/globally.
e.g. java -DLC_ALL="en_US.ISO8859-1" does not work!
make sure the set locale is installed on the system
Why does that fix it?
Java internally calls nl_langinfo() to find out the encoding of paths on the disk, and that call does not see variables "set for Java" via -DVARNAME, because -D defines a Java system property rather than an environment variable.
Secondly, this falls back to C/ASCII if the locale set by e.g. LC_ALL is not installed.
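To see the difference for yourself, a small sketch (illustrative, not from the original answer): -D defines a Java system property, while nl_langinfo() only looks at the real process environment.
public class LocaleCheck {
    public static void main(String[] args) {
        // Run as: java -DLC_ALL=en_US.ISO8859-1 LocaleCheck
        System.out.println("environment LC_ALL = " + System.getenv("LC_ALL"));      // unaffected by -D
        System.out.println("property    LC_ALL = " + System.getProperty("LC_ALL")); // set by -D, ignored by native code
    }
}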
String can represent any encoding:
new File("the file name with \u00d6")
or
new File("the file name with Ö")
You can set the encoding while reading and writing the file. As an example, when you write to a file you can give the encoding to your OutputStreamWriter as follows: new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8").
When you read a file, you can give the decoding character set via the following class constructor: InputStreamReader(InputStream in, CharsetDecoder dec)
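Putting both directions together, a small self-contained sketch (file name and text are illustrative):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        // Write with an explicit charset...
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8))) {
            out.write("äöüÄÖÜß");
        }
        // ...and read it back with the same charset, so the platform default never matters.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("out.txt"), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}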
