Creating a file with non-English characters in the file name - java

How do I create a file with Chinese characters in its file name in Java? I have read many Stack Overflow answers but couldn't find a proper solution. The code I have written so far is as follows:
private void writeFile(String fileContent) {
    try {
        File file = new File("C:/temp/你好.docx");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileOutputStream fos = new FileOutputStream(file);
        // FileOutputStream.write() takes bytes, not a String
        fos.write(fileContent.getBytes(StandardCharsets.UTF_8));
        fos.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The file is written to the output directory, but the name of the file contains garbage characters. Thanks in advance.

I believe the key is what I mentioned in the comment.
You can have Chinese characters (or other non-ASCII characters) as literals in source code. However, you should make sure of two things:
The encoding you use to save the source file is capable of representing those characters (my suggestion is to just stick with UTF-8).
The -encoding parameter you pass to javac should match the encoding of the file.
For example, if you use UTF-8, your javac command should look like:
javac -encoding utf-8 Foo.java
I have just tried a piece of code similar to yours, and it works fine on my machine.
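For reference, here is a minimal sketch of the same write using java.nio with an explicit charset (the path and content mirror the question; the class name is just illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ChineseFileName {
    public static void main(String[] args) throws Exception {
        // The Unicode file name survives only if the source file is
        // saved as UTF-8 and compiled with javac -encoding utf-8
        Path path = Paths.get("C:/temp/你好.docx");
        Files.deleteIfExists(path);
        Files.write(path, "file content".getBytes(StandardCharsets.UTF_8));
    }
}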

Related

Vscode doesn't recognize Umlaute (äöü) when reading and writing files with Java

I have a Java project which reads a .txt file, counts the frequency of every word, and saves each word along with its frequency in a .stat file. I do this by reading the file with a BufferedReader, using replaceAll to replace all special characters with spaces, then iterating over the words and finally writing the .stat file with a PrintWriter.
This program works fine if I run it in Eclipse.
However, if I run it in VSCode, the umlauts (äöü) are treated as special characters and removed from the words.
If I don't use a replaceAll and leave all the special characters in the text, they are recognized and displayed normally in the .stat file.
If I use replaceAll("[^\\p{IsAlphabetic}+]"), the umlauts get replaced by all kinds of weird Unicode characters (for example, Ăbermut instead of Übermut).
If I use replaceAll("[^a-zA-ZäöüÄÖÜß]"), the umlauts just get replaced by spaces. The same happens if I refer to the umlauts via their Unicode escape sequences.
This has to be a problem with the encoding in VSCode or perhaps PowerShell, as it works fine in other IDEs.
I already checked whether Eclipse and VSCode use the same JDK version, which they do. It's 17.0.5, and it is the only one installed on my machine.
I also tried out all the different encoding settings in VSCode and recreated the project from scratch after changing the settings, to no avail.
Here's the code of the minimal reproducible problem:
// App.java
import java.io.*;

public class App {
    static String s;

    public static void main(String[] args) {
        Reader reader = new Reader();
        reader.readFile();
    }
}

// Reader.java (a separate file, since both classes are public)
import java.io.*;

public class Reader {
    public void readFile() {
        String s = null;
        File file = new File("./src/textfile.txt");
        try (FileReader fileReader = new FileReader(file);
             BufferedReader bufferedReader = new BufferedReader(fileReader)) {
            s = bufferedReader.readLine();
        } catch (FileNotFoundException ex) {
            // TODO: handle exception
        } catch (IOException ex) {
            System.out.println("IOException");
        }
        System.out.println(s);
        System.out.println(s.replaceAll("[a-zA-ZäöüÄÖÜß]", " "));
    }
}
My textfile.txt contains the line "abcABCäöüÄÖÜß".
The above program outputs
abcABCäöü����
 äöü����
This suggests that the problem is in the Reader, as the gibberish Unicode symbols are not picked up by the replaceAll.
I solved it by explicitly converting all .java and .txt files to UTF-8 (via the encoding indicator in the bottom bar in VSCode), setting UTF-8 as the default encoding in the VSCode settings, and passing the UTF-8 charset to both the FileReader and the FileWriter like this:
FileReader fileReader = new FileReader(file, Charset.forName("UTF-8"));
FileWriter fileWriter = new FileWriter(file, Charset.forName("UTF-8"));
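An alternative that sidesteps the platform default entirely is java.nio's Files.newBufferedReader, which defaults to UTF-8. A minimal sketch (the path and regex mirror the question; the class name is just illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Read {
    public static void main(String[] args) throws IOException {
        // The charset is fixed, so the output no longer depends on
        // how the JVM was launched
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("./src/textfile.txt"), StandardCharsets.UTF_8)) {
            String s = reader.readLine();
            System.out.println(s);
            System.out.println(s.replaceAll("[a-zA-ZäöüÄÖÜß]", " "));
        }
    }
}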

Different Result in Java Netbeans Program

I am working on a small program to find text in a text file but I am getting a different result depending how I run my program.
When running my program from Netbeans I get 866 matches.
When running my program by double-clicking the .jar file in the dist folder, I get 1209 matches (the correct number).
It seems that when I'm running the program from NetBeans, it doesn't get to the end of the text file. Is that to be expected?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile() {
    // Loading file
    try {
        linelist.clear();
        aclist.clear();
        reader = new Scanner(new File(filepathinput));
        while (reader.hasNext()) {
            linelist.add(reader.nextLine());
        }
        for (int i = 0; i < linelist.size() - 1; i++) {
            if (linelist.get(i).startsWith("AC#")) {
                aclist.add(linelist.get(i));
            }
        }
    } catch (java.io.FileNotFoundException e) {
        System.out.println(e);
    } finally {
        String accountString = String.valueOf(aclist.size());
        account.setText(accountString);
        reader.close();
    }
}
The problem is an incompatibility between the Java app's (i.e. the JVM's) default file encoding and the input file's encoding.
The file's encoding is "ANSI", which commonly maps to the Windows-1252 encoding (or one of its variants) on Windows machines.
When running the app from the command prompt, the JVM (and thus, implicitly, the Scanner) takes the system default file encoding, which is Windows-1252. Reading the file, which has the same encoding, with this setup does not cause the problem.
However, NetBeans by default sets the project encoding to UTF-8, so when the app runs from NetBeans its file encoding is UTF-8. Reading the file with this encoding confuses the Scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem: since 0xEF is also the first byte of the UTF-8 BOM sequence (0xEF 0xBB 0xBF), it somehow messes up the Scanner.
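To see the clash concretely, here is a minimal sketch (the class name is just illustrative) that prints the Windows-1252 byte of "ï":

public class BomClash {
    public static void main(String[] args) throws Exception {
        byte[] b = "ï".getBytes("windows-1252");
        // Prints 0xEF, which is also the first byte of the UTF-8 BOM
        System.out.printf("0x%02X%n", b[0] & 0xFF);
    }
}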
As a solution,
either specify the encoding of the Scanner explicitly:
reader = new Scanner(file, "windows-1252");
or convert the input file to UTF-8 (using Notepad or, better, Notepad++) and set the encoding explicitly instead of relying on the system default:
reader = new Scanner(file, "utf-8");
However, when different OSes are considered, working with UTF-8 everywhere is the preferred way of dealing with multi-platform environments, so the second option is the way to go.
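Applied to the loadFile method above, the change is small (a fragment, assuming the fields declared in the question; I have also switched to hasNextLine() to match the nextLine() call):

// Pin the Scanner's charset so it reads the same inside and outside NetBeans
reader = new Scanner(new File(filepathinput), "utf-8");
while (reader.hasNextLine()) {
    linelist.add(reader.nextLine());
}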
It can also depend on the filepathinput input: the jar and NetBeans might be referring to two different files, possibly with the same name in different locations. Can you give more information on the value of the filepathinput variable?

Java text encoding

I read lines from a .txt file into a String list and show the text in a JTextPane. The encoding is fine when running from Eclipse or NetBeans; however, if I create a jar, the encoding is not correct. The encoding of the file is UTF-8. Is there a way to solve this problem?
Your problem is probably that you're opening a reader using the platform encoding.
You should manually specify the encoding whenever you convert between bytes and characters. If you know that the appropriate encoding is UTF-8 you can open a file thus:
FileInputStream inputFile = new FileInputStream(myFile);
try {
    // InputStreamReader (unlike FileReader) lets you name the charset
    Reader reader = new InputStreamReader(inputFile, "UTF-8");
    // Maybe buffer reader and do something with it.
} finally {
    inputFile.close();
}
Libraries like Guava can make this whole process easier.
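For instance, assuming Guava is on the classpath, reading all lines of a UTF-8 file becomes a one-liner (a sketch; the file path is a placeholder):

import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class GuavaRead {
    public static void main(String[] args) throws IOException {
        File myFile = new File("input.txt"); // placeholder path
        // The charset is explicit, so the platform default never applies
        List<String> lines = Files.asCharSource(myFile, StandardCharsets.UTF_8).readLines();
        lines.forEach(System.out::println);
    }
}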
Have you tried running your jar as:
java -Dfile.encoding=utf-8 -jar xxx.jar

java.io.File: accessing files with invalid filename encodings

Because the constructor of java.io.File takes a java.lang.String as its argument, there is seemingly no way to tell it which filename encoding to expect when accessing the filesystem layer. So when you generally use UTF-8 as the filename encoding and there is some filename containing an umlaut encoded as ISO-8859-1, you are basically **. Is this correct?
Update: because seemingly no one gets it, try it yourself: when creating a new file, the environment variable LC_ALL (on Linux) determines the encoding of the filename. It does not matter what you do inside your source code!
If you want to give a correct answer, demonstrate that you can create a file (using regular Java means) with proper ISO-8859-1 encoding while your JVM assumes LC_ALL=en_US.UTF-8. The filename should contain a character like ö, ü, or ä.
BTW: if you put filenames whose encoding does not match LC_ALL into Maven's resource path, it will just skip them.
Update II.
Fix this: https://github.com/jjYBdx4IL/filenameenc
ie. make the f.exists() statement become true.
Update III.
The solution is to use java.nio.*; in my case, I had to replace File.listFiles() with Files.newDirectoryStream(). I have updated the example on GitHub. BTW: Maven still seems to use the old java.io API, so mvn clean fails.
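A minimal sketch of that replacement (the directory is a placeholder; per the update above, the NIO API copes with the filenames that java.io.File mangles):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDir {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("."); // placeholder directory
        // Replaces File.listFiles() with the java.nio directory stream
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}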
The solution is to use the new API and file.encoding. Demonstration:
fge@alustriel:~/tmp/filenameenc$ echo $LC_ALL
en_US.UTF-8
fge@alustriel:~/tmp/filenameenc$ cat Test.java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Test
{
    public static void main(String[] args)
    {
        final String testString = "a/üöä";
        final Path path = Paths.get(testString);
        final File file = new File(testString);
        System.out.println("Files.exists(): " + Files.exists(path));
        System.out.println("File exists: " + file.exists());
    }
}
fge@alustriel:~/tmp/filenameenc$ install -D /dev/null a/üöä
fge@alustriel:~/tmp/filenameenc$ java Test
Files.exists(): true
File exists: true
fge@alustriel:~/tmp/filenameenc$ java -Dfile.encoding=iso-8859-1 Test
Files.exists(): false
File exists: true
fge@alustriel:~/tmp/filenameenc$
One less reason to use File!
Currently I am sitting at a Windows machine, but assuming you can fetch the file system encoding:
String encoding = System.getProperty("file.encoding");
// or, via the environment variable:
String encoding = System.getenv("LC_ALL");
Then you have the means to check whether a filename is valid. Mind: Windows can represent Unicode filenames, and my own Linux of course uses UTF-8.
boolean validEncodingForFileName(String name) {
    try {
        byte[] bytes = name.getBytes(encoding);
        String nameAgain = new String(bytes, encoding);
        return name.equals(nameAgain); // Nothing lost?
    } catch (UnsupportedEncodingException ex) {
        return false; // Maybe true, more a JRE limitation.
    }
}
You might try whether File is clever enough (I cannot test it):
boolean validEncodingForFileName(String name) throws IOException {
    // getCanonicalPath() may throw IOException
    return new File(name).getCanonicalPath().endsWith(name);
}
How I fixed java.io.File (on Solaris 5.11):
set the LC_* environment variable(s) in the shell/globally;
e.g. java -DLC_ALL="en_US.ISO8859-1" does not work!
make sure the chosen locale is installed on the system.
Why does that fix it?
Java internally calls nl_langinfo() to find out the encoding of paths on the disk, and nl_langinfo() does not notice environment variables set "for Java" via -DVARNAME.
Secondly, Java falls back to C/ASCII if the locale set by e.g. LC_ALL is not installed.
String can represent any encoding:
new File("the file name with \u00d6")
or
new File("the file name with Ö")
You can set the encoding while reading and writing the file. For example, when you write to a file you can give the encoding to your OutputStreamWriter as follows: new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8").
When you read a file, you can give the decoding charset to the constructor InputStreamReader(InputStream in, CharsetDecoder dec).
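A minimal round-trip sketch of both (the file name and content are placeholders):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        String fileName = "data.txt"; // placeholder
        // Write with an explicit charset...
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(fileName), StandardCharsets.UTF_8)) {
            out.write("grün\n");
        }
        // ...and read it back with the same charset
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}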

Read file with whitespace in its path using Java

I am trying to open files with FileInputStream whose names contain whitespace.
For example:
String fileName = "This is my file.txt";
String path = "/home/myUsername/folder/";
String filePath = path + fileName;
InputStream f = new BufferedInputStream(new FileInputStream(filePath));
The result is that a FileNotFoundException is being thrown.
I tried hardcoding the filePath to "/home/myUserName/folder/This\\ is\\ my\\ file.txt" just to see if I should escape the whitespace characters, and it did not seem to work.
Any suggestions on this matter?
EDIT: Just to be on the same page with everyone viewing this question: opening a file without whitespace in its name works; one with whitespace fails. Permissions are not the issue here, nor is the folder separator.
A file name with spaces works just fine. Here is my code:
File f = new File("/Windows/F/Programming/Projects/NetBeans/TestApplications/database prop.properties");
System.out.println(f.exists());
try
{
    FileInputStream stream = new FileInputStream(f);
}
catch (FileNotFoundException ex)
{
    System.out.println(ex.getMessage());
}
f.exists() always returns true, without any problem.
Looks like you have a problem with the file separator rather than the whitespace in your file names. Have you tried using
System.getProperty("file.separator")
instead of your '/' in the path variable?
No, you do not need to escape whitespaces.
If the code throws FileNotFoundException, then the file doesn't exist (or, perhaps, you lack requisite permissions to access it).
If permissions are fine, and you think that the file exists, make sure that it's called what you think it's called. In particular, make sure that the file name does not contain any non-printable characters, inadvertent leading or trailing whitespaces etc. For this, ls -b might be helpful.
Normally, whitespace in a path shouldn't matter. Just make sure that when you're passing the path from an external source (like the command line), it doesn't contain whitespace at the end:
File file = new File(path.trim());
In case you want a path without spaces, you can convert it to a URI and then back to a path:
try {
    // Encode the spaces so the string parses as a URI, then take
    // the decoded path back out of it
    URI u = new URI(path.trim().replaceAll("\\u0020", "%20"));
    File file = new File(u.getPath());
} catch (URISyntaxException ex) {
    ex.printStackTrace();
}
