Because the constructor of java.io.File takes a java.lang.String as its argument, there is seemingly no way to tell it which filename encoding to expect when accessing the filesystem layer. So when you generally use UTF-8 as the filename encoding and some filename contains an umlaut encoded as ISO-8859-1, you are basically **. Is this correct?
Update: because seemingly no one gets it, try it yourself: when creating a new file, the environment variable LC_ALL (on Linux) determines the encoding of the filename. It does not matter what you do inside your source code!
If you want to give a correct answer, demonstrate that you can create a file (using regular Java means) with a properly ISO-8859-1-encoded name while your JVM assumes LC_ALL=en_US.UTF-8. The filename should contain a character like ö, ü, or ä.
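For reference, a minimal sketch of the naïve attempt (class and file names are mine); per the updates below, nothing in this code influences which bytes the name is encoded into on disk:

import java.io.File;
import java.io.IOException;

public class NaiveCreate {
    public static void main(String[] args) throws IOException {
        // The umlaut lives in the String as UTF-16; which bytes end up on
        // disk is decided by the JVM's locale-derived encoding, not by
        // anything written in this code.
        File f = new File("umlaut-ö.txt");
        System.out.println("created: " + f.createNewFile());
    }
}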
BTW: if you put filenames whose encoding does not match LC_ALL into Maven's resource path, it will just skip them...
Update II.
Fix this: https://github.com/jjYBdx4IL/filenameenc
i.e. make the f.exists() statement return true.
Update III.
The solution is to use java.nio.*; in my case I had to replace File.listFiles() with Files.newDirectoryStream(). I have updated the example on GitHub. BTW: Maven still seems to use the old java.io API, so mvn clean fails.
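A minimal sketch of that replacement (directory hard-coded to "." as an assumption): where java.io.File may hand back mangled names that later fail exists(), the nio API round-trips directory entries as Path objects.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDir {
    public static void main(String[] args) throws IOException {
        // Unlike File.listFiles(), the entries come back as Path objects
        // that can be passed straight back to the filesystem layer.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("."))) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}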
The solution is to use the new API and file.encoding. Demonstration:
fge@alustriel:~/tmp/filenameenc$ echo $LC_ALL
en_US.UTF-8
fge@alustriel:~/tmp/filenameenc$ cat Test.java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Test
{
    public static void main(String[] args)
    {
        final String testString = "a/üöä";
        final Path path = Paths.get(testString);
        final File file = new File(testString);

        System.out.println("Files.exists(): " + Files.exists(path));
        System.out.println("File exists: " + file.exists());
    }
}
fge@alustriel:~/tmp/filenameenc$ install -D /dev/null a/üöä
fge@alustriel:~/tmp/filenameenc$ java Test
Files.exists(): true
File exists: true
fge@alustriel:~/tmp/filenameenc$ java -Dfile.encoding=iso-8859-1 Test
Files.exists(): false
File exists: true
fge@alustriel:~/tmp/filenameenc$
One less reason to use File!
Currently I am sitting at a Windows machine, but assuming you can fetch the file system encoding:
String encoding = System.getProperty("file.encoding");
String encoding = System.getenv("LC_ALL");
Then you have the means to check whether a filename is valid. Mind: Windows can represent Unicode filenames, and my own Linux of course uses UTF-8.
boolean validEncodingForFileName(String name) {
    try {
        byte[] bytes = name.getBytes(encoding);
        String nameAgain = new String(bytes, encoding);
        return name.equals(nameAgain); // Nothing lost?
    } catch (UnsupportedEncodingException ex) {
        return false; // Maybe true, more a JRE limitation.
    }
}
You might try whether File is clever enough (I cannot test it):
boolean validEncodingForFileName(String name) throws IOException {
    return new File(name).getCanonicalPath().endsWith(name);
}
How I fixed java.io.File (on Solaris 5.11):
set the LC_* environment variable(s) in the shell/globally
(e.g. java -DLC_ALL="en_US.ISO8859-1" does not work!)
make sure the locale you set is actually installed on the system
Why does that fix it?
Java internally calls nl_langinfo() to find out the encoding of paths on disk, and that call does not notice environment variables set "for Java" via -DVARNAME.
Secondly, nl_langinfo() falls back to C/ASCII if the locale set by e.g. LC_ALL is not installed.
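To see what the JVM actually derived from the environment, a small diagnostic sketch (class name is mine; sun.jnu.encoding is a JDK-internal property, not a supported API):

public class LocaleProbe {
    public static void main(String[] args) {
        // file.encoding drives the default charset for file contents;
        // sun.jnu.encoding (JDK-internal) is what is used for file names.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("LC_ALL           = " + System.getenv("LC_ALL"));
    }
}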
A Java String can represent any character, however you write it:
new File("the file name with \u00d6")
or
new File("the file name with Ö")
You can set the encoding while reading and writing a file. As an example, when you write to a file you can give the encoding to your output stream writer as follows: new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8").
When you read a file, you can give the decoding charset to the following constructor: InputStreamReader(InputStream in, CharsetDecoder dec).
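Putting both halves together, a self-contained sketch (file name and content are placeholders):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EncodedIo {
    public static void main(String[] args) throws IOException {
        String fileName = "sample.txt"; // placeholder
        // Write with an explicit charset instead of the platform default.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(fileName), StandardCharsets.UTF_8)) {
            out.write("grüße\n");
        }
        // Read back, decoding with the same charset.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}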
Context
I have detected a strange behaviour of File.separator when used in a piece of code compiled as a runnable jar.
For context, I have a database that is supported by xml. The program then interacts with that database, parsing the whole document at the start, creating the objects, modifying them, and then updating the xml file when the program ends.
The actual contents and file type are irrelevant, but I have included a simple Hello World txt as an example. Either way, the file is read as an InputStream by the database (the only difference is that I'm parsing the actual xml from that InputStream).
MWE
Database.java
import java.io.File;
import java.io.InputStream;

public class Database {

    public InputStream getInputStreamDependentSeparator(String fileName) {
        return getClass().getResourceAsStream("/resources/" + fileName);
    }

    public InputStream getInputStreamIndependentSeparator(String fileName) {
        return getClass().getResourceAsStream("/resources" + File.separator + fileName);
    }
}
Main.java
import java.io.InputStream;

public class Main {

    public static void main(String[] args) {
        Database db = new Database();
        InputStream is1 = db.getInputStreamDependentSeparator("HelloWorld.txt");
        InputStream is2 = db.getInputStreamIndependentSeparator("HelloWorld.txt");
        System.out.println(is1.toString());
        System.out.println(is2.toString());
    }
}
HelloWorld.txt
Hello World!
Problem description
is1 uses / to separate directories, while is2 uses File.separator
Running through the IDE (Eclipse), neither statement results in an exception, and they produce the following output:
java.io.BufferedInputStream@5305068a
java.io.BufferedInputStream@1f32e575
Compiling the code as a runnable jar (in Eclipse: right-click on the project, click Export, then follow the prompts for a runnable jar) and running it from the command line results in a NullPointerException thrown for is2. This is the output:
java -jar fileSeparator.jar
sun.net.www.protocol.jar.JarURLConnection$JarURLInputStream@85ede7b
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "Object.toString()" because "is2" is null
at Main.main(Main.java:12)
Note that the System.out.println() statements are not meant to actually print the content of the HelloWorld.txt file, just to check whether the InputStream is valid or trigger an exception otherwise.
Questions
Quoting directly from Oracle (EDIT: as Erwin Bolwidt pointed out, this quote concerns Oracle ATG and is not applicable to the JDK; sorry for not noticing):
When specifying values for file properties, Nucleus translates the forward slash (/) to the file separator for your platform (for example, Windows uses a backslash (\) as a file separator).
This means that both File.separator and / should give identical results. However, the MWE shows that the former does not work when running the jar, while the latter works both in the IDE and from the jar. Is this a bug? Or am I missing something, and this is the actual intended behaviour?
I have been using File.separator since, according to this post, it is the safest way to ensure future platform compatibility. Whether this is a bug or intended behaviour, how should I fix the issue? (I could just use /, but then again I'd lose platform independence...)
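For what it's worth, a sketch of the portable variant (method name is mine): the getResource contract defines / as the resource-name separator on every platform, so File.separator belongs only in real file-system paths, never in classpath lookups.

import java.io.InputStream;

public class PortableDatabase {
    public InputStream getInputStream(String fileName) {
        // Resource names always use '/', regardless of OS; only actual
        // file-system paths should ever go through File.separator.
        return getClass().getResourceAsStream("/resources/" + fileName);
    }
}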
Like the title says, I'm not able to read the contents of a file (a CSV file) when running the same code on a Linux container.
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
    log.debug("File exists? " + Files.exists(file));
    log.debug("Path " + file.toString());
    log.debug("File " + file.toFile().toString());
    log.debug("File absolute path " + file.toAbsolutePath().toString());

    String line;
    Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
    try (BufferedReader br = new BufferedReader(new FileReader(file.toFile()))) {
        while ((line = br.readLine()) != null) {
            configurations.add(build(line, codeType));
        }
    }
    log.debug("Loaded " + configurations.size() + " configurations");
    return configurations;
}
The logs return "true" and the path of the file on both systems (locally on Windows and on a Linux Docker container). On Windows it loads "15185 configurations" but on the container it loads "0 configurations".
The file exists on Linux; I use bash and checked it myself. I used the head command and the file has lines.
Before this I tried with Files.lines like so:
var vehicleConfigurations = Files.lines(file)
        .map(line -> build(line, codeType))
        .collect(Collectors.toCollection(HashSet::new));
But this has a problem (on the container only) regarding the contents. It reads the file, but not the whole file: it reaches a given line (say line 8000) and does not read it completely (it reads about half the line, stopping before the comma separator). Then I get a java.lang.ArrayIndexOutOfBoundsException because my build method tries to split the line and I access index 1 (which it doesn't have, only 0):
private VehicleConfiguration build(String line, CodeType codeType) {
    String[] cells = line.split(lineSeparator);
    var vc = new VehicleConfiguration();
    vc.setVin(cells[0]);
    vc.setCode(cells[1]);
    vc.setType(codeType);
    return vc;
}
What could be the issue? I don't understand how the same code (in Java) works on Windows but not on a Linux container. It makes no sense.
I'm using Java 11. The file is copied using volumes in a docker-compose file like this:
volumes:
- ./file-sources:/file-sources
I then copy the file (using the cp command on the Linux container) from file-sources to /root because that's where the app is listening for new files to arrive. File contents are then read with the methods I described. Example file data (does not have weird characters):
Thanks in advance.
UPDATE: Tried with the newBufferedReader method; same result (works on Windows, doesn't work on the Linux container):
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
    String line;
    Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
    try (BufferedReader br = Files.newBufferedReader(file)) {
        while ((line = br.readLine()) != null) {
            configurations.add(build(line, codeType));
        }
    }
    log.debug("Loaded " + configurations.size() + " configurations");
    return configurations;
}
wc -l in the linux container (in /root) returns: 15185 hard_001.csv
Update: This is not a solution, but I found out that by dropping the files directly into the file-sources folder and making that the folder the code listens to, the files are read. So basically the problem seems tied to using cp/mv inside the container to another folder. Maybe the file is read before it is fully copied/moved, and that's why it reads 0 configurations?
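If a half-copied file is the suspect, one common mitigation (my sketch, not from the thread; paths taken from the setup above) is to copy under a temporary name and rename at the end, since a rename within one filesystem is atomic and the watcher never sees a partially written file under its final name:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SafeDrop {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("/file-sources/hard_001.csv");
        Path target = Paths.get("/root/hard_001.csv");
        Path tmp = target.resolveSibling(target.getFileName() + ".part");
        // Copy under a temporary name first...
        Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
        // ...then atomically rename into place.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}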
There are a few methods in java you should never use. ever.
new FileReader(File) is one of them.
Any time that you have a thing that represents bytes and somehow chars or Strings fall out, or vice versa? Don't ever use those, unless the spec of said method explicitly points out that it always uses a pre-set charset. Almost all such methods use the 'system default charset' which means that the operation depends on the machine you run it on. That is shorthand for 'this will fail, and your tests won't catch it'. Which you don't want.
Which is why you should never use these things.
FileReader has been fixed (there is a second constructor that takes a charset), but that's only since JDK11. You already have the nice new API, why do you switch back to the dinky old File API? Don't do that.
All the various methods in Files, such as Files.newBufferedReader, are specced to use UTF-8 if you don't specify a charset (in that way Files is more useful than, and unlike, most other Java core libraries). Thus:
try (BufferedReader br = Files.newBufferedReader(file)) {
which is just.. better.. than your line.
Now, it'll probably still fail on you. But that's good! It'll also fail on your dev machine. Most likely, the file you are reading is not, in fact, in UTF_8. This is the likely guess; most linuxen are deployed with a UTF_8 default charset, and most dev machines are not; if your dev machine is working and your deployment environment isn't, the obvious conclusion is that your input file is not UTF_8. It does not need to be what your dev machine has as its default either; something like ISO_8859_1 will never throw exceptions, but it will read gobbledygook instead. Your code may seem to work (no crashes), but the text you read is still incorrect.
Figure out what text encoding you got, and then specify it. If it's ISO_8859_1, for example:
try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
and now your code no longer has the 'works on some machines but not on others' nature.
Inspect the line where it fails, in a hex editor if you have to. I bet you dollars to donuts there will be a byte there which is 0x80 or higher (in decimal, 128 or higher). Everything up to and including 127 tends to mean the exact same thing in a wide variety of text encodings, from ASCII to any ISO-8859 variant to UTF-8 to Windows Cp1252 to MacRoman to so many other things, so as long as it's all just plain letters and digits, having the wrong encoding is not going to make any difference. But once you get to 0x80 or higher they're all different. Armed with that byte plus some knowledge of what character it is supposed to be, you usually have a good start on figuring out what encoding that text file is in.
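If no hex editor is handy, a throwaway sketch (file name is a placeholder) that prints every byte at or above 0x80, i.e. exactly the ones that differ between encodings:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HighByteDump {
    public static void main(String[] args) throws IOException {
        byte[] all = Files.readAllBytes(Paths.get("input.csv")); // placeholder name
        for (int i = 0; i < all.length; i++) {
            int b = all[i] & 0xFF;
            // Bytes below 0x80 mean the same thing in most encodings;
            // anything at or above it is where encodings disagree.
            if (b >= 0x80) {
                System.out.printf("offset %d: 0x%02X%n", i, b);
            }
        }
    }
}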
NB: If this isn't it, check how the text file is being copied from your dev machine to your deployment environment. Are you sure it is the same file? If it's being copied through a textual mechanism, charset encoding again can be to blame, but this time in how the file is written, instead of how your java app reads it.
I am working on a small program to find text in a text file, but I am getting a different result depending on how I run my program.
When running my program from Netbeans I get 866 matches.
When running my program by double clicking on the .jar file in the DIST folder, I get 1209 matches (The correct number)
It seems that when I'm running the program from Netbeans, it doesn't get to the end of the text file. Is that to be expected?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile() {
    // Loading file
    try {
        linelist.clear();
        aclist.clear();

        reader = new Scanner(new File(filepathinput));
        while (reader.hasNext()) {
            linelist.add(reader.nextLine());
        }
        for (int i = 0; i < linelist.size() - 1; i++) {
            if (linelist.get(i).startsWith("AC#")) {
                aclist.add(linelist.get(i));
            }
        }
    }
    catch (java.io.FileNotFoundException e) {
        System.out.println(e);
    }
    finally {
        String accountString = String.valueOf(aclist.size());
        account.setText(accountString);
        reader.close();
    }
}
The problem is an incompatibility between the Java app's (i.e. the JVM's) default file encoding and the input file's encoding.
The file's encoding is "ANSI" which commonly maps to Windows-1252 encoding (or its variants) on Windows machines.
When running the app from the command prompt, the JVM (and thus the Scanner, implicitly) takes the system default file encoding, which is Windows-1252. Reading the same encoded file with this setup does not cause the problem.
However, Netbeans by default sets the project encoding to UTF-8, therefore when running the app from Netbeans its file encoding is UTF-8. Reading the file with this encoding confuses the scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem: since 0xEF is the first byte of the UTF-8 BOM sequence (0xEF 0xBB 0xBF), it is somehow messing up the scanner.
As a solution,
either specify the encoding type of the scanner explicitly
reader = new Scanner(file, "windows-1252");
or convert the input file's encoding to UTF-8 (using Notepad, or better, Notepad++) and set the scanner's encoding to UTF-8 explicitly instead of relying on the system default.
reader = new Scanner(file, "utf-8");
However, when different OSes are considered, working with UTF-8 everywhere is the preferred way of dealing with multi-platform environments. Hence the second way is the one to go with.
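The conversion can also be scripted rather than done in an editor; a minimal sketch (file names are placeholders):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("input-1252.txt");   // placeholder names
        Path out = Paths.get("output-utf8.txt");
        // Decode with the source encoding, re-encode as UTF-8.
        List<String> lines = Files.readAllLines(in, Charset.forName("windows-1252"));
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}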
It can also depend on the filepathinput value: the jar and Netbeans might be referring to two different files, possibly with the same name in different locations. Can you give more information on the filepathinput variable's value?
How do I create a file with Chinese characters in its name in Java? I have read many Stack Overflow answers but couldn't find a proper solution. The code I have written so far is as follows:
private void writeFile(String fileContent) {
    try {
        File file = new File("C:/temp/你好.docx");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileOutputStream fos = new FileOutputStream(file);
        fos.write(fileContent.getBytes("UTF-8")); // write() takes bytes, not a String
        fos.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
The file is written to the output directory, but the name of the file contains garbage characters. Thanks in advance.
I believe the key is what I have mentioned in the comment.
You can have Chinese characters (or other non-ASCII characters) as literals in source code. However, you should make sure of two things:
The encoding you use to save the source file must be capable of representing those characters (my suggestion is to just stick with UTF-8).
The -encoding parameter you pass to javac should match the encoding of the file.
For example, if you use UTF-8, your javac command should look like:
javac -encoding utf-8 Foo.java
I have just tried a piece of code similar to yours, and it works well on my machine.
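As a side note, Unicode escapes sidestep the source-encoding question entirely, because the source file stays pure ASCII; a minimal sketch (path taken from the question, class name is mine):

import java.io.File;
import java.io.IOException;

public class EscapedName {
    public static void main(String[] args) throws IOException {
        // \u4f60\u597d spells 你好 with ASCII-only Unicode escapes, so the
        // encoding of this source file no longer matters for the literal.
        File file = new File("C:/temp/\u4f60\u597d.docx");
        System.out.println(file.createNewFile() ? "created" : "already there");
    }
}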
I have a lot of resource files bundled with my Java app. These files have filenames containing international characters like ü or æ. I would like to load these files using getClass().getResource(), but apparently this is not supported since for these particular file names, the getResource method always returns null.
That made me experiment with using URL encoding of the international characters, but this is not supported either as stated by http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4968789.
So, my question is: What is the recommended way of loading a resource which has a name containing international characters? For example, I need to load the UTF-8 contents of a file named Sjælland.txt
Not sure if there is a best way (this is probably a candidate for the worst, because it is quite a hack), but it is a capable mechanism. It sidesteps the need to use getResource by reading the jar directly.
public class NavelGazing {

    public static void main(String[] args) throws Throwable {
        // Do a little navel gazing.
        java.net.URL codeBase = NavelGazing.class.getProtectionDomain().getCodeSource().getLocation();
        // Must be a jar.
        if (codeBase.getPath().endsWith(".jar")) {
            // Open it.
            java.util.jar.JarInputStream jin = new java.util.jar.JarInputStream(codeBase.openStream());
            // Walk the entries.
            java.util.zip.ZipEntry entry;
            while ((entry = jin.getNextEntry()) != null) {
                System.out.println("Entry: " + entry.getName());
            }
        }
    }
}
I added a file called Sjælland.txt and this did successfully get the entry.
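Building on that, a sketch (assuming JDK 9+ for readAllBytes, and that the app runs from the jar) that reads the matching entry's UTF-8 contents instead of only printing its name:

import java.nio.charset.StandardCharsets;
import java.util.jar.JarInputStream;
import java.util.zip.ZipEntry;

public class ResourceByJarWalk {
    public static void main(String[] args) throws Throwable {
        java.net.URL codeBase = ResourceByJarWalk.class.getProtectionDomain()
                .getCodeSource().getLocation();
        try (JarInputStream jin = new JarInputStream(codeBase.openStream())) {
            ZipEntry entry;
            while ((entry = jin.getNextEntry()) != null) {
                if (entry.getName().endsWith("Sjælland.txt")) {
                    // read() returns -1 at the end of the current entry, so
                    // readAllBytes() yields exactly this entry's contents.
                    byte[] buf = jin.readAllBytes();
                    System.out.println(new String(buf, StandardCharsets.UTF_8));
                }
            }
        }
    }
}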
I am not sure that I understand you correctly, but if I try
URL url = Test.class.getResource("/Sjælland.txt");
Object o = url.getContent();
then o is a sun.net.www.content.text.PlainTextInputStream.
I'm using JDK 1.6 on a Windows machine. I've got the (default?) system property sun.jnu.encoding set to Cp1252. So it all seems to work fine. The bug you've posted seems to be against JDK 1.4, which might be what you're using.