Character encoding while parsing a file with Java on Linux

With a Java program, I am trying to read a file that contains the name of a file on my Linux filesystem. The name points to a file that was generated on a Windows OS and has accents in its name.
Example of this kind of file "input.csv" :
MYFILE_tést.doc;1
The java program parses the file in question and verifies if that file exists. But for those lines containing accents, file.exists() in Java always returns false.
The file "input.csv" is generated in Windows and encoded in ISO-8859-1
Linux locales are configured like this :
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
When reading the CSV file in Java, I'm forcing the encoding:
csvFile = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_CSV), "ISO-8859-1"));
I tried switching to UTF-8 (OS locales + file encoding) and playing with the -Dfile.encoding=ISO-8859-1 JVM parameter, but the problem remains.
The problem doesn't occur if I hardcode the filename with the accents in the source code instead of reading it from the CSV file.
Any ideas on how to fix this?
Thank you for your help
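One way to narrow this down, offered only as a diagnostic sketch: print the Unicode code points of the name parsed from the CSV next to those of the names Java reports for the directory. If the two differ (a replacement character, a stray control character, or a decomposed accent), file.exists() cannot match. The class and method names below (NameDump, codePoints) are made up for illustration, and the directory argument is a placeholder.

```java
import java.io.File;
import java.util.StringJoiner;

public class NameDump {
    /** Renders each code point of s as U+XXXX so that invisible differences become visible. */
    static String codePoints(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Directory to inspect; defaults to the working directory (placeholder value).
        String dir = args.length > 0 ? args[0] : ".";
        String[] names = new File(dir).list();
        if (names != null) {
            for (String name : names) {
                System.out.println(name + " -> " + codePoints(name));
            }
        }
        // In the real program, also call codePoints(...) on the name parsed from input.csv
        // and compare it with the matching line printed above.
    }
}
```

Calling codePoints on the string inside the program avoids passing the name through the command line, which would add yet another decoding step.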

Related

File is read on Windows but not on a Linux container?

Like the title says I'm not able to read the contents of a file (csv file) while running the same code on a linux container
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
log.debug("File exists? " + Files.exists(file));
log.debug("Path " + file.toString());
log.debug("File " + file.toFile().toString());
log.debug("File absolute path " + file.toAbsolutePath().toString());
String line;
Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
try(BufferedReader br = new BufferedReader(new FileReader(file.toFile()))){
while ((line = br.readLine()) != null) {
configurations.add(build(line, codeType));
}
}
log.debug("Loaded " + configurations.size() + " configurations");
return configurations;
}
The logs return "true" and the path for the file in both systems (locally on windows and on a linux docker container). On windows it loads "15185 configurations" but on the container it loads "0 configurations".
The file exists on linux, I use bash and check it myself. I use the head command and the file has lines.
Before this I tried with Files.lines like so:
var vehicleConfigurations = Files.lines(file)
.map(line -> build(line, codeType))
.collect(Collectors.toCollection(HashSet::new));
But this has a problem (on the container only) regarding the contents. It reads the file but not the whole file: it reaches a given line (say line 8000) and does not read it completely (it reads about half a line, stopping before the comma separator). Then I get a java.lang.ArrayIndexOutOfBoundsException because my build method splits the line and accesses index 1 (which doesn't exist, only index 0 does):
private VehicleConfiguration build(String line, CodeType codeType) {
String[] cells = line.split(lineSeparator);
var vc = new VehicleConfiguration();
vc.setVin(cells[0]);
vc.setCode(cells[1]);
vc.setType(codeType);
return vc;
}
What could be the issue? I don't understand how the same code (in Java) works on Windows but not on a Linux container. It makes no sense.
I'm using Java 11. The file is copied using volumes in a docker-compose file like this:
volumes:
- ./file-sources:/file-sources
I then copy the file (using cp command on the linux container) from file-sources to /root because that's where the app is listening for new files to arrive. File contents are then read with the methods I described. Example file data (does not have weird characters):
Thanks in advance.
UPDATE: Tried with newBufferedReader method, same result (works on windows, doesn't work on linux container):
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
String line;
Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
try(BufferedReader br = Files.newBufferedReader(file)){
while ((line = br.readLine()) != null) {
configurations.add(build(line, codeType));
}
}
log.debug("Loaded " + configurations.size() + " configurations");
return configurations;
}
wc -l in the linux container (in /root) returns: 15185 hard_001.csv
Update: This is no solution, but I found out that if I drop the files directly into the file-sources folder and make that folder the one the code listens to, the files are read. So it seems the problem is tied to using cp/mv inside the container to move the file to another folder. Maybe the file is read before it is fully copied/moved, and that's why it reads 0 configurations?

There are a few methods in java you should never use. ever.
new FileReader(File) is one of them.
Any time that you have a thing that represents bytes and somehow chars or Strings fall out, or vice versa? Don't ever use those, unless the spec of said method explicitly points out that it always uses a pre-set charset. Almost all such methods use the 'system default charset' which means that the operation depends on the machine you run it on. That is shorthand for 'this will fail, and your tests won't catch it'. Which you don't want.
Which is why you should never use these things.
FileReader has been fixed (there is a second constructor that takes a charset), but that's only since JDK11. You already have the nice new API, why do you switch back to the dinky old File API? Don't do that.
All the various methods in Files, such as Files.newBufferedReader, are specced to do UTF-8 if you don't specify (in that way, Files is more useful, and unlike most other java core libraries). Thus:
try (BufferedReader br = Files.newBufferedReader(file)) {
which is just.. better.. than your line.
Now, it'll probably still fail on you. But that's good! It'll also fail on your dev machine. Most likely, the file you are reading is not, in fact, in UTF_8. This is the likely guess; most linuxen are deployed with a UTF_8 default charset, and most dev machines are not; if your dev machine is working and your deployment environment isn't, the obvious conclusion is that your input file is not UTF_8. It does not need to be what your dev machine has as its default either; something like ISO_8859_1 will never throw exceptions, but it will read gobbledygook instead. Your code may seem to work (no crashes), but the text you read is still incorrect.
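To make the difference concrete, here is a small sketch (the class name DecodeCheck and the helper strictDecode are invented for this example). It decodes the same bytes strictly as UTF-8 and as ISO-8859-1, so you can see one charset reject the input while the other accepts anything:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeCheck {
    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "test" with an accented e, stored as a single ISO-8859-1 byte (0xE9).
        byte[] latin1 = {'t', (byte) 0xE9, 's', 't'};
        System.out.println(strictDecode(latin1, StandardCharsets.UTF_8)); // null: not valid UTF-8
        System.out.println(strictDecode(latin1, StandardCharsets.ISO_8859_1) != null); // true

        // The same word stored as UTF-8 (0xC3 0xA9 for the accented e).
        byte[] utf8 = {'t', (byte) 0xC3, (byte) 0xA9, 's', 't'};
        // ISO-8859-1 never fails; it silently yields 5 characters instead of 4.
        System.out.println(strictDecode(utf8, StandardCharsets.ISO_8859_1).length()); // 5
    }
}
```

A null result from the UTF_8 call is a strong hint the file is in some single-byte encoding; a non-null result from ISO_8859_1 proves nothing, because every byte sequence is valid there.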
Figure out what text encoding you got, and then specify it. If it's ISO_8859_1, for example:
try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
and now your code no longer has the 'works on some machines but not on others' nature.
Inspect the line where it fails, in a hex editor if you have to. I bet you dollars to donuts there will be a byte there which is 0x80 or higher (in decimal, 128 or higher). Everything up to and including 127 tends to mean the exact same thing in a wide variety of text encodings, from ASCII to any ISO-8859 variant to UTF-8 to Windows Cp1252 to MacRoman to so many other things, so as long as it's all just plain letters and digits, having the wrong encoding is not going to make any difference. But once you get to 0x80 or higher they're all different. Armed with that byte + some knowledge of what character it is supposed to be is usually a good start in figuring out what encoding that text file is in.
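If you don't have a hex editor at hand, a few lines of Java can find that byte for you. This is only a sketch; the class name FirstHighByte and the helper firstNonAscii are made up, and the file path comes from the command line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FirstHighByte {
    /** Returns the offset of the first byte >= 0x80, or -1 if the data is pure ASCII. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) >= 0x80) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java FirstHighByte <file>");
            return;
        }
        // Reads the whole file into memory, which is fine for a CSV of a few thousand lines.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int at = firstNonAscii(data);
        if (at < 0) {
            System.out.println("pure ASCII, the charset is not your problem");
        } else {
            System.out.printf("first non-ASCII byte at offset %d: 0x%02X%n", at, data[at] & 0xFF);
        }
    }
}
```

The bytes are read raw on purpose: no Reader is involved, so no charset can interfere with what you see.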
NB: If this isn't it, check how the text file is being copied from your dev machine to your deployment environment. Are you sure it is the same file? If it's being copied through a textual mechanism, charset encoding again can be to blame, but this time in how the file is written, instead of how your java app reads it.

Rename File using Java renameTo and characters like á - í - ñ

When I run this code
String path = "E:";
File F = new File(path,"TEST.txt");
File FF = new File(path, "áéíóúñ.txt" );
F.renameTo(FF);
I get a file with this name: Ã¡Ã©ÃÃ³ÃºÃ±.txt
Can I get the right name somehow?
I think my .txt file is saved in UTF-8, and I saved it again to be sure, but it still does not work. I also tried the \uXXXX syntax, which doesn't work either, and in any case I have to read the name from a .txt file. When I read the name (using the Scanner class) I can see it with characters like "áéñ...", but when it is used in renameTo it doesn't work.
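The garbled name has the typical shape of UTF-8 bytes decoded with a single-byte charset: every accented letter turns into "Ã" plus one more character. The sketch below reproduces that effect; the class name MojibakeDemo and the helper utf8ReadAsLatin1 are invented for the demonstration, and ISO-8859-1 stands in for whatever single-byte default the machine actually uses. If this matches what you see, the likely (not certain) fix is to name the charset wherever the text is read, for example new Scanner(file, "UTF-8"), or to pass -encoding UTF-8 to javac when the name is a literal in the source.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes as ISO-8859-1, as a misconfigured reader would. */
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The file name from the question, written with \\u escapes so this source stays pure ASCII.
        String intended = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1.txt";
        String garbled = utf8ReadAsLatin1(intended);
        System.out.println(intended.length()); // 10
        System.out.println(garbled.length());  // 16: each accented letter became two characters
    }
}
```

The third letter decodes to "Ã" followed by U+00AD, a soft hyphen that most displays do not render, which would explain why the garbled name above seems to be missing a character.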

Java : create DOS formatted text file

I have a tomcat running on a Linux server.
My webapp is creating text files that must be imported by another external system that accepts DOS/Windows formatted files.
FileWriterWithEncoding writer;
writer = new FileWriterWithEncoding(file,"UTF-8", true);
PrintWriter printer = new PrintWriter(writer);
How can I create such DOS formatted files with Java on a Linux server?
Thank you.

Make sure that the line endings you write are "\r\n", this is the Windows way of writing them (carriage return character + line feed character) .
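A minimal sketch of that advice, using plain java.nio rather than the FileWriterWithEncoding from the question. The class name DosFileWriter and the helper writeDosLines are made up, and UTF-8 is kept only because the question uses it; the importing system may expect a different charset such as windows-1252.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class DosFileWriter {
    private static final String CRLF = "\r\n";

    /** Writes each line followed by CR+LF, independent of the platform's line.separator. */
    static void writeDosLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write(CRLF); // not out.newLine(), which emits "\n" on Linux
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("dos-demo", ".txt"); // throwaway location for the demo
        writeDosLines(target, Arrays.asList("first", "second"));
        System.out.println(Files.size(target)); // 15 bytes: 5 + 2 + 6 + 2
    }
}
```

With the PrintWriter from the question the same rule applies: println() uses the platform line separator, so call print(line + "\r\n") instead.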

Different Result in Java Netbeans Program

I am working on a small program to find text in a text file but I am getting a different result depending how I run my program.
When running my program from Netbeans I get 866 matches.
When running my program by double clicking on the .jar file in the DIST folder, I get 1209 matches (The correct number)
It seems that when I'm running the program from Netbeans, it doesn't get to the end of the text file. Is that to be expected ?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile(){
//Loading file
try{
linelist.clear();
aclist.clear();
reader = new Scanner(new File(filepathinput));
while(reader.hasNext()){
linelist.add(reader.nextLine());
}
for(int i = 0; i < linelist.size()-1; i++){
if(linelist.get(i).startsWith("AC#")){
aclist.add(linelist.get(i));
}
}
}
catch(java.io.FileNotFoundException e){
System.out.println(e);
}
finally{
String accountString = String.valueOf(aclist.size());
account.setText(accountString);
reader.close();
}
}

The problem is an incompatibility between the java app's (i.e. JVM) default file encoding and the input file's encoding.
The file's encoding is "ANSI" which commonly maps to Windows-1252 encoding (or its variants) on Windows machines.
When running the app from the command prompt, the JVM (so the Scanner implicitly) will take the system default file encoding which is Windows-1252. Reading the same encoded file with this setup will not cause the problem.
However, Netbeans by default sets the project encoding to utf-8, therefore when running the app from Netbeans its file encoding is utf-8. Reading the file with this encoding confuses the scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem. In utf-8, 0xEF is the lead byte of a three-byte sequence (it is also the first byte of the BOM, ï»¿ = 0xEF 0xBB 0xBF), and the bytes that follow it here are not valid continuation bytes, so decoding fails at that point and the scanner stops reading.
As a solution,
either specify the encoding type of the scanner explicitly
reader = new Scanner(file, "windows-1252");
or convert the input file to utf-8 using notepad (or better, notepad++) and set the scanner's encoding to utf-8 explicitly instead of relying on the system default.
reader = new Scanner(file, "utf-8");
However, when different OSes are involved, using utf-8 everywhere is the preferred way of dealing with multi-platform environments. Hence the 2nd way is the one to go with.
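The effect is easy to reproduce with a tiny file. The sketch below is only an illustration (the names ScannerCharsetDemo, sampleFile and countLines are invented): it writes three lines in ISO-8859-1, which agrees with windows-1252 for "ï", and reads them back with both encodings. As far as I know, a Scanner built from a File does not replace malformed bytes; it swallows the decoding error and reports end of input, and the error is only visible through ioException().

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public class ScannerCharsetDemo {
    /** Creates a throwaway file whose second line holds the byte 0xEF (the i with diaeresis). */
    static File sampleFile() throws IOException {
        File file = File.createTempFile("scanner-demo", ".txt");
        file.deleteOnExit();
        byte[] content = "AC#1\nCara\u00efbes\nAC#2\n".getBytes(StandardCharsets.ISO_8859_1);
        Files.write(file.toPath(), content);
        return file;
    }

    /** Counts the lines the Scanner delivers and prints any I/O error it swallowed. */
    static int countLines(File file, String charsetName) throws IOException {
        int count = 0;
        try (Scanner reader = new Scanner(file, charsetName)) {
            while (reader.hasNextLine()) {
                reader.nextLine();
                count++;
            }
            if (reader.ioException() != null) {
                System.out.println(charsetName + ": stopped early, " + reader.ioException());
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File file = sampleFile();
        System.out.println(countLines(file, "ISO-8859-1")); // 3
        System.out.println(countLines(file, "UTF-8"));      // fewer than 3
    }
}
```

Checking ioException() after the read loop is a cheap way to notice this kind of silent truncation in the original loadFile method as well.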

It can also depend on the value of filepathinput. The jar and Netbeans might be referring to two different files, possibly with the same name in different locations. Can you give more information on the value of the filepathinput variable?

Java text encoding

I read lines from a .txt file into a String list. I show the text in a JTextPane. The encoding is fine when running from Eclipse or NetBeans, however if I create a jar, the encoding is not correct. The encoding of the file is UTF-8. Is there a way to solve this problem?

Your problem is probably that you're opening a reader using the platform encoding.
You should manually specify the encoding whenever you convert between bytes and characters. If you know that the appropriate encoding is UTF-8 you can open a file thus:
FileInputStream inputFile = new FileInputStream(myFile);
try {
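// InputStreamReader wraps the stream and takes the charset; FileReader has no such constructor.
Reader reader = new InputStreamReader(inputFile, "UTF-8");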
// Maybe buffer reader and do something with it.
} finally {
inputFile.close();
}
Libraries like Guava can make this whole process easier.

Have you tried to run your jar as
java -Dfile.encoding=utf-8 -jar xxx.jar
