I am working on a small program to find text in a text file, but I am getting a different result depending on how I run my program.
When running my program from Netbeans, I get 866 matches.
When running my program by double-clicking on the .jar file in the DIST folder, I get 1209 matches (the correct number).
It seems that when I'm running the program from Netbeans, it doesn't get to the end of the text file. Is that to be expected?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile(){
    // Loading file
    try {
        linelist.clear();
        aclist.clear();
        reader = new Scanner(new File(filepathinput));
        while (reader.hasNext()) {
            linelist.add(reader.nextLine());
        }
        for (int i = 0; i < linelist.size() - 1; i++) {
            if (linelist.get(i).startsWith("AC#")) {
                aclist.add(linelist.get(i));
            }
        }
    }
    catch (java.io.FileNotFoundException e) {
        System.out.println(e);
    }
    finally {
        String accountString = String.valueOf(aclist.size());
        account.setText(accountString);
        reader.close();
    }
}
The problem is an incompatibility between the Java app's (i.e. the JVM's) default file encoding and the input file's encoding.
The file's encoding is "ANSI", which on Windows machines commonly maps to the Windows-1252 encoding (or one of its variants).
When running the app from the command prompt, the JVM (and therefore the Scanner, implicitly) takes the system default file encoding, which is Windows-1252. Reading the file encoded that way with this setup does not cause the problem.
However, Netbeans by default sets the project encoding to UTF-8, so when running the app from Netbeans its file encoding is UTF-8. Reading the file with this encoding confuses the scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem: since 0xEF is the first byte of the UTF-8 BOM sequence (0xEF 0xBB 0xBF), it messes up the scanner.
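You can verify this yourself: compile the small check below and run it once from Netbeans and once outside of it, then compare the output (a minimal sketch, not part of the original code).

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // the default the Scanner falls back to when no charset is given
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());
    }
}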
As a solution,
either specify the encoding of the scanner explicitly:
reader = new Scanner(file, "windows-1252");
or convert the input file to UTF-8 (using Notepad, or better, Notepad++) and set the encoding explicitly instead of relying on the system default:
reader = new Scanner(file, "utf-8");
However, when different OSes are considered, working with UTF-8 everywhere is the preferred way of dealing with multi-platform environments, so the second option is the way to go.
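Applied to the loadFile() method above, only the Scanner construction changes; a sketch of the relevant lines (use "windows-1252" instead if the file keeps its current ANSI encoding):

// explicit charset instead of the platform default
reader = new Scanner(new File(filepathinput), "utf-8");
while (reader.hasNext()) {
    linelist.add(reader.nextLine());
}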
It can also depend on the filepathinput value: when run from the jar and from Netbeans, the program might be referring to two different files, possibly with the same name in different locations. Can you give more information on the value of the filepathinput variable?
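A quick way to rule that out is to log the resolved path at the start of loadFile(); a small sketch using the filepathinput field from the question:

File input = new File(filepathinput);
// prints the full path actually being read, and whether it exists
System.out.println(input.getAbsolutePath() + " exists=" + input.exists());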
Related
I have a Java project in which I want to take input from the user.
I wrote the code in Eclipse and it was running without any problems at all.
However, when I export my classes into an executable jar file using Eclipse and try to run it in the Windows cmd, the Scanner(System.in) can't read UTF-8 characters (Greek characters), or maybe it's something else that I haven't thought of.
This is the part of the code where I run into the problem:
String yesORno = inp.stringScanner(); // basically a nextLine()
while (!(yesORno.equals("ΝΑΙ") || yesORno.equals("ΟΧΙ"))) { // ΝΑΙ and ΟΧΙ are Greek characters, not Latin
    System.out.println("Παρακαλώ πληκτρολογίστε 'ΝΑΙ' ή 'ΟΧΙ'"); // please type ΝΑΙ or ΟΧΙ (in Greek)
    yesORno = inp.stringScanner(); // take input again
}
inp is an object of another class which I use to take input, in this case with the method stringScanner():
public String stringScanner() {
    Scanner in = new Scanner(System.in);
    return in.nextLine();
}
So when I run the code in Eclipse and enter some sample characters for testing, I get:
And that's what I want to happen every time.
But when I run the jar file, I get:
As you can see, the jar file for some reason doesn't recognise the Greek ΝΑΙ, and yesORno.equals("ΝΑΙ") doesn't return true to stop the while loop.
The same happens with ΟΧΙ.
I have tried running the jar file by using a .bat file like this:
start java -Dfile.encoding=UTF-8 -jar Myfile.jar
but with no success.
I've done a lot of research to resolve this problem but I have found nothing.
I would appreciate your help.
The JVM argument -Dfile.encoding tells the JVM which default encoding to assume for (text) files it may encounter. This includes stdin, stdout and stderr, mapped to System.in, System.out and System.err. But the argument will not change anything in the operating system.
Most probably, your Windows CMD is using the Windows-1253 encoding, not UTF-8. When you tell the JVM with the -Dfile.encoding argument that it would be UTF-8, that would be an outright lie …
Try start java -Dfile.encoding=Windows-1253 -jar Myfile.jar or start java -Dfile.encoding=ISO-8859-7 -jar Myfile.jar.
If your system is set up with Windows-1253, the second option may cause other problems, as ISO-8859-7 and Windows-1253 are not fully compatible. But for your test it should do the job.
According to the documentation, the way you use the scanner will always depend on the operating system's encoding settings.
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Scanner.html#%3Cinit%3E(java.io.InputStream)
Look at the alternative constructors - you can specify the encoding there directly. Your code could look like this:
Scanner in = new Scanner(System.in, "UTF-8");
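Applied to the stringScanner() helper from the question, that could look like the sketch below; the charset name must match what the console actually sends (UTF-8 is shown here, but a default Greek cmd window more likely uses "windows-1253"):

public String stringScanner() {
    // explicit charset instead of the platform default;
    // pick the one that matches the console's actual encoding
    Scanner in = new Scanner(System.in, "UTF-8");
    return in.nextLine();
}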
Like the title says, I'm not able to read the contents of a file (a CSV file) when running the same code on a Linux container.
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
    log.debug("File exists? " + Files.exists(file));
    log.debug("Path " + file.toString());
    log.debug("File " + file.toFile().toString());
    log.debug("File absolute path " + file.toAbsolutePath().toString());

    String line;
    Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
    try (BufferedReader br = new BufferedReader(new FileReader(file.toFile()))) {
        while ((line = br.readLine()) != null) {
            configurations.add(build(line, codeType));
        }
    }
    log.debug("Loaded " + configurations.size() + " configurations");
    return configurations;
}
The logs return "true" and the path of the file on both systems (locally on Windows and on a Linux Docker container). On Windows it loads "15185 configurations" but on the container it loads "0 configurations".
The file exists on Linux; I use bash and checked it myself. I used the head command and the file has lines.
Before this I tried with Files.lines, like so:
var vehicleConfigurations = Files.lines(file)
        .map(line -> build(line, codeType))
        .collect(Collectors.toCollection(HashSet::new));
But this has a problem (on the container only) regarding the contents. It reads the file, but not the whole file: it reaches a given line (say line 8000) and does not read it completely (it reads about half the line, stopping before the comma separator). Then I get a java.lang.ArrayIndexOutOfBoundsException because my build method tries to split the line and I access index 1 (which it doesn't have, only 0):
private VehicleConfiguration build(String line, CodeType codeType) {
    String[] cells = line.split(lineSeparator);
    var vc = new VehicleConfiguration();
    vc.setVin(cells[0]);
    vc.setCode(cells[1]);
    vc.setType(codeType);
    return vc;
}
What could be the issue? I don't understand how the same code (in Java) works on Windows but not on a Linux container. It makes no sense.
I'm using Java 11. The file is copied using volumes in a docker-compose file like this:
volumes:
- ./file-sources:/file-sources
I then copy the file (using the cp command on the Linux container) from file-sources to /root because that's where the app is listening for new files to arrive. The file contents are then read with the methods I described. Example file data (does not have weird characters):
Thanks in advance.
UPDATE: Tried with the newBufferedReader method, same result (works on Windows, doesn't work on the Linux container):
private Set<VehicleConfiguration> loadConfigurations(Path file, CodeType codeType) throws IOException {
    String line;
    Set<VehicleConfiguration> configurations = new HashSet<>(); // this way we ignore duplicates in the same file
    try (BufferedReader br = Files.newBufferedReader(file)) {
        while ((line = br.readLine()) != null) {
            configurations.add(build(line, codeType));
        }
    }
    log.debug("Loaded " + configurations.size() + " configurations");
    return configurations;
}
wc -l in the Linux container (in /root) returns: 15185 hard_001.csv
Update: This is not a solution, but I found out that if I drop the files directly into the file-sources folder and make that the folder the code listens to, the files are read. So basically, the problem seems to appear when using cp/mv inside the container to move the file to another folder. Maybe the file is read before it is fully copied/moved, and that's why it reads 0 configurations?
There are a few methods in Java you should never use. Ever.
new FileReader(File) is one of them.
Any time you have a thing that represents bytes and somehow chars or Strings fall out, or vice versa? Don't ever use those, unless the spec of said method explicitly points out that it always uses a pre-set charset. Almost all such methods use the 'system default charset', which means that the operation depends on the machine you run it on. That is shorthand for 'this will fail, and your tests won't catch it'. Which you don't want.
Which is why you should never use these things.
FileReader has been fixed (there is a second constructor that takes a charset), but only since JDK 11. You already have the nice new API; why switch back to the dinky old File API? Don't do that.
All the various methods in Files, such as Files.newBufferedReader, are specced to use UTF-8 if you don't specify a charset (in that way, Files is more useful than, and unlike, most other java core libraries). Thus:
try (BufferedReader br = Files.newBufferedReader(file)) {
which is just.. better.. than your line.
Now, it'll probably still fail on you. But that's good! It'll also fail on your dev machine. Most likely, the file you are reading is not, in fact, in UTF-8. That is the likely guess: most linuxen are deployed with a UTF-8 default charset, and most dev machines are not; if your dev machine is working and your deployment environment isn't, the obvious conclusion is that your input file is not UTF-8. It does not need to be whatever your dev machine has as its default either; something like ISO-8859-1 will never throw exceptions, but it will read gobbledygook instead. Your code may seem to work (no crashes), but the text you read is still incorrect.
Figure out what text encoding you got, and then specify it. If it's ISO_8859_1, for example:
try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
and now your code no longer has the 'works on some machines but not on others' nature.
Inspect the line where it fails, in a hex editor if you have to. I bet you dollars to donuts there will be a byte there which is 0x80 or higher (in decimal, 128 or higher). Everything up to and including 127 tends to mean the exact same thing in a wide variety of text encodings, from ASCII to any ISO-8859 variant to UTF-8 to Windows Cp1252 to MacRoman to so many other things, so as long as it's all just plain letters and digits, having the wrong encoding is not going to make any difference. But once you get to 0x80 or higher they're all different. Armed with that byte plus some knowledge of what character it is supposed to be, you usually have a good start in figuring out what encoding that text file is in.
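If you have no hex editor at hand, a small throwaway program can do the same inspection; a sketch (the file path is passed as the first command-line argument):

import java.nio.file.Files;
import java.nio.file.Paths;

public class FindHighBytes {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        int line = 1;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') line++;
            int b = bytes[i] & 0xFF;
            if (b >= 0x80) {
                // first byte whose meaning differs between encodings
                System.out.printf("line %d, offset %d: 0x%02X%n", line, i, b);
                return;
            }
        }
        System.out.println("all bytes are below 0x80 (plain ASCII)");
    }
}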
NB: If this isn't it, check how the text file is being copied from your dev machine to your deployment environment. Are you sure it is the same file? If it's being copied through a textual mechanism, charset encoding can again be to blame, but this time in how the file is written, rather than in how your java app reads it.
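A quick way to check whether both machines really see byte-identical files is to compare a checksum on each side; a minimal sketch along the same lines:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class FileChecksum {
    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(Files.readAllBytes(Paths.get(args[0])));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        // run on both machines; different output means different file contents
        System.out.println(hex);
    }
}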
I have been using Geany to create Java programs, and until now I was able to compile them successfully. The simple program below was also created using Geany, however the illegal character error (\u0000) occurred.
public class SumOfCubedDigits
{
    public static void main(String[] args)
    {
        for (int i=1; i<=9; i++)
        {
            for (int j=0; j<=9; j++)
            {
                for (int k=0; k<=9; k++)
                {
                    double iCubed=Math.pow(i,3);
                    double jCubed=Math.pow(j,3);
                    double kCubed=Math.pow(k,3);
                    double cubedDigits = iCubed + jCubed + kCubed;
                    int concatenatedDigits = (i*100 + j*10 + k);
                    if (cubedDigits==concatenatedDigits)
                    {
                        System.out.println(concatenatedDigits);
                    }
                }
            }
        }
    }
}
I recreated the program in nano and it was able to compile successfully. I then copied it across to Geany under a different name, SumTest.java, compiled it and got the same illegal character error. Clearly the error is with the Geany IDE for Raspberry Pi. I'd like to know how I could fix the editor so it creates and compiles programs successfully, as it's not just this program: it's any Java program created with Geany.
This might be a problem with encoding that Geany uses when saving the source file.
If you compile the file with javac without specifying the -encoding parameter, the platform's default encoding is used. On a modern Linux this is likely to be UTF-8; on Windows it is one of the ANSI character sets or UTF-16, I think.
To find out what the default encoding is, you can compile and run a small java program:
import java.nio.charset.Charset;

public class DefaultCharsetPrinter {
    public static void main(String[] argv) {
        System.out.println(Charset.defaultCharset());
    }
}
This should print the name of the default encoding used by java programs.
In Geany you can set the file encoding in menu Document > Set Encoding. You need to set this to the same value used by javac. The Geany manual describes additional options for setting the encoding.
As you are seeing a lot of errors complaining about the null character, it is most likely that Geany stores the file in an encoding with multiple bytes per character (for instance UTF-16) while javac expects an encoding with a single byte per character. If I save your source file as UTF-16 and then try to compile it with javac using UTF-8 encoding, I get the same error messages that you see. After saving the file as UTF-8 in Geany, the file compiles without problems.
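You can also confirm what Geany actually wrote by dumping the first few bytes of the saved file; a small sketch (a UTF-16 file starts with the byte order mark 0xFF 0xFE or 0xFE 0xFF, and a UTF-8 file with BOM starts with 0xEF 0xBB 0xBF):

import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        for (int i = 0; i < Math.min(4, bytes.length); i++) {
            // print the leading bytes; compare them against the BOM patterns above
            System.out.printf("0x%02X ", bytes[i] & 0xFF);
        }
        System.out.println();
    }
}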
I had the same problem with a file I generated using the command echo "" > Main.java in Windows PowerShell.
I searched the problem and it seemed to have something to do with encoding. I checked the encoding of the file using file -i Main.java and the result was text/plain; charset=utf-16le.
Later I deleted the file and recreated it in Git Bash using touch Main.java, and with this the file compiled successfully. I checked the file encoding using the file -i command and this time the result was Main.java: text/x-c; charset=us-ascii.
Next I searched the internet and found that to create an empty file using PowerShell we can use the New-Item cmdlet. I created the file using New-Item Main.java and checked its encoding; this time the result was Main.java: text/x-c; charset=us-ascii and it compiled successfully.
With a Java program, I am trying to read a file that contains the name of a file on my Linux filesystem. It points to a file that was generated on a Windows OS and has accents in its name.
Example of this kind of file "input.csv" :
MYFILE_tést.doc;1
The Java program parses the file in question and verifies that the referenced file exists. But for the lines containing accents, file.exists() in Java always returns false.
The file "input.csv" is generated on Windows and encoded in ISO-8859-1.
The Linux locales are configured like this:
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
When reading the CSV file in Java, I'm forcing the encoding:
csvFile = new BufferedReader(new InputStreamReader(new FileInputStream(FILE_CSV), "ISO-8859-1"));
I tried switching to UTF-8 (OS locales + file encoding) or playing with the -Dfile.encoding=ISO-8859-1 JVM parameter but still the same problem.
The problem doesn't occur if I hardcode the filename with the accents in the source code instead of reading it from the CSV file.
Any ideas on how to fix this?
Thank you for your help.
I have a program that grabs some strings from a location and puts them in a list. I also have an "exclusions list" that loads from a file. If the current string is in the exclusions list, it gets ignored.
In my exclusions list file, I have this string:
Something ›
Note that it is not a typical angle bracket; it's a special character (decimal value 8250).
When I run this in Eclipse, everything works perfectly. My program sees that the "Something ›" entry is in the exclusions list and ignores it. However, when I build and run my program as a jar, "Something ›" does not get ignored. Everything else works fine; it's just that one string.
I'm assuming it's because of the ›, which means it must be encoding-related. However, I have the text file saved as UTF-8 (without BOM), and my Eclipse is configured as UTF-8, too. Any ideas?
This seems to have fixed it. I changed the way it loaded the text file from:
Scanner fileIn = new Scanner(new File(filePath));
to
Scanner fileIn = new Scanner(new FileInputStream(filePath), "UTF-8");
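An equivalent one-liner, in case you prefer the charset-aware Scanner constructor that takes a Path (a sketch; it needs an import of java.nio.file.Paths and declares IOException):

Scanner fileIn = new Scanner(Paths.get(filePath), "UTF-8");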