Java text encoding

I read lines from a .txt file into a String list and show the text in a JTextPane. The encoding is fine when running from Eclipse or NetBeans; however, if I create a jar, the encoding is not correct. The encoding of the file is UTF-8. Is there a way to solve this problem?

Your problem is probably that you're opening a reader using the platform encoding.
You should manually specify the encoding whenever you convert between bytes and characters. If you know that the appropriate encoding is UTF-8 you can open a file thus:
FileInputStream inputFile = new FileInputStream(myFile);
try {
    Reader reader = new InputStreamReader(inputFile, "UTF-8");
    // Maybe buffer the reader and do something with it.
} finally {
    inputFile.close();
}
Libraries like Guava can make this whole process easier.
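On Java 7+, the same idea can be written more compactly with NIO, which takes the charset explicitly. A minimal sketch (the file and its contents are made up for illustration):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Utf8ReadDemo {
    // Read all lines of a file, decoding explicitly as UTF-8
    // instead of relying on the platform default charset.
    static List<String> readUtf8Lines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        // Write a line containing non-ASCII characters as UTF-8 bytes.
        try (Writer w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            w.write("L'uomo più forte\n");
        }
        // Reading with the explicit charset round-trips correctly
        // regardless of the platform default encoding.
        List<String> lines = readUtf8Lines(tmp);
        System.out.println(lines.get(0));
        Files.delete(tmp);
    }
}
```

Because the charset is passed in code, the result no longer changes between the IDE and a double-clicked jar.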

Have you tried running your jar as
java -Dfile.encoding=utf-8 -jar xxx.jar

Related

Character encoding while parsing a file with Java on Linux

In a Java program, I am trying to read a file that contains the name of a file on my Linux filesystem. It points to a file that was generated on a Windows OS and has accents in its name.
Example of this kind of file "input.csv" :
MYFILE_tést.doc;1
The java program parses the file in question and verifies if that file exists. But for those lines containing accents, file.exists() in Java always returns false.
The file "input.csv" is generated in Windows and encoded in ISO-8859-1
Linux locales are configured like this :
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_PAPER="en_US.ISO-8859-1"
LC_NAME="en_US.ISO-8859-1"
LC_ADDRESS="en_US.ISO-8859-1"
LC_TELEPHONE="en_US.ISO-8859-1"
LC_MEASUREMENT="en_US.ISO-8859-1"
LC_IDENTIFICATION="en_US.ISO-8859-1"
LC_ALL=en_US.ISO-8859-1
When reading the CSV file in Java, I'm forcing the encoding:
csvFile = new BufferedReader(new InputStreamReader(
        new FileInputStream(FILE_CSV), "ISO-8859-1"));
I tried switching to UTF-8 (OS locales + file encoding) and playing with the -Dfile.encoding=ISO-8859-1 JVM parameter, but the problem remains the same.
The problem doesn't occur if I hardcode the filename with the accents in the source code instead of reading it from the CSV file.
Any ideas on how to fix this?
Thank you for your help
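One way to narrow a problem like this down is to stop probing the decoded name directly and instead list the directory and compare, which shows whether the name survived the charset round trip; the JVM's own filename encoding is reported by the sun.jnu.encoding property. A diagnostic sketch, not the asker's code; existsByListing and the sample name are made up for illustration:

```java
import java.io.File;

public class FilenameCheck {
    // Diagnose a "file.exists() returns false" problem: list the
    // directory and compare entry names, which reveals whether the
    // decoded name matches what the filesystem actually reports.
    static boolean existsByListing(File dir, String name) {
        String[] entries = dir.list();
        if (entries == null) return false;
        for (String e : entries) {
            if (e.equals(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        File dir = new File(".");
        String name = "MYFILE_tést.doc"; // name as read from input.csv
        System.out.println("direct exists:    " + new File(dir, name).exists());
        System.out.println("found by listing: " + existsByListing(dir, name));
        // The encoding the JVM uses for filenames (undocumented property):
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}
```

If the listing shows a visually identical name that still fails equals(), the bytes on disk and the bytes decoded from the CSV differ, and the fix is to align the locale (sun.jnu.encoding) with the encoding used when the file was created.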

Create bat from java code keeping apostrophes and accented characters

I have created a Java GUI to generate bat files.
When I write a bat containing a string like "L’uomo più forte", Notepad++ shows this: "L?uomo pi— forte"
Here is the code:
FileOutputStream fos = new FileOutputStream(bat);
Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp850"));
String stringa = "L’uomo più forte"
w.write(stringa);
w.write("\n");
w.write("pause");
w.write("\n");
w.flush();
w.close();
I had to use Cp850 for DOS use. Using the default charset, the bat gives an error.
Solutions?
Instead of using "Cp850":
Writer w = new BufferedWriter(new OutputStreamWriter(fos, "Cp850"));
Try using "UTF-8":
Writer w = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"));
Don't forget to place a semicolon (;) at the end of your stringa string variable declaration, and surround your code in a try/catch block to handle the possible FileNotFoundException, UnsupportedEncodingException, and IOException.
Also, for Notepad, to produce a new line you need to supply the \r carriage return as well:
w.write("\r\n");
You may do the following:
Open a cmd.exe command prompt session.
Execute echo L’uomo più forte>text.txt command.
Open text.txt file and copy the string there.
Paste it in your code.
For example, I created the text.txt file, renamed it to test.bat and appended an echo command. This is my test.bat file:
@echo off
echo L'uomo pi— forte
... and this is the output:
L'uomo più forte
If some character is not displayed correctly, then the code page used does not contain such a character.
Note: I suggest you use the standard Windows Notepad. The Notepad++ program may produce strange output in these cases.
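Putting the two suggestions together (an explicit charset plus \r\n line endings), a minimal sketch might look like the following; whether Cp850 or UTF-8 is the right charset depends on the code page your cmd.exe session actually uses, and the helper name writeBat is made up for illustration:

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class BatWriter {
    // Write a batch file with an explicit charset and DOS (CRLF) line
    // endings, instead of the platform defaults.
    static void writeBat(String path, String charset, String... lines) throws IOException {
        try (Writer w = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(path), charset))) {
            for (String line : lines) {
                w.write(line);
                w.write("\r\n"); // Notepad needs CRLF, not just \n
            }
        }
    }

    public static void main(String[] args) throws IOException {
        writeBat("test.bat", "Cp850",
                "@echo off",
                "echo L'uomo più forte",
                "pause");
    }
}
```

The try-with-resources block also guarantees the writer is flushed and closed, which the original code did by hand.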

Different Result in Java Netbeans Program

I am working on a small program to find text in a text file but I am getting a different result depending how I run my program.
When running my program from Netbeans I get 866 matches.
When running my program by double clicking on the .jar file in the DIST folder, I get 1209 matches (The correct number)
It seems that when I'm running the program from Netbeans, it doesn't get to the end of the text file. Is that to be expected?
Text File in question
Here is my code for reading the file:
@FXML
public void loadFile(){
    // Loading file
    try {
        linelist.clear();
        aclist.clear();
        reader = new Scanner(new File(filepathinput));
        while (reader.hasNext()){
            linelist.add(reader.nextLine());
        }
        for (int i = 0; i < linelist.size()-1; i++){
            if (linelist.get(i).startsWith("AC#")){
                aclist.add(linelist.get(i));
            }
        }
    }
    catch (java.io.FileNotFoundException e){
        System.out.println(e);
    }
    finally {
        String accountString = String.valueOf(aclist.size());
        account.setText(accountString);
        reader.close();
    }
}
The problem is an incompatibility between the Java app's (i.e. the JVM's) default file encoding and the input file's encoding.
The file's encoding is "ANSI", which commonly maps to the Windows-1252 encoding (or one of its variants) on Windows machines.
When running the app from the command prompt, the JVM (and therefore the Scanner, implicitly) takes the system default file encoding, which is Windows-1252. Reading the same encoded file with this setup does not cause the problem.
However, Netbeans by default sets the project encoding to UTF-8, so when running the app from Netbeans its file encoding is UTF-8. Reading the file with this encoding confuses the scanner. The character "ï" (0xEF) in the text "Caraïbes" is the cause of the problem: since 0xEF is the first byte of the UTF-8 BOM sequence (0xEF 0xBB 0xBF), it somehow messes up the scanner.
As a solution,
either specify the encoding type of the scanner explicitly
reader = new Scanner(file, "windows-1252");
or convert the input file encoding to utf-8 using notepad or better notepad++, and set encoding type to utf-8 without using system default.
reader = new Scanner(file, "utf-8");
However, when different OSes are considered, working with UTF-8 everywhere is the preferred way of dealing with multi-platform environments. Hence the second way is the one to go with.
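The fix above can be sketched end to end: pass the charset to the Scanner so the count no longer depends on the JVM default. A minimal sketch; the helper name countPrefixed and the sample data are made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public class ScanCount {
    // Count lines starting with a prefix, decoding the file with an
    // explicit charset so the result does not depend on the JVM default.
    static int countPrefixed(File file, String charsetName, String prefix) throws IOException {
        int count = 0;
        try (Scanner reader = new Scanner(file, charsetName)) {
            while (reader.hasNextLine()) {
                if (reader.nextLine().startsWith(prefix)) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("acs", ".txt");
        Files.write(f.toPath(), "AC#1\nother\nAC#2\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countPrefixed(f, "UTF-8", "AC#")); // prints 2
        f.delete();
    }
}
```

Note the sketch also uses hasNextLine() to pair with nextLine(), and iterates over every collected line rather than stopping at size()-1, which would silently skip the last one.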
It can also depend on the filepathinput input: the jar and Netbeans might be referring to two different files, possibly with the same name in different locations. Can you give more information on the filepathinput variable's value?

java - File charset

I have an application which processes some text and then saves it to a file.
When I run it from the NetBeans IDE, both System.out and PrintWriter work correctly and non-ASCII characters are displayed/saved correctly. But if I run the JAR from the Windows 7 command line (which uses the cp1250 (central european) encoding in this case), screen output and the saved file are broken.
I tried to put UTF-8 to PrintWriter's constructor, but it didn't help… And it can't affect System.out, which will be corrupted even after this.
Why is it working in the IDE and not in cmd.exe?
I would understand that System.out has some problems, but why is also output file affected?
How can I fix this issue?
I just had the same problem.
The actual reason is that when your code is run in the NetBeans environment, NetBeans automatically sets system properties.
You can see that when you run your code with NetBeans, the code below probably prints "UTF-8". But when you run it with cmd, you will see the system default instead (cp1250 in your case).
System.getProperty("file.encoding");
You should note that while using 'setProperty' will change the output of the 'getProperty' function, it has no effect on input/output streams, because they are all set up before the main function is called.
With this background in mind, when you want to read from files and write to them, it's better to use the code below:
File f = new File(sourcePath);
For reading:
InputStreamReader isr = new InputStreamReader(
new FileInputStream(f), Charset.forName("UTF-8"));
and for writing (I have not tested this):
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream(f), Charset.forName("UTF-8"));
The main difference is that these classes take the required Charset in their constructors, while classes like FileWriter and PrintWriter don't.
I hope that works for you.
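For the System.out half of the problem, one option is to re-wrap stdout in a PrintStream with an explicit charset; whether the glyphs then display correctly still depends on the console's code page (e.g. "chcp 65001" for UTF-8 in cmd.exe). A hedged sketch, with the helper name utf8PrintStream made up for illustration:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleCharset {
    // Wrap any OutputStream in a PrintStream with an explicit charset,
    // instead of inheriting the platform default encoding.
    static PrintStream utf8PrintStream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace stdout with a UTF-8-encoding stream; the bytes written
        // are now UTF-8 regardless of file.encoding.
        System.setOut(utf8PrintStream(System.out));
        System.out.println("Caraïbes");
    }
}
```

This is the same trick as the file readers/writers above: do the byte/character conversion with a charset you chose, not one the JVM chose for you.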

How to print the content of a tar.gz file with Java?

I have to implement an application that permits printing the content of all files within a tar.gz file.
For Example:
if I have three files like this in a folder called testx:
A.txt contains the words "God Save The queen"
B.txt contains the words "Ubi maior, minor cessat"
C.txt.gz is a file compressed with gzip that contain the file c.txt with the words "Hello America!!"
So I compress testx, obtain the compressed tar file: testx.tar.gz.
So with my Java application I would like to print in the console:
"God Save The queen"
"Ubi maior, minor cessat"
"Hello America!!"
I have implemented the ZIP version and it works well, but using the tar library from Apache (http://commons.apache.org/compress/), I noticed that it is not as easy as the ZIP Java utils.
Could someone help me?
I have started looking on the net to understand how to accomplish my aim, so I have the following code:
GZIPInputStream gzipInputStream = new GZIPInputStream(new FileInputStream(fileName));
TarInputStream is = new TarInputStream(gzipInputStream);
TarEntry entryx = null;
while ((entryx = is.getNextEntry()) != null) {
    if (entryx.isDirectory()) continue;
    else {
        System.out.println(entryx.getName());
        if (entryx.getName().endsWith("txt.gz")) {
            is.copyEntryContents(out); // out is an OutputStream!!
        }
    }
}
So at the line is.copyEntryContents(out) it is possible to save the stream to a file by passing an OutputStream, but I don't want that! In the ZIP version, after getting the first entry (ZipEntry), we can extract the stream from the compressed root folder, testx.tar.gz, and then create a new ZipInputStream and play with it to obtain the content.
Is it possible to do this with the tar.gz file?
Thanks.
Surfing the net, I encountered an interesting idea at http://hype-free.blogspot.com/2009/10/using-tarinputstream-from-java.html.
After converting our TarEntry to a stream, we can apply the same idea used with ZIP files:
InputStream tmpIn = new StreamingTarEntry(is, entryx.getSize());
// use BufferedReader to get one line at a time
BufferedReader gzipReader = new BufferedReader(
        new InputStreamReader(
                new GZIPInputStream(tmpIn)));
while (gzipReader.ready()) { System.out.println(gzipReader.readLine()); }
gzipReader.close();
So with this code you can print the content of the file testx.tar.gz ^_^
To avoid having to write to a file, you should use a ByteArrayOutputStream and call public String toString(String charsetName) with the correct encoding.
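That last suggestion can be sketched without the tar library at all: drain any InputStream (such as the one copyEntryContents would otherwise write to a file) into a ByteArrayOutputStream and decode it in memory. The helper name drain is made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToString {
    // Drain a stream into memory and decode it with an explicit charset,
    // avoiding a temporary file entirely.
    static String drain(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toString(charsetName);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a tar entry's stream:
        InputStream in = new ByteArrayInputStream("Hello America!!".getBytes("UTF-8"));
        System.out.println(drain(in, "UTF-8")); // prints Hello America!!
    }
}
```

In the tar loop, the same method would be called with the entry's stream (or with a GZIPInputStream wrapped around it for the .txt.gz entry).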
