My issue is as follows:
I'm having an issue with character encoding when writing to a text file: the characters are not showing their intended values. For example, I am writing ' ' (which is probably a tab character) and 'Â' is what is displayed in the text file.
Background information
This data is being stored in an MSSQL database. The database collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn that the collation and column type determine what character encoding is used on the database side. Values are stored correctly, so there are no issues here.
My Java application runs queries to pull the data from the DB, and this also looks OK. I have debugged the code and seen that all the Strings have the correct representation before being written to the file.
Next I write the text to the .TXT file using an OutputStreamWriter, as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
    String exportFileLocation = getExportedFileLocation();
    File offerFile = new File(getDatedFileName(exportFileLocation + "/" + clientAppName + "_OFFERRECORDS"));
    bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
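For reference, an equivalent constructor can be written with java.nio and StandardCharsets.UTF_8 instead of the encoding name string (a sketch only; it assumes the same bufferedWriter field and helper methods as above):

// Sketch only: an equivalent constructor using java.nio and an explicit Charset constant.
// Needs: import java.nio.charset.StandardCharsets; import java.nio.file.*;
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
    String exportFileLocation = getExportedFileLocation();
    Path offerPath = Paths.get(getDatedFileName(exportFileLocation + "/" + clientAppName + "_OFFERRECORDS"));
    OpenOption[] options = isAppend
            ? new OpenOption[] { StandardOpenOption.CREATE, StandardOpenOption.APPEND }
            : new OpenOption[] { StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING };
    // Files.newBufferedWriter always encodes with the given charset, here UTF-8.
    bufferedWriter = Files.newBufferedWriter(offerPath, StandardCharsets.UTF_8, options);
}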
Now, once I open the file on the Linux server by running cat on it, or open it in Notepad++, some of the characters display incorrectly.
I've run the following commands on the server to check its encoding: locale charmap prints UTF-8, echo $LANG prints en_US.UTF-8, and echo $LC_CTYPE prints nothing.
Here is what I've attempted so far.
I've attempted to change the character encoding used by the OutputStreamWriter: I've tried UTF-8 and CP1252. When switching encodings, some characters are fixed while others are then improperly displayed.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus question) How are we supposed to avoid issues like this? The rule of thumb I was given was "use UTF-8 and you will never run into problems", but that isn't the case for me right now.
Running the file -bi command on the server revealed that the file was encoded as ASCII instead of UTF-8. Removing the file completely and rerunning the process fixed this for me.
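If it's unclear what actually ended up in a file, a quick way to check from Java is to dump its first bytes and compare them with the expected UTF-8 sequences (a sketch only; the path is a placeholder). A plain tab is 0x09, while non-ASCII characters show up as multi-byte UTF-8 sequences.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpBytes {
    public static void main(String[] args) throws IOException {
        // Placeholder path: point this at the exported file.
        byte[] bytes = Files.readAllBytes(Paths.get("CLIENTAPP_OFFERRECORDS.txt"));
        int limit = Math.min(bytes.length, 64);
        for (int i = 0; i < limit; i++) {
            System.out.printf("%02X ", bytes[i] & 0xFF);   // print each byte as hex
        }
        System.out.println();
    }
}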
Related
I'm having a problem with Java file encoding.
I have a Java program that saves an input stream as a file with a given file name; the code snippet looks like this:
File out = new File(strFileName);
Files.copy(inStream, out.toPath());
It works fine on Windows unless the file name contains special characters like Ö; with these characters in the file name, the saved file displays a garbled file name on Windows.
I understand that applying the JVM option -Dfile.encoding=UTF-8 can fix this issue, but I would rather have a solution in my code than ask all my users to change their JVM options.
While debugging the program I can see that the file name string always shows the correct characters, so I guess the problem is not about the internal encoding.
Could someone please explain what went wrong behind the scenes, and is there a way to avoid this problem programmatically? I tried getting the bytes from the string and changing the encoding, but it doesn't work.
Thanks.
Using the URLEncoder class would work:
String name = URLEncoder.encode("fileName#", "UTF-8");
File output = new File(name);
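For example, the percent-encoded name contains only ASCII characters, so the bytes of the file name no longer depend on the platform encoding. A minimal sketch of how this suggestion could be applied (the placeholder input file and example name are just illustrations):

import java.io.InputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SaveStream {
    public static void main(String[] args) throws Exception {
        // Placeholder source; in the question the stream comes from elsewhere.
        InputStream inStream = Files.newInputStream(Paths.get("downloaded.tmp"));

        // Percent-encode the requested name so only ASCII reaches the file system.
        String name = URLEncoder.encode("Ö-file.txt", StandardCharsets.UTF_8.name());
        Files.copy(inStream, Paths.get(name)); // saved as "%C3%96-file.txt"
    }
}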
I have some calculated data (floats and integers), which is written to a 12 MB file like this:
DataOutputStream os3;
os3 = new DataOutputStream(new FileOutputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (some loops)
os3.writeFloat(f);
os3.writeInt(r);
os3.close();
And after that I read it this way
DataInputStream is3;
is3 = new DataInputStream(new FileInputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (same loops)
is3.readFloat();
is3.readInt();
is3.close();
So, I wrote the file only once, on Windows 7. After that I only read it when the app starts. Reading the file works fine on Windows 7, but when I try to do it on Ubuntu, I get an EOFException (the code and the file are the same).
I suspect the problem may be caused by some NaN values written to the file.
BTW, while debugging I figured out that on Ubuntu the reading process gets through about 15% of the loop and then throws the exception. All the values it reads are "0.0", but the file doesn't contain zeros.
The problem is probably the difference in line breaks between Linux and Windows. As mentioned by Clark, just replacing the \r\n line endings with \n could resolve this. You can use Notepad++ to simplify this work. Just click in the menu:
Edit -> EOL conversion -> Convert to UNIX format
It might also be an encoding problem. If you're using Windows, the default encoding is Windows-1252, also called CP-1252, which is a Microsoft-specific encoding and may not be recognized on Linux. Just change it to an encoding like UTF-8, which can be recognized on every OS. Using Notepad++:
Encoding -> Convert to UTF-8
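If the conversion needs to happen programmatically rather than in Notepad++, the same transcoding can be done from Java by reading with the source charset and writing with UTF-8 (a sketch only; the file names and the windows-1252 source encoding are assumptions):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Transcode {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("input-windows.txt");   // assumed to be windows-1252 text
        Path target = Paths.get("output-utf8.txt");

        // Read with the encoding the file was actually written in...
        List<String> lines = Files.readAllLines(source, Charset.forName("windows-1252"));
        // ...and write it back out as UTF-8. Files.write also terminates each line with
        // the platform's line separator, so running this on Linux converts line endings to \n.
        Files.write(target, lines, StandardCharsets.UTF_8);
    }
}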
In Java, I am generating a file containing Unicode characters.
When I run my program on Windows (JBoss) and open the file (CSV), it displays the Unicode characters (Norwegian and Icelandic) fine in Excel.
But when I deploy the same thing to a server running Red Hat Linux (same JBoss version), run the program, generate the file, and download it, then when I view it in Excel all the Unicode characters are distorted.
Could you please suggest any local Linux setting that might be distorting the Unicode output, or where a change is required?
FileWriter writer = new FileWriter(fileName);
writer.append(new String(data.toString().getBytes("UTF-8"),"UTF-8"));
writer.flush();
writer.close();
//data is StringBuilder type
I have also tried ISO8859_1
Update 1
I have checked the system encoding using System.getProperty("file.encoding") and found that
Windows is Cp1252 and Linux is UTF-8
Update 2
When I print on Linux using:
log.info(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
it shows all the output perfectly fine, but when I write it with the FileWriter to filename.csv, it does not display correctly.
It looks like you are translating from bytes
data
to String
data.toString()
to bytes
data.toString().getBytes("UTF-8")
to String
new String(data.toString().getBytes("UTF-8"), "UTF-8")
to bytes
writer.append(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
Try just a single conversion from the input encoding to a String and then write out the String. So the data.toString() needs to know what encoding it is reading. Does data support conversion from different code pages?
writer.append(data.toString(codepage));
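One way to apply that advice here, given that data is a StringBuilder, is to drop the extra conversions and let a writer with an explicit charset perform the single String-to-bytes step. A sketch only (the file name is a placeholder; FileWriter is avoided because its no-charset constructor uses the platform default encoding, and the optional BOM is included because Excel often needs it to detect UTF-8 in a CSV):

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class CsvExport {
    public static void main(String[] args) throws IOException {
        StringBuilder data = new StringBuilder("Reykjavík;Tromsø;Þórshöfn\n");

        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("filename.csv"), StandardCharsets.UTF_8))) {
            writer.write('\uFEFF');          // optional UTF-8 BOM; often helps Excel pick the right encoding
            writer.write(data.toString());   // the single String-to-bytes conversion happens in the writer
        }
    }
}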
I download a file from a website using a Java program, and the header looks like this:
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified.
After downloading, I pass the name of the file to another application for further processing. I use
System.out.println(filename);
On standard out the string is printed as Textk³rzung.asc
How can I change standard out to "UTF-8" in Java?
I tried encoding to "UTF-8" and the content is still the same.
Update:
I was able to fix this without any code change. In the place where I call my jar file from the other application, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support.
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(test);
I have a problem where I can't write files with accents in the file name on Solaris.
Given the following code:
public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
I get the following output:
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on StackOverflow, I tried to call the Java app by adding "-Dfile.encoding=UTF-8" at startup.
This returns the following output:
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef
All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding to the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
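Those byte values can be checked from Java itself by encoding the characters with each charset and printing the result (a sketch, assuming the source file containing the literal is itself compiled with the correct encoding, which is what the next point is about):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ShowBytes {
    public static void main(String[] args) {
        String text = "ëéö";
        for (Charset cs : new Charset[] { StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8 }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : text.getBytes(cs)) {
                hex.append(String.format("%02x ", b & 0xFF));
            }
            // Expected: ISO-8859-1 -> "eb e9 f6", UTF-8 -> "c3 ab c3 a9 c3 b6"
            System.out.println(cs + ": " + hex);
        }
    }
}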
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
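For example, the literal from the question can be written entirely with escapes, so the compiled class no longer depends on the encoding the editor saved the source in (a sketch):

import java.io.FileWriter;
import java.io.IOException;

public class EscapedName {
    public static void main(String[] args) throws IOException {
        // "testéörtkuoë" spelled with Unicode escapes, so the compiler's -encoding
        // setting no longer matters for this literal.
        String name = "test\u00E9\u00F6rtkuo\u00EB";
        try (FileWriter fw = new FileWriter(name)) {
            fw.write(name);
        }
    }
}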
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.
If you attempt to list the filenames with the java io apis, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.
What happens when you do:
ls > testéörtkuoë
If that works (writes to the file correctly), then you know you can write to files with accents.
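To list the filenames with the Java io APIs, as suggested above, something like this can show whether the names survive the round trip (a sketch; it just lists the current directory for comparison with ls):

import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        // Compare this output with what ls shows in the same directory.
        for (String name : new File(".").list()) {
            System.out.println(name);
        }
    }
}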
I had a similar problem. Contrary to that example, the program was unable to list the files correctly using System.out.println, even though ls was showing the correct values.
As described in the documentation, the system property file.encoding should not be used to define the charset, and in this case the JVM ignores it.
The symptoms:
I could not type accents in the shell.
ls was showing correct values.
File.list() was printing incorrect values.
The file.encoding property was not affecting the output.
The user.language and user.country properties were not affecting the output.
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, the locale command showed different values:
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply exporting LANG. This environment variable really does affect the JVM:
LANG=pt_BR.ISO8859-1
export LANG
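Once LANG is exported, the effect can be confirmed from inside the JVM (a sketch):

import java.nio.charset.Charset;

public class CheckLocale {
    public static void main(String[] args) {
        // With LANG=pt_BR.ISO8859-1 exported, these should no longer report the "C" defaults.
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("user.language  = " + System.getProperty("user.language"));
        System.out.println("user.country   = " + System.getProperty("user.country"));
    }
}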
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
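Since Java 7, the same thing can also be written with java.nio.file, which likewise takes an explicit Charset (a sketch; the paths are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CopyUtf8 {
    public static void main(String[] args) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
        }
    }
}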