In Java, I am generating a file that contains Unicode characters.
When I run my program on Windows (JBoss) and open the generated CSV file, Excel displays the Unicode characters (Norwegian and Icelandic) correctly.
But when I deploy the same application on a Red Hat Linux server (same JBoss version), run the program, generate the file, download it, and open it in Excel, all the Unicode characters are distorted.
Could you please suggest any Linux locale setting that could be distorting the Unicode output, or point out where a change is required?
FileWriter writer = new FileWriter(fileName);
writer.append(new String(data.toString().getBytes("UTF-8"),"UTF-8"));
writer.flush();
writer.close();
//data is StringBuilder type
I have also tried ISO8859_1.
Update 1
I have checked the system encoding using System.getProperty("file.encoding") and found that
Windows is Cp1252 and Linux is UTF-8.
Update 2
When I print on Linux using:
log.info(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
the output shows up perfectly fine, but when I write it to a file named filename.csv with FileWriter, it is not displayed correctly.
It looks like you are converting back and forth:
from data
to a String: data.toString()
to bytes: data.toString().getBytes("UTF-8")
back to a String: new String(data.toString().getBytes("UTF-8"), "UTF-8")
and then back to bytes when the writer encodes it:
writer.append(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
Try just a single conversion from the input encoding to a String, and then write out the String. That means data.toString() needs to know which encoding its input was read with. Does data support conversion from different code pages?
writer.append(data.toString(codepage));
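If the goal is simply a UTF-8 CSV that Excel opens correctly on both platforms, a minimal sketch (not the original code; writeCsv is a hypothetical wrapper around the question's fileName and data) is to pick the charset explicitly instead of relying on FileWriter's platform default, and to prepend a BOM so Excel does not assume an ANSI code page:

import java.io.*;
import java.nio.charset.StandardCharsets;

static void writeCsv(String fileName, StringBuilder data) throws IOException {
    try (Writer writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(fileName), StandardCharsets.UTF_8))) {
        writer.write('\uFEFF');        // BOM so Excel detects UTF-8 rather than the ANSI code page
        writer.write(data.toString()); // the chars are encoded exactly once, as UTF-8
    }
}

That way the output no longer depends on file.encoding, which, as noted in Update 1, differs between the two machines.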
Related
I have a small program, written in Java, that writes some Hebrew letters and some numbers to a file.
The Hebrew is written fine when I run the program from Eclipse, but if I export it to an executable JAR file and run it from there, the Hebrew turns to gibberish.
My code:
if (content.length() > 0) {
    FileWriter fileWriter = new FileWriter(path);
    BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
    bufferedWriter.write(content);
    bufferedWriter.close();
}
I have also tried using an OutputStreamWriter to set the encoding myself:
if (content.length() > 0) {
    BufferedWriter bufferedWriter = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(path), "windows-1255"));
    bufferedWriter.write(content);
    bufferedWriter.close();
}
The encodings I tried:
ISO-8859-8
windows-1255
x-IBM856
IBM862
IBM424
UTF-8
Some of them produce proper Hebrew when I run the program from Eclipse, but all of them turn the Hebrew into different kinds of gibberish when run from the JAR file.
I am not even sure the encoding in the code itself is the issue, or whether it is the way to fix it.
I am running the JAR using a batch file on Windows 10.
My java version info:
java version "10.0.1" 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
Example of output when using UTF-8.
A line from the Hebrew file (generated by Eclipse):
210001 188 13 04/09/1804/09/18 50.00 1 123456789 לירון קטלן הרא"ה 291 רמת גן 6013
The same line from the gibberish file (generated from the JAR):
210001 188 13 04/09/1804/09/18 50.00 1 123456789 לירון קטלן הר�"ה 291 רמת גן 6013
Don't mind the extra whitespace; it is supposed to be there.
The second code snippet, with the explicit encoding, is correctly cross-platform.
Check that the content is valid Unicode:
String content = "\u200F\u05D0\u05D1\u05D2\u05D3\u05D4\u200E"; // "אבגדה"
I used \u escapes so the Java source is pure ASCII; hence, even if the encoding assumed by the Java compiler and the encoding of the editor were to erroneously differ, they could not cause corrupt strings.
Assuming that content is a String:
if (!content.isEmpty()) {
    content = "\uFEFF" + content; // add a BOM char in front for Windows
    Path p = Paths.get(path);
    Files.write(p, Collections.singletonList(content), StandardCharsets.UTF_8);
}
This writes a UTF-8 file, which will cause the fewest problems, unless it is used inside Israel, where one might assume a country-specific encoding, windows-1255.
I added a BOM character as the first character of the file, so Windows can easily identify the file not as some ANSI single-byte encoding, but as UTF-8 Unicode.
Then there remains the problem of representing the Hebrew text: there must be an adequate font.
You might opt for writing an HTML file:
content = "<!DOCTYPE html><html lang="he">"
+ "<head><meta charset=\"utf-8\"></head>"
+ "<body><pre>"
+ content.replace("&", "&")
.replace("<", "<")
.replace(">", ">")
+ "</pre></body></html>";
I find that better than writing a BOM.
The last thing would be to add LTR ('\u200E') and RTL (Right-To-Left, '\u200F') mark characters, but I take it that that is not a problem here.
The usual cause is that at some place an overloaded method is used that does not take an encoding, so it defaults to the current platform encoding.
Do
new InputStreamReader(..., StandardCharsets.UTF_8)
and such.
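A minimal sketch of that rule (the file names are placeholders), putting the platform-default overloads next to their explicit-charset replacements:

import java.io.*;
import java.nio.charset.StandardCharsets;

// Platform-default encoding, not portable: new FileReader("in.txt"), new FileWriter("out.txt")
// Explicit encoding, portable:
try (Reader in = new InputStreamReader(new FileInputStream("in.txt"), StandardCharsets.UTF_8);
     Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
    // every read and write now uses UTF-8, regardless of the platform default
}

Since Java 7 there are also convenience methods such as Files.newBufferedReader(path, charset) that force you to name the charset.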
My issue is as follows:
I am having an issue with character encoding when writing to a text file: the characters are not showing the intended value. For example, I am writing ' ' (which is probably a tab character) and 'Â' is what is displayed in the text file.
Background information
This data is stored in an MS SQL database. The database collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn that the collation and type determine which character encoding is used on the database side. Values are stored correctly, so there are no issues here.
My Java application runs queries to pull the data from the DB, and this also looks OK. I have debugged the code and seen that all the Strings have the correct representation before writing to the file.
Next I write the text to the .TXT file using an OutputStreamWriter, as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
    String exportFileLocation = getExportedFileLocation();
    File offerFile = new File(getDatedFileName(exportFileLocation + "/" + clientAppName + "_OFFERRECORDS"));
    bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
Now, when I open the file on the Linux server by running the cat command on it, or open it using Notepad++, some of the characters display incorrectly.
I've run the following commands on the server to check its encoding: locale charmap prints UTF-8, echo $LANG prints en_US.UTF-8, and echo $LC_CTYPE prints nothing.
Here is what I've attempted so far.
I've attempted to change the character encoding used by the OutputStreamWriter: I've tried UTF-8 and CP1252. When I switch encodings, some characters are fixed while others then display improperly.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus question) How are we supposed to avoid issues like this from happening? The rule of thumb I was given was to use UTF-8 and you will never run into problems, but that isn't the case for me right now.
Running the file -bi command on the server revealed that the file was encoded as ASCII instead of UTF-8. Removing the file completely and rerunning the process fixed this for me.
I have some calculated data (floats and integers), which is written to a 12 MB file like this:
DataOutputStream os3;
os3 = new DataOutputStream(new FileOutputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (some loops)
os3.writeFloat(f);
os3.writeInt(r);
os3.close();
And after that I read it this way
DataInputStream is3;
is3 = new DataInputStream(new FileInputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (same loops)
is3.readFloat();
is3.readInt();
is3.close();
So, I wrote the file only once, on Windows 7. After that I only read it while the app starts. Reading the file works fine on Windows 7, but when I try to do it on Ubuntu, I get an EOFException (the code and the file are the same).
I suspect the problem may be caused by some NaN values written to the file.
BTW, while debugging I figured out that on Ubuntu the reading process gets through about 15% of the loop and then throws the exception. All the values it reads are "0.0", but the file doesn't contain zeros.
The problem could be the difference in line breaks between Linux and Windows. As mentioned by Clark, converting the end-of-line characters between \n and \r\n could resolve this. You can use Notepad++ to simplify this work. Just click in the menu:
Edit -> EOL Conversion -> Convert to UNIX Format
It might also be a problem of encoding. If you're using Windows, the default encoding is Windows-1252, also called CP-1252, which is a Microsoft-specific code page and may not be recognized on Linux. Just convert it to an encoding such as UTF-8, which every OS can handle. Using Notepad++:
Encoding -> Convert to UTF-8
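If Notepad++ is not at hand, the same encoding conversion the answer describes can be done in Java itself; a minimal sketch, assuming the file really is Windows-1252 text (the file names here are placeholders):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// read the bytes as Windows-1252, then write them back out as UTF-8
String text = new String(Files.readAllBytes(Paths.get("data.txt")), Charset.forName("windows-1252"));
Files.write(Paths.get("data-utf8.txt"), text.getBytes(StandardCharsets.UTF_8));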
I download a file from a website using a Java program, and the header looks like the one below:
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified.
What I do is after downloading I pass the name of the file to another application for further processing. I use
System.out.println(filename);
On standard out, the string is printed as Textk³rzung.asc.
How can I change the encoding of standard out to UTF-8 in Java?
I tried to encode it to "UTF-8" and the output is still the same.
Update:
I was able to fix this without any code change. In the place where I call my JAR file from the other application, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support.
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding: the character ü has Unicode code point U+00FC and is encoded in windows-1252 as the single byte 0xFC, and that byte value renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console, then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(test);
I have finished a project in which I read from a text file written with Notepad.
The characters in my text file are in Arabic, and the file encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seemed to be OK, but when I built the project as a .jar file the characters were displayed this way: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem, please?
Most likely you are using the JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading it as well. For example, this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression of how it should be implemented.
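For instance, a minimal sketch of reading the whole file line by line with the charset stated explicitly (the file name is a placeholder):

import java.io.*;
import java.nio.charset.StandardCharsets;

try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream("file.txt"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // note that System.out has its own encoding
    }
}

Keep in mind that the console you print to must also use a compatible encoding, which is exactly what the next answer demonstrates.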
Maybe this example will help a little. I will try to print the content of a UTF-8 file to the IDE console and to a system console that is encoded in Cp852.
My d:\data.txt contains ąźżćąś adsfasdf.
Let's check this code:
// I will read chars using UTF-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding
// ("Cp852" is the encoding used by my console in Win7)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
        "Cp852"), true);
// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse, the output will be
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but the output from the system console will be