encode œ correctly with UTF-8 - java

I am having problems writing the following string to a file correctly, especially the character "œ". The problem appears both on my local machine (Windows 7) and on the server (Linux).
String: "Cœurs d’artichauts grillées"
Works (œ is displayed correctly, while the apostrophe is turned into a question mark):
Files.write(path, content.getBytes(StandardCharsets.ISO_8859_1));
Does not work (result in file):
Files.write(path, content.getBytes(StandardCharsets.UTF_8));
According to the first answer to this question, UTF-8 should be able to encode the œ correctly as well. Does anyone have an idea what I am doing wrong?

Your second approach works
String content = "Cœurs d’artichauts grillées";
Path path = Paths.get("out.txt");
Files.write(path, content.getBytes(Charset.forName("UTF-8")));
This produces an out.txt file containing:
Cœurs d’artichauts grillées
Most likely the editor you are using is not displaying the content correctly. You might have to force your editor to use the UTF-8 encoding and a font that can display œ and other non-ASCII characters. Notepad++ and IntelliJ IDEA work out of the box.
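One way to rule out the editor is to look at the bytes that were written rather than at how they are rendered. The following is a minimal sketch, not part of the original answer; the class name, the helper methods, and the temporary file are assumptions made for illustration. It writes the string as UTF-8, reads the raw bytes back, and prints the UTF-8 byte sequence for œ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriteCheck {

    // Writes the text as UTF-8 and returns the raw bytes that ended up on disk.
    static byte[] writeUtf8(Path path, String content) throws IOException {
        Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        return Files.readAllBytes(path);
    }

    // Formats bytes as space-separated hex, e.g. "C5 93".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the example independent of the source file encoding.
        String content = "C\u0153urs d\u2019artichauts grill\u00e9es";
        Path path = Files.createTempFile("out", ".txt"); // hypothetical temp file
        byte[] onDisk = writeUtf8(path, content);

        // The character U+0153 takes two bytes in UTF-8.
        System.out.println(toHex("\u0153".getBytes(StandardCharsets.UTF_8))); // prints C5 93
        // Decoding the file as UTF-8 gives back the original string.
        System.out.println(new String(onDisk, StandardCharsets.UTF_8).equals(content)); // prints true
        Files.delete(path);
    }
}
```

If the round trip prints true, the file is correct and any garbling comes from the program used to view it.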

Related

A problem in showing Unicode Character when running Java artifact, but everything ok while running in IntelliJ IDEA

The goal is to read records from the database and write them to a file.
When I run the code in IntelliJ IDEA, it writes Unicode characters exactly as they appear in the database.
But when I build the artifact (JAR file) and run it on Windows, the output file shows question marks ('?') instead of the database content.
In other words, English characters and numbers are shown correctly, but the problem occurs with non-English characters (e.g. Persian, Arabic, ...).
Relevant parts of the Java code:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("outputFile.txt", true), "cp1256"));
while (resultSet.next()) {
try {
singleRow = resultSet.getString("CODE") + "|"
+ resultSet.getString("ACTIVITY") + "|"
+ resultSet.getString("TEL") + "|"
+ resultSet.getString("ZIPCD") + "|"
+ resultSet.getString("ADDR");
} catch (Exception e) {
LogUtil.writeLog(Constants.LOG_ERROR, e.getMessage());
}
out.write(singleRow + System.getProperty("line.separator"));
}
Output file content by running IntelliJ IDEA DEBUG mode:
130143|Active|ابتداي بلوار ميرداماد،کوچه سوم پلاک پنج|524|35254410
190730|Active|خیابان زیتون، بین انوشه و زیبا پلاک یک|771|92542001
Output file content by running corresponding JAR File:
130143|Active|35254410|524|??? ? ??? ??????? ????? ????
190730|Active|92542001|771|????? ??? ??????? ????? ??? ??
Could you please tell me what is wrong with the program?

You must change your code as follows:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("outputFile.txt", true), StandardCharsets.UTF_8));
while (resultSet.next()) {
try {
singleRow = resultSet.getString("CODE") + "|"
+ resultSet.getString("ACTIVITY") + "|"
+ resultSet.getString("TEL") + "|"
+ resultSet.getString("ZIPCD") + "|"
+ resultSet.getString("ADDR") ;
} catch (Exception e) {
LogUtil.writeLog(Constants.LOG_ERROR, e.getMessage());
}
byte[] bytes = singleRow.getBytes(StandardCharsets.UTF_8);
String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
out.write(utf8EncodedString + System.getProperty("line.separator"));
}
String.getBytes() with no arguments uses the system default character set. You can see your environment's charset with:
System.out.println("Charset.defaultCharset="+ Charset.defaultCharset());
When running from IntelliJ, the system default character set is taken from the IntelliJ environment.
When running from a JAR file, the system default character set is taken from the operating system (explained at the end).
Because your Windows and IntelliJ environments use different charsets, you get different output.
It is highly recommended to explicitly specify "ISO-8859-1", "US-ASCII", "UTF-8", or whatever character set you want when converting bytes into Strings or vice versa:
singleRow.getBytes(StandardCharsets.UTF_8)
See this link for more information.
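The difference between the default and an explicit charset can be sketched as follows. This is an illustration only; the class name and the encode helper are assumptions, not part of the answer. The no-argument getBytes() call depends on whatever Charset.defaultCharset() reports in the current environment, while the call with an explicit charset produces the same bytes everywhere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetDemo {

    // Encodes with a named charset, so the result does not depend on the JVM default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u06CC"; // Persian yeh

        // Varies between environments (IDE run configuration, OS locale, JVM flags).
        System.out.println("Charset.defaultCharset=" + Charset.defaultCharset());

        // Depends on the default above; a character the default charset
        // cannot represent is replaced, typically by '?'.
        System.out.println(Arrays.toString(text.getBytes()));

        // The same two bytes (0xDB 0x8C) regardless of the default.
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8))); // prints [-37, -116]

        // US-ASCII has no mapping for this character, so it becomes '?' (63).
        System.out.println(Arrays.toString(encode(text, StandardCharsets.US_ASCII))); // prints [63]
    }
}
```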
What are Windows-1252 and Windows-1256?
Windows-1252
Windows-1252 or CP-1252 (code page 1252) is a single-byte (0-255) character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.
The first 128 codes (0-127) are the same as standard ASCII. The other codes (128-255) hold additional characters, such as the accented letters used by Spanish, French, and German.
Windows-1256
Windows-1256 is a code page used to write Arabic (and possibly some other languages that use Arabic script, like Persian and Urdu) under Microsoft Windows.
It also contains some Windows-1252 Latin characters used for French, since this European language has some historic relevance in former French colonies in North Africa. This allowed French and Arabic text to be intermixed when using Windows-1256 without any need for code-page switching (however, upper-case letters with diacritics were not included).
What should I do when using Unicode (Persian) characters?
Persian has characters that look similar to Arabic ones but are different code points, such as “ی” (U+06CC) and “ي” (U+064A). Because Windows-1256 has no U+06CC character, this encoding will replace “ی” (U+06CC) with “ي” (U+064A).
For Persian, use UTF-8 instead of Windows-1256 to avoid encoding problems.
Keep in mind that Windows-1256 uses only 1 byte per character, while UTF-8 takes more (1 to 4 bytes).
A comparison of these encodings is here.
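Whether a charset can represent a given character can be checked from Java with CharsetEncoder.canEncode. The sketch below is illustrative only; the class and method names are assumptions, and it assumes the running JDK provides the windows-1256 charset (a trimmed runtime without the extended charsets would throw UnsupportedCharsetException). In line with the statement above that Windows-1256 has no U+06CC, the check for the Persian yeh is expected to report false. Note that when Java itself encodes an unmappable character with getBytes, it substitutes the charset's replacement byte, normally '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PersianYehCheck {

    // True if every character of the text has a mapping in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Assumption: the JDK in use ships the windows-1256 charset.
        Charset cp1256 = Charset.forName("windows-1256");

        String persianYeh = "\u06CC";
        String arabicYeh = "\u064A";

        System.out.println("Arabic yeh in windows-1256:  " + canEncode(arabicYeh, cp1256));
        System.out.println("Persian yeh in windows-1256: " + canEncode(persianYeh, cp1256));
        System.out.println("Persian yeh in UTF-8:        " + canEncode(persianYeh, StandardCharsets.UTF_8));

        // One byte per character in Windows-1256, two bytes for the same letter in UTF-8.
        System.out.println(arabicYeh.getBytes(cp1256).length);
        System.out.println(arabicYeh.getBytes(StandardCharsets.UTF_8).length);
    }
}
```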
How to change the Windows default character set?
On Microsoft Windows, Windows-1252 is currently the default encoding in most Western countries.
To change your Windows default character set to a suitable one, follow this.
If you change the setting to Persian, your default charset will be changed to Windows-1256.
How to change the character set of specific software (some used for programming)?
Change the encoding of each program according to its own instructions.
1- For Notepad++
2- In an XML file or field
3- For IntelliJ files
Open the desired file for editing.
From the main menu, select File | File encoding or click the file encoding on the status bar.
Select the desired encoding from the popup.
If a warning icon is displayed next to the selected encoding, it means that this encoding might change the file contents. In this case, IntelliJ IDEA opens a dialog where you can decide what you want to do with the file: choose Reload to load the file in the editor from disk and apply encoding changes to the editor only, or choose Convert to overwrite the file with the encoding of your choice.
4- IntelliJ console output encoding
IntelliJ IDEA creates files using the IDE encoding defined in the File Encodings page of the Settings / Preferences dialog Ctrl+Alt+S. You can use either the system default or select from the list of available encodings. By default, this encoding affects console output. If you want the encoding for console output to be different from the global IDE settings, configure the corresponding JVM option:
On the Help menu, click Edit Custom VM Options.
Add the -Dconsole.encoding option and set the value to the necessary encoding. For example: -Dconsole.encoding=UTF-8
Restart IntelliJ IDEA.

Java saving a file with special characters in file name

I'm having a problem with Java file encoding.
I have a Java program that saves an input stream as a file with a given file name. The code snippet looks like this:
File out = new File(strFileName);
Files.copy(inStream, out.toPath());
It works fine on Windows unless the file name contains special characters like Ö. With these characters in the file name, the saved file shows a garbled file name on Windows.
I understand that applying the JVM option -Dfile.encoding=UTF-8 can fix this issue, but I would rather have a solution in my code than ask all my users to change their JVM options.
While debugging the program I can see that the file name string always shows the correct characters, so I guess the problem is not about internal encoding.
Could someone please explain what goes wrong behind the scenes? And is there a way to avoid this problem programmatically? I tried getting the bytes from the string and changing the encoding, but it doesn't work.
Thanks.

Using the URLEncoder class would work:
String name = URLEncoder.encode("fileName#", "UTF-8");
File output = new File(name);
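To make the effect of this suggestion concrete, here is a small sketch; the class name, the helper methods, and the sample file name are assumptions for illustration. URLEncoder percent-encodes every non-ASCII character, so the name stored on disk contains only ASCII, and URLDecoder recovers the original name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodedFileName {

    // Percent-encodes a file name so that only ASCII characters remain.
    static String toAsciiName(String name) throws UnsupportedEncodingException {
        return URLEncoder.encode(name, "UTF-8");
    }

    // Recovers the original name from the percent-encoded form.
    static String fromAsciiName(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u00D6.txt"; // hypothetical file name starting with O-umlaut
        String encoded = toAsciiName(original);
        System.out.println(encoded);                                 // prints %C3%96.txt
        System.out.println(fromAsciiName(encoded).equals(original)); // prints true
    }
}
```

The trade-off is that the file is stored under the encoded name (%C3%96.txt here), not under the name with the original character, and spaces are turned into '+'.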

Java Character Encoding Writing to Text File

My issue is as follows:
I'm having an issue with character encoding when writing to a text file: characters do not show the intended value. For example, I am writing ' ' (which is probably a tab character) and 'Â' is what is displayed in the text file.
Background information
This data is stored in an MS SQL database. The database collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn that the collation and type determine which character encoding is used on the database side. Values are stored correctly, so there are no issues here.
My Java application runs queries to pull the data from the DB, and this also looks OK. I have debugged the code and seen that all the Strings have the correct representation before they are written to the file.
Next I write the text to the .TXT file using an OutputStreamWriter as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
String exportFileLocation = getExportedFileLocation();
File offerFile = new File(getDatedFileName(exportFileLocation+"/"+clientAppName+"_OFFERRECORDS"));
bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
Now, when I open the file on the Linux server by running the cat command on it, or open it in Notepad++, some of the characters are displayed incorrectly.
I've run the following commands on the server to see its encoding: locale charmap prints UTF-8, echo $LANG prints en_US.UTF-8, and echo $LC_CTYPE prints nothing.
Here is what I've attempted so far.
I've tried changing the character encoding used by the OutputStreamWriter to UTF-8 and to CP1252. When switching encodings, some characters are fixed while others are then displayed improperly.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus question) How are we supposed to keep issues like this from happening? The rule of thumb I was given was "use UTF-8 and you will never run into problems", but this isn't the case for me right now.

Running the file -bi command on the server revealed that the file was encoded as ASCII instead of UTF-8. Removing the file completely and rerunning the process fixed this for me.
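As background on how a stray 'Â' can appear, here is a sketch of one common mechanism. It is offered as an assumption for illustration, not as a diagnosis of the case above: a no-break space (U+00A0) encoded as UTF-8 becomes the two bytes C2 A0, and a viewer that decodes those bytes as ISO-8859-1 shows the first byte as 'Â'. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes text as UTF-8, then decodes the bytes as ISO-8859-1,
    // which is what a viewer does when it assumes the wrong encoding.
    static String viewAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // no-break space, assumed here for illustration
        String seen = viewAsLatin1(nbsp);

        // One character was written, but two are displayed.
        System.out.println(seen.length());              // prints 2
        // The first displayed character is U+00C2, capital A with circumflex.
        System.out.println(seen.charAt(0) == '\u00C2'); // prints true
    }
}
```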

Unicode working on Windows but not Red Hat Linux : Java

In Java, I am generating a file that contains Unicode characters.
When I run my program on Windows (JBoss) and open the file (CSV), it displays the Unicode characters (Norwegian and Icelandic) fine in Excel.
But when I deploy the same program on a Red Hat Linux server (same JBoss version), run it, generate the file, download it, and open it in Excel, all the Unicode characters are distorted.
Could you please suggest any local Linux setting that could be distorting the Unicode characters, or where a change is required?
FileWriter writer = new FileWriter(fileName);
writer.append(new String(data.toString().getBytes("UTF-8"),"UTF-8"));
writer.flush();
writer.close();
//data is StringBuilder type
I have also tried ISO8859_1
Update 1
I have checked the system encoding using System.getProperty("file.encoding") and found that
Windows uses Cp1252 and Linux uses UTF-8.
Update 2
When I print on Linux using:
log.info(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
it shows all output perfectly fine, but when I write it with FileWriter to a file named filename.csv, it is not displayed correctly.

It looks like you are translating from bytes
data
to String
data.toString()
to Bytes
data.toString().getBytes("UTF-8")
to String
new String(data.toString().getBytes("UTF-8"),"UTF-8")
to bytes
writer.append(new String(data.toString().getBytes("UTF-8"),"UTF-8"));
Try just a single conversion from the input encoding to a String and then write out the String. So the data.toString() needs to know what encoding it is reading. Does data support conversion from different code pages?
writer.append(data.toString(codepage));
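A related point: FileWriter(String) encodes with the platform default charset, which according to Update 1 differs between the two machines. The sketch below is an alternative, not the answerer's method; the class name, sample rows, and temporary file are assumptions. It writes the StringBuilder content through a writer with an explicit charset, so both platforms produce the same bytes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvUtf8Writer {

    // Writes the text with an explicit charset instead of the platform default
    // that FileWriter(String) would use.
    static void writeUtf8(Path file, CharSequence data) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.append(data);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample rows with Norwegian and Icelandic letters.
        StringBuilder data = new StringBuilder("name;town\n\u00C5se;\u00DE\u00F3rsh\u00F6fn\n");
        Path file = Files.createTempFile("export", ".csv"); // hypothetical output file
        writeUtf8(file, data);

        // Reading the bytes back as UTF-8 reproduces the original text.
        String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println(readBack.contentEquals(data)); // prints true
        Files.delete(file);
    }
}
```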

Spanish character óé display error in Java properties

When I process a properties file with the Spanish characters ó and é, characters are displayed as ?. I tried different ways to fix this, but still fail:
I tried to use \uxxxx
I tried to use InputStreamReader with encoding UTF-8
I tried to convert string to bytes and then create a new String from those bytes:
new String( val.getBytes("UTF-8"), "UTF-8")
Nothing worked. What should I do next to fix this issue? Japanese and Russian are still OK.

The properties file needs to be in the proper encoding. By default, some IDEs like Eclipse save the content using CP1252, but you are reading the file as UTF-8. The same applies to your Java source code.
If you use \uxxxx escapes but your application works with CP1252 by default, converting the escape code results in a bad character.
If you use an InputStreamReader to force reading as UTF-8 but your code and/or your file are not actually in UTF-8, the result is a bad character.
If you apply a UTF-8 conversion to a string but your source code is in CP1252, you will have the same problem.
Related previous answer about source code: Should source code be saved in UTF-8 format
Notepad++ has a "Format" menu that shows the encoding of the file and lets you change it. There you can view the file as if it were opened in another encoding, or convert the file to another encoding such as "UTF-8".
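The effect of a mismatch between the file's encoding and the way it is read can be sketched with java.util.Properties. This is an illustration under stated assumptions: the class name and the "greeting" key are hypothetical, and a byte array stands in for a properties file saved as UTF-8. Properties.load(InputStream) decodes the bytes as ISO-8859-1, while Properties.load(Reader) uses whatever charset the reader was created with.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Loads properties from bytes that are known to be UTF-8.
    static Properties loadUtf8(byte[] fileBytes) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    // Loads the same bytes through the InputStream overload, which reads ISO-8859-1.
    static Properties loadLatin1(byte[] fileBytes) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(fileBytes));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a properties file saved as UTF-8; the key name is hypothetical.
        String expected = "canci\u00F3n";
        byte[] fileBytes = ("greeting=" + expected + "\n").getBytes(StandardCharsets.UTF_8);

        // Matching encodings: the accented letter survives.
        System.out.println(loadUtf8(fileBytes).getProperty("greeting").equals(expected));   // prints true
        // Mismatched encodings: the two UTF-8 bytes are read as two separate characters.
        System.out.println(loadLatin1(fileBytes).getProperty("greeting").equals(expected)); // prints false
    }
}
```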
