CSV encoding specification

CSV encoding specification - java

I am creating a CSV and writing content in UTF-8 to support German and English by specifying encoding as below
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
The above is working fine till I add the below separator indication (;) in the header of CSV
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter ; my CSV will be wrong but when I inclde this the encoding is failing and UTf-8 not in place.
Is there any other keyword like "sep=" to specify in header of CSV to specify encoding?
I tried encoding="UTF-8" and it is not working.
Thanks.

You cannot open a UTF8 csv file with Excel 2007. Microsft have no understanding of the word "standards". Because of this, it is notoriously difficult to generate a csv file which opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest using encoding with Microsofts own "windows 1252" as it supports German characters. Don't use the header, and also look in to using tab as a separator. Yes I know the c stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.

Related

Unzip files that contain chinese characters

I have a zip file.It contains some files.Files contain chinese characters so I used
ZipInputStream zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
fileOutput.write(zipStream.read());
}
Extraction runs succesfully.After that I want to use encodingDetect method to find encoding but now service is not running.It returns nomatch. If I send files directly to service,The service is running.It find charset properly like UTF-8.
I guess that Charset.forName("ISO-8859-1")extract files but format is corrupted.Do you have any idea?

The problem is the Charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires as special format for the multi-byte sequences, and evidently there are wrong "multibyte" sequences.
ISO-8859-1 is a single byte enconding, accepting garbage.
What you should do is to try the small number of Chinese Charsets, so the file name strings are filled correctly. Java String contains Unicode, so can hold any Charset. The help from someone talking Chinese probably would make sense.
And then try writing files with those names. If not successful on your PC, you must use artificial file names, maybe transliteration from Chinese.
A translation table from original Chinese file name to actual file name may be created
as UTF-8 text file, maybe with a BOM, '\uFEFF` at the begin-of-file.

ISO-8859-1 charset most definitely does not support Chinese language. Use UTF-8 instead of ISO-8859-1

export CSV with Hebrew char that can be opened by Excel

I know there is a lot of similar question but I still cannot find a solution,
my code:
resp.setContentType("text/csv; charset=UTF-8");
resp.addHeader("Content-Disposition", "attachment; filename=file.csv");
resp.addHeader("Pragma", "no-cache");
resp.addHeader("Expires", "0");
resp.addHeader("Content-Encoding", "UTF-8");
resp.getWriter().write(sb.toString());
The result is opened as a gibrish text instead of hebrew in excell.
How should I change this text so the result CSV can be opened in Excell (without importing it to excell)?
i.e. what charset should I used, and what is the correct code to generate it?

Do
sb.insert(0, '\uFEFF');
resp.getWriter().write(sb.toString());
This inserts a BOM character, a zero-width space, to mark the text as Unicode. It is redundant, even ugly. But in this way Excel & Notepad won't mistake the encoding for the platform encoding.
Caveat:
Now cell A1 cannot contain a number.

Unfortunately the encoding headers in your response won't be associated with the attachment; they're only for use when reading the response body.
Excel is oddly bad at handling UTF-8 encoded CSV files. Here's a good answer from the past on convincing it to behave; the other option would be to re-encode your CSV in Windows-1252.

How will append a utf-8 string to a properties file

How will append a utf-8 string to a properties file. I have given the code below.
public static void addNewAppIdToRootFiles() {
Properties properties = new Properties();
try {
FileInputStream fin = new FileInputStream("C:\Users\sarika.sukumaran\Desktop\root\root.properties");
properties.load(new InputStreamReader(fin, Charset.forName("UTF-8")));
String propertyStr = new String(("قسيمات").getBytes("iso-8859-1"), "UTF-8");
BufferedWriter bw = new BufferedWriter(new FileWriter(directoryPath + rootFiles, true));
bw.write(propertyStr);
bw.newLine();
bw.flush();
bw.close();
fin.close();
} catch (Exception e) {
System.out.println("Exception : " + e);
}
}
But when I open the file, the string I have written "قسيمات" to the file shows as "??????". Please help me.

OK, your first mistake is getBytes("iso-8859-1"). You should not do these manipulations at all. If you want to write unicode text to file you should open the file and write text. The internal representations of strings in java is unicdoe, so everything will be writter correctly.
You have to care about charset when you are reading file. BTW you do it correctly.
But you do not have to use file manipulation tools to append something to properites file. You can just call prop.setProperty("yourkey", "yourvalue") and then call prop.store(new FileOutputStream(youfilename)).

Ok, I have checked the specification for Properties class. If you use following methods: load() for input stream or store() for output stream, the input/output stream for the file is assumed a iso-8859-1 encoding by default. Therefore, you have to be cautious with a few things:
Some characters in French, German and Portuguese are iso-8859-1 (Latin1) compatible, which they normally work fine in iso-8859-1. So, you don't have to worry that much. But, others like Arabic and Hebrew characters are not Latin1 compatible, so you need to be careful with the choice of encoding for these characters. If you have a mix of characters of French and Arabic, you have no choice but to use Unicode.
What is your current input file's encoding if it already exists to be used with Properties's load() method? If it is not the default iso-8859-1, then you need to figure out what it is first before opening the file. If infile file encoding is UTF-8, then use properties.load(new InputStreamReader(new FileInputStream("infile"), "UTF8"))); Then, stick to this encoding till the end. Match the file encoding with the character encoding as well.
If it is a new input file to be used with Properties's load() method, choose the file encoding that works with your character's encoding. Then, stick to this encoding till the end.
Your expected output file's encoding shall be the same with what is used from Properties's load() method before you use the store() method. If it is not the default iso-8859-1, then you need to figure out what it is first before saving the file. Stick to this encoding till the end. Match the file encoding with the character encoding as well. If outfile file encoding is UTF-8, then specifically use UTF-8 encoding when saving the file. But, if the store() method still ends up with an outfile in iso-8859-1 encoding, then you need to do what is suggested next...
If you stick to the default iso-8859-1, it works fine for characters like French. But, if the characters are not iso-8859-1 or Latin1 encoding compatible, you need to use Unicode escape characters instead as an alternative: for example:\uFE94 for the Arabic ﺔ character. For me, this escaping is too tedious and normally we use native2ascii utility provided in JRE or JDK to convert a properties file from one encoding to another encoding. Of course, there are other ways...just check the references below...For me, it is better to use a properties file in XML format since by default it is UTF-8...
References:
Java properties UTF-8 encoding in Eclipse
Setting the default Java character encoding?

Spanish character óé display error in Java properties

When I process a properties file with the Spanish characters ó and é, characters are displayed as ?. I tried different ways to fix this, but still fail:
I tried to use \uxxxx
I tried to use InputStreamReader with encoding UTF-8
I tried to convert string to bytes and then create a new String from those bytes:
new String( val.getBytes("UTF-8"), "UTF-8")
Nothing worked. What should I do next to fix this issue? Japanese and Russian are still OK.

The properties file needs to be in the proper encoding. By default some IDE's like eclipse saves the content using CP1252 but you are requiring the file as UTF-8. This is also required for your java code.
If you try to use \uxxxx characters but your application by default is working with CP1252 the conversion of the escape code result in a bad character.
If you use the InputStreamReader to force the reading as UTF-8 but your code and/or your file are not using UTF-8 support result in a bad character.
If you use UTF-8 conversion of an string but your source code is CP1252 you should have the same problem.
Related previous answer about source code : Should source code be saved in UTF-8 format
Notepad ++ Has a menu to view the format of the file and change it in "Format" menu you should view the file as if it should be opened by other formarts or you should convert the file to other file formats like "UTF-8"

setting a UTF-8 in java and csv file [duplicate]

This question already has answers here:
How to add a UTF-8 BOM in Java?
(8 answers)
Closed 5 years ago.
I am using this code for add Persian words to a csv file via OpenCSV:
String[] entries="\u0645 \u062E\u062F\u0627".split("#");
try{
CSVWriter writer=new CSVWriter(new OutputStreamWriter(new FileOutputStream("C:\\test.csv"), "UTF-8"));
writer.writeNext(entries);
writer.close();
}
catch(IOException ioe){
ioe.printStackTrace();
}
When I open the resulting csv file, in Excel, it contains "ứỶờịỆ". Other programs such as notepad.exe don't have this problem, but all of my users are using MS Excel.
Replacing OpenCSV with SuperCSV does not solve this problem.
When I typed Persian characters into csv file manually, I don't have any problems.

I spent some time but found solution for your problem.
First I opened notepad and wrote the following line: שלום, hello, привет
Then I saved it as file he-en-ru.csv using UTF-8.
Then I opened it with MS excel and everything worked well.
Now, I wrote a simple java program that prints this line to file as following:
PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
w.print(line);
w.flush();
w.close();
When I opened this file using excel I saw "gibrish."
Then I tried to read content of 2 files and (as expected) saw that file generated by notepad contains 3 bytes prefix:
239 EF
187 BB
191 BF
So, I modified my code to print this prefix first and the text after that:
String line = "שלום, hello, привет";
OutputStream os = new FileOutputStream("c:/temp/j.csv");
os.write(239);
os.write(187);
os.write(191);
PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
w.print(line);
w.flush();
w.close();
And it worked! I opened the file using excel and saw text as I expected.
Bottom line: write these 3 bytes before writing the content. This prefix indicates that the content is in 'UTF-8 with BOM' (otherwise it is just 'UTF-8 without BOM').

Unfortunately, CSV is a very ad hoc format with no metadata and no real standard that would mandate a flexible encoding. As long as you use CSV, you can't reliably use any characters outside of ASCII.
Your alternatives:
Write to XML (which does have encoding metadata if you do it right) and have the users import the XML into Excel.
Use Apache POI to create actual Excel documents.

Excel doesn't use UTF8 to open CSV files. Thats a known problem. The actual encoding used depends on the locale settings of Microsoft Windows. With a German lcoale for example Excel would open a CSV file with CP1252.
You could create an Excel file containing some persian characters and save it as an CSV file. Then write a small Java program to read this file and test some common encodings. Thats the way I used to figure out the correct encoding for German umlauts in CSV files.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

CSV encoding specification - java

Related

Unzip files that contain chinese characters

export CSV with Hebrew char that can be opened by Excel

How will append a utf-8 string to a properties file

Spanish character óé display error in Java properties

setting a UTF-8 in java and csv file [duplicate]

Categories

Resources