Generate CSV via Apache CSV in UTF-8

how to write CSV File in UTF-8 via Apache CSV?
I am trying to generate a CSV file with the following code. Files.newBufferedWriter() encodes text as UTF-8 by default, but when I open the generated file in Excel I see garbage characters.
I create CSVPrinter like this:
CSVPrinter csvPrinter = new CSVPrinter(Files.newBufferedWriter(Paths.get(filePath)), CSVFormat.EXCEL);
next I set headers
csvPrinter.printRecord(headers);
and next in loop I print values into writer like this
csvPrinter.printRecord("value1", "valu2", ...);
I also tried uploading the file to an online CSV lint validator, and it reports that I am using ASCII-8BIT instead of UTF-8. What did I do wrong?

Microsoft software tends to assume windows-12* or UTF-16LE charsets, unless the content starts with a byte order mark which the software will use to identify the charset. Try adding a byte order mark at the start of your file:
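try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath))) {
    // Byte order mark; the UTF-8 writer encodes it as the bytes EF BB BF
    writer.write('\ufeff');
    // CSVPrinter has no single-argument constructor, so pass the format explicitly
    CSVPrinter csvPrinter = new CSVPrinter(writer, CSVFormat.EXCEL);
    //...
}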
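To check that the byte order mark really ends up in the output, here is a minimal sketch that uses only the JDK. The class name BomCsvSketch and the method toUtf8WithBom are made up for illustration, and the CSV lines are assumed to be already formatted. It writes to memory instead of a file so the resulting bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Apache Commons CSV.
final class BomCsvSketch {

    /** Encodes pre-formatted CSV lines as UTF-8, preceded by a byte order mark. */
    static byte[] toUtf8WithBom(List<String> csvLines) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            // U+FEFF is written as the three bytes EF BB BF by the UTF-8 encoder
            writer.write('\uFEFF');
            for (String line : csvLines) {
                writer.write(line);
                writer.write("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

The same Writer could be handed to a CSVPrinter after the write('\uFEFF') call, as in the answer above.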

Related

CSVPrinter with break line characters

I'm using org.apache.commons.csv.CSVPrinter (Java 8) to produce a CSV text file from a DB RecordSet. My DB table has a description field in which the user can enter anything, including a new line!
When I import the CSV into Excel or Google Spreadsheet, each row whose description contains a new line character corrupts the CSV structure, obviously.
Should I replace/remove these characters manually, or is there a way to configure CSVPrinter to remove them automatically?
Thank you all in advance.
F
Edit: here is a code snippet:
CSVFormat csvFormat = CSVFormat.DEFAULT.withRecordSeparator("\n").withQuoteMode(QuoteMode.ALL).withQuote('"');
CSVPrinter csvPrinter = new CSVPrinter(csvContent, csvFormat);
// prepare a list of strings gathered from the DB. I explicitly use a list of String because I need to perform some text editing on the DB content before writing it to the CSV
List<String> fasciaOrariaRecord = new ArrayList<>();
fasciaOrariaRecord.add(...);
fasciaOrariaRecord.add(...);
// ...
csvPrinter.printRecord(csvHeader);
// more rows...
csvPrinter.close();

Any value with line endings should be escaped with quotes. If your CSV library is not doing this for you automatically I'd recommend using univocity-parsers. In your particular case, there is a pre-built routine you can use to dump database contents into CSV.
Try this:
ResultSet resultSet = statement.executeQuery("SELECT * FROM table");
//Get a CSV writer settings object pre-configured for Excel
CsvWriterSettings writerSettings = Csv.writeExcel();
writerSettings.setHeaderWritingEnabled(true); //writes the column names to the output file
CsvRoutines routines = new CsvRoutines(writerSettings);
//use an encoding Excel likes
routines.write(resultSet, new File("/path/to/output.csv"), "windows-1252");
Hope this helps.
Disclaimer: I'm the author of this library. It's open source and free (Apache 2.0 license)
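The quoting rule mentioned at the start of this answer can be sketched in plain Java. This is a hand-rolled illustration, not the API of Commons CSV or univocity-parsers; the class name CsvQuotingSketch and its methods are made up. It assumes a comma delimiter and the common convention of doubling embedded quotes.

```java
import java.util.List;

// Hypothetical helper that shows how a field containing a line break is escaped.
final class CsvQuotingSketch {

    /** Wraps the value in quotes if it contains a comma, a quote or a line break. */
    static String quoteField(String value) {
        boolean needsQuotes = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        // Embedded quotes are doubled, then the whole field is wrapped in quotes
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Joins the quoted fields into one record, without a trailing record separator. */
    static String toRecord(List<String> fields) {
        StringBuilder record = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                record.append(',');
            }
            record.append(quoteField(fields.get(i)));
        }
        return record.toString();
    }
}
```

A reader that follows the same convention treats the quoted line break as part of the field rather than as the end of the record.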

Error Parsing due to CSV Differences Before/After Saving (Java w/ Apache Commons CSV)

I have a 37 column CSV file that I am parsing in Java with Apache Commons CSV 1.2. My setup code is as follows:
//initialize FileReader object
FileReader fileReader = new FileReader(file);
//initialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);
//initialize CSVParser object
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);
//Get a list of CSV file records
List<CSVRecord> csvRecords = csvFileParser.getRecords();
// process accordingly
My problem is that when I copy the CSV to be processed to my target directory and run my parsing program, I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Index for header 'Title' is 7 but CSVRecord only has 6 values!
at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:110)
at launcher.QualysImport.createQualysRecords(Unknown Source)
at launcher.QualysImport.importQualysRecords(Unknown Source)
at launcher.Main.main(Unknown Source)
However, if I copy the file to my target directory, open and save it, then try the program again, it works. Opening and saving the CSV adds back the commas needed at the end so my program won't complain about not having enough headers to read.
For context, here is a sample line of before/after saving:
Before (failing): "data","data","data","data"
After (working): "data","data",,,,"data",,,"data",,,,,,
So my question: why does the CSV format change when I open and save it? I'm not changing any values or encoding, and the behavior is the same for MS-DOS or regular .csv format when saving. Also, I'm using Excel to copy/open/save in my testing.
Is there some encoding or format setting I need to be using? Can I solve this programmatically?
Thanks in advance!
EDIT #1:
For additional context, when I first view an empty line in the original file, it just has the new line ^M character like this:
^M
After opening in Excel and saving, it looks like this with all 37 of my empty fields:
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,^M
Is this a Windows encoding discrepancy?

Maybe that's a compatibility issue with whatever generated the file in the first place. It seems that Excel accepts a blank line as a valid row with empty strings in each column, with the number of columns to match some other row(s). Then it saves it according to CSV conventions with the column delimiter.
(the ^M is the Carriage Return character; on Microsoft systems it precedes the Line Feed character at the end of a line in text files)
Perhaps you can deal with it by creating your own Reader subclass to sit between the FileReader and the CSVParser. Your reader will read a line, and if it is blank then return a line with the correct number of commas. Otherwise just return the line as-is.
For example:
class MyCSVCompatibilityReader extends BufferedReader
{
    public MyCSVCompatibilityReader(final FileReader fileReader)
    {
        // BufferedReader has no no-argument constructor, so wrap the reader here
        super(fileReader);
    }
    @Override
    public String readLine() throws IOException
    {
        final String line = super.readLine();
        // readLine() returns null at end of stream, so check before trimming
        if (line != null && line.trim().isEmpty())
        { return ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"; }
        else
        { return line; }
    }
}
There are other details to get right. This sketch only changes readLine(); the various read() methods still return the original characters, so a consumer that calls read() rather than readLine() will not see the padded rows, and you would have to make each of those methods work correctly as well. It might be easier, if the file will fit in memory easily, to just read the file, write the fixed version to a new StringWriter, and then pass a StringReader to the CSVParser.
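The in-memory alternative mentioned above could look roughly like this. It is a sketch that uses only the JDK; BlankLinePadder and padBlankLines are hypothetical names, and it assumes that no quoted field spans several lines, because it processes the input line by line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper: rewrites blank lines as rows of empty fields.
final class BlankLinePadder {

    /** Returns the input with every blank line replaced by columnCount empty fields. */
    static String padBlankLines(Reader source, int columnCount) {
        // A row of N empty fields needs N - 1 delimiters
        StringBuilder commas = new StringBuilder();
        for (int i = 1; i < columnCount; i++) {
            commas.append(',');
        }
        String emptyRow = commas.toString();

        StringBuilder fixed = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                fixed.append(line.trim().isEmpty() ? emptyRow : line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fixed.toString();
    }
}
```

The returned text can then be wrapped in a StringReader and given to the CSVParser in place of the FileReader. Note that this only pads blank lines; rows that are merely shorter than the header are passed through unchanged.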

Maybe try this:
Creates a parser for the given File.
parse(File file, Charset charset, CSVFormat format)
//import java.nio.charset.StandardCharsets;
//StandardCharsets.UTF_8
Note: This method internally creates a FileReader using FileReader.FileReader(java.io.File) which in turn relies on the default encoding of the JVM that is executing the code.

Or maybe try withAllowMissingColumnNames?
//initialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING).withAllowMissingColumnNames();

CSV encoding specification

I am creating a CSV and writing content in UTF-8 to support German and English by specifying encoding as below
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
The above is working fine till I add the below separator indication (;) in the header of CSV
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter line my CSV will be wrong, but when I include it the encoding fails and UTF-8 is no longer applied.
Is there any other keyword like "sep=" to specify in header of CSV to specify encoding?
I tried encoding="UTF-8" and it is not working.
Thanks.

You cannot open a UTF-8 CSV file with Excel 2007. Microsoft has no understanding of the word "standards". Because of this, it is notoriously difficult to generate a CSV file that opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest encoding with Microsoft's own "windows-1252", as it supports German characters. Don't use the header, and also look into using tab as a separator. Yes, I know the C stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.
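As a rough illustration of that suggestion, the sketch below encodes tab-separated rows as windows-1252 using only the JDK. The class and method names are made up, it assumes the running JVM provides the windows-1252 charset, and it does not escape fields that themselves contain tabs or line breaks.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper: tab-separated output in windows-1252.
final class Cp1252Sketch {

    // Assumption: this charset is available on the running JVM
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Joins each row with tabs, ends it with CRLF and encodes the result as windows-1252. */
    static byte[] encodeTabSeparated(List<List<String>> rows) {
        StringBuilder text = new StringBuilder();
        for (List<String> row : rows) {
            text.append(String.join("\t", row)).append("\r\n");
        }
        // Characters outside windows-1252 would be replaced by '?' here
        return text.toString().getBytes(WINDOWS_1252);
    }
}
```

In this encoding each German umlaut occupies a single byte, for example 0xFC for "ü", which is what Excel expects when it opens the file with a Western European locale.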

setting a UTF-8 in java and csv file [duplicate]

This question already has answers here:
How to add a UTF-8 BOM in Java?
(8 answers)
Closed 5 years ago.
I am using this code to add Persian words to a CSV file via OpenCSV:
String[] entries="\u0645 \u062E\u062F\u0627".split("#");
try{
CSVWriter writer=new CSVWriter(new OutputStreamWriter(new FileOutputStream("C:\\test.csv"), "UTF-8"));
writer.writeNext(entries);
writer.close();
}
catch(IOException ioe){
ioe.printStackTrace();
}
When I open the resulting csv file, in Excel, it contains "ứỶờịỆ". Other programs such as notepad.exe don't have this problem, but all of my users are using MS Excel.
Replacing OpenCSV with SuperCSV does not solve this problem.
When I typed Persian characters into csv file manually, I don't have any problems.

It took some time, but I found a solution to your problem.
First I opened notepad and wrote the following line: שלום, hello, привет
Then I saved it as file he-en-ru.csv using UTF-8.
Then I opened it with MS excel and everything worked well.
Now, I wrote a simple java program that prints this line to file as following:
PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
w.print(line);
w.flush();
w.close();
When I opened this file using Excel I saw gibberish.
Then I read the contents of both files and (as expected) saw that the file generated by Notepad contains a 3-byte prefix (decimal and hex):
239 EF
187 BB
191 BF
So, I modified my code to print this prefix first and the text after that:
String line = "שלום, hello, привет";
OutputStream os = new FileOutputStream("c:/temp/j.csv");
os.write(239);
os.write(187);
os.write(191);
PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
w.print(line);
w.flush();
w.close();
And it worked! I opened the file using excel and saw text as I expected.
Bottom line: write these 3 bytes before writing the content. This prefix indicates that the content is in 'UTF-8 with BOM' (otherwise it is just 'UTF-8 without BOM').
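The comparison of the two files described above comes down to looking at the first three bytes. A small sketch of that check, using only the JDK and hypothetical names (BomCheck, hasUtf8Bom, decodeUtf8SkippingBom):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting the start of a file's bytes.
final class BomCheck {

    /** Returns true if the data starts with the bytes EF BB BF (239 187 191). */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Decodes the data as UTF-8, leaving out the byte order mark if one is present. */
    static String decodeUtf8SkippingBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

The bytes of a file can be obtained with Files.readAllBytes and passed to hasUtf8Bom to see which of the two variants it is.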

Unfortunately, CSV is a very ad hoc format with no metadata and no real standard that would mandate a flexible encoding. As long as you use CSV, you can't reliably use any characters outside of ASCII.
Your alternatives:
Write to XML (which does have encoding metadata if you do it right) and have the users import the XML into Excel.
Use Apache POI to create actual Excel documents.

Excel doesn't use UTF-8 to open CSV files. That's a known problem. The actual encoding used depends on the locale settings of Microsoft Windows. With a German locale, for example, Excel would open a CSV file with CP1252.
You could create an Excel file containing some Persian characters and save it as a CSV file. Then write a small Java program to read this file and test some common encodings. That's how I figured out the correct encoding for German umlauts in CSV files.
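The "small Java program" could be as simple as decoding the same bytes with several candidate charsets and comparing the results by eye. A sketch using only the JDK; EncodingProbe and decodeWithEach are made-up names, and the candidate list is up to the caller.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: shows how the same bytes look under different charsets.
final class EncodingProbe {

    /** Decodes the data once per charset name and returns the results in the given order. */
    static Map<String, String> decodeWithEach(byte[] data, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            // Charset.forName throws an unchecked exception if the JVM does not know the name
            results.put(name, new String(data, Charset.forName(name)));
        }
        return results;
    }
}
```

For example, the bytes of the saved CSV file can be read with Files.readAllBytes and passed in together with names such as "windows-1252", "windows-1256" and "UTF-8"; the charset whose output shows the expected characters is the one Excel used.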

Convert from Codepage 1252 (Windows) to Java, in Java

I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows 1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings obtained from the Excel file already look like some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions with different codepages etc given below, tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As reference I always printed an "ö" string hand coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkBookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.

WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.

If none of the answers above solves the problem, this might do the trick:
String myOutput = new String(myInput, "UTF-8");
This decodes myInput, which must be a byte[], as UTF-8, so it only helps if the incoming bytes really are UTF-8 encoded.

When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.

You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.

"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.

FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.

Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
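This can be checked with a few lines of plain Java. The sketch below (hypothetical class MojibakeDemo) prints the UTF-8 bytes of a string as hex and shows what the same bytes turn into when they are decoded with a different charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper for looking at encoded bytes.
final class MojibakeDemo {

    /** Formats the bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", data[i] & 0xFF));
        }
        return out.toString();
    }

    /** Encodes the text with one charset and decodes the bytes with another. */
    static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }
}
```

Calling hex on the UTF-8 bytes of "ö" gives "C3 B6", and decoding those two bytes as windows-1252 gives the two characters "Ã¶", which is the typical look of UTF-8 text read with the wrong single-byte encoding.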
