I have an Excel sheet with some non-English characters in it, and when I try to grab the contents via
sheet.getColumn(column)[row].getContents()
it returns the string with the replacement character \uFFFD instead of the non-English character, which I was then going to convert to a Unicode escape using StringEscapeUtils.escapeJava.
//"L\u00F6schen" - correct
return StringEscapeUtils.escapeJava("Löschen");
//"L\uFFFDschen" - incorrect
return StringEscapeUtils.escapeJava(sheet.getColumn(column)[row].getContents());
//"L�schen" - incorrect
System.out.print(sheet.getColumn(column)[row].getContents());
This was really frustrating and it seems that jexcelapi is missing a lot of support.
Went with Apache POI instead and it worked great with no issues.
Try to set encoding through WorkbookSettings when initializing Workbook.
For example:
WorkbookSettings settings = new WorkbookSettings();
settings.setEncoding("Your java charset name");
Workbook workbook = Workbook.getWorkbook(source, settings);
Then the getContents() method should return the correct content of the cell.
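To see where \uFFFD comes from in the first place, here is a small sketch that uses only the JDK and no jexcelapi at all. It assumes the cell text is stored as single-byte Latin-1/Cp1252 data and gets decoded with the wrong charset; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Decodes raw bytes with the given charset.
    // new String(...) silently replaces malformed input with U+FFFD.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Löschen" as single-byte Latin-1: the 'ö' is the single byte 0xF6
        byte[] raw = "L\u00F6schen".getBytes(StandardCharsets.ISO_8859_1);

        // 0xF6 is not valid UTF-8 here, so the decoder substitutes U+FFFD
        String wrong = decode(raw, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in keeps the 'ö'
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.indexOf('\uFFFD')); // position of the replacement character
        System.out.println(right.equals("L\u00F6schen"));
    }
}
```

Once the replacement character is in the String, the original byte is gone, which is why the encoding has to be set before the workbook is read.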
In my Excel file that I am trying to convert using Apache POI, I have a cell that has numeric value as -3.97819466831428 and Custom format as "0.0 p.p.;(0.0 p.p.)". So, in Excel the value that is displayed is "(4.0 p.p.)"
When I convert the same using POI library, I get the output as: "(4.0 p"
How can I get the same value as in Excel: (4.0 p.p.) ?
The way I am using DataFormatter is:
val = dataFormatter.formatRawCellContents(cell.getNumericCellValue(), style.getDataFormat(), style.getDataFormatString());
I believe the problem is coming from the usage of "p.p." in the data format string, especially the dots. When I print the data format string from POI using style.getDataFormatString(), I get the format as "0.0\ \p.\p.;(0.0\ \p.\p.\)".
Even if I manually change the format string to "0.0;(0.0\ \p\.\p\.\)", the result is still the same. So I am out of ideas now. How can I get the full result back from DataFormatter, as Excel shows it: "(4.0 p.p.)"?
Another question that I have is: Is it possible using Apache POI to get the actual displayed value in Excel file? Like in this case, is it possible to get the value "(4.0 p.p.)" directly from Excel without having to apply any data formatting in POI?
This is an error in apache poi's DataFormatter while translating the Excel number format 0.0\ \p.\p.;\(0.0\ \p.\p.\) into a java.text.Format. The correct corresponding java.text.Format would be a new java.text.DecimalFormat("0.0' p.p.';(0.0' p.p.')"). But apache poi's DataFormatter fails to translate this properly.
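The DecimalFormat pattern named above can be checked on its own, without apache poi. The following is a minimal sketch; the class name is made up, and pinning the symbols to Locale.ROOT is an assumption made only so that the decimal separator is a dot on every machine.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class PercentagePointFormat {
    // The single quotes mark " p.p." as literal text, so its dots are not
    // read as decimal separators. The part after ';' is the negative pattern.
    static String format(double value) {
        DecimalFormat df = new DecimalFormat(
                "0.0' p.p.';(0.0' p.p.')",
                DecimalFormatSymbols.getInstance(Locale.ROOT)); // assumption: fixed symbols
        return df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(-3.97819466831428)); // prints (4.0 p.p.)
        System.out.println(format(3.97819466831428));  // prints 4.0 p.p.
    }
}
```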
You should file a bug report to apache poi about this. In that bug report you should provide a working example (full Java code and a sample Excel file) to reproduce the issue.
As a workaround one can tell DataFormatter how single special Excel number formats shall be translated. For this use the method DataFormatter.addFormat.
Example:
import org.apache.poi.ss.usermodel.*;
import java.io.FileInputStream;
class DataFormatterAddFormat {
public static void main(String[] args) throws Exception {
Workbook workbook = WorkbookFactory.create(new FileInputStream("ExcelExample.xlsx"));
DataFormatter dataFormatter = new DataFormatter();
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
dataFormatter.addFormat("0.0 p.p.;(0.0 p.p.)",
new java.text.DecimalFormat("0.0' p.p.';(0.0' p.p.')"));
dataFormatter.addFormat("0.0\" p.p.\";\\(0.0\" p.p.\"\\)",
new java.text.DecimalFormat("0.0' p.p.';(0.0' p.p.')"));
dataFormatter.addFormat("0.0\\ \\p.\\p.;\\(0.0\\ \\p.\\p.\\)",
new java.text.DecimalFormat("0.0' p.p.';(0.0' p.p.')"));
Sheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
for (Cell cell : row) {
String value = dataFormatter.formatCellValue(cell, formulaEvaluator);
System.out.println(value);
}
}
workbook.close();
}
}
This is now able to translate the added special number formats correctly. Of course this is not really the final solution, since the need here is to catch all possible Excel formats which have to be translated. That's why I suggest filing a bug report with apache poi.
Btw.: The Excel format 0.0" p.p.";\(0.0" p.p."\) would be more general for this. It avoids confusing the dot (.) in p.p. with the decimal separator.
To your question about getting the formatted value directly from the Excel file: This is not possible. All Excel versions store values and styles separately. Numeric values are always stored as floating point values in double precision. Number formats for those values are stored in a separate styles section of the file. So the best practice for getting cell values styled as in Excel using apache poi is to use DataFormatter.formatCellValue as shown in my code sample.
I am getting a question mark (?) instead of multiple whitespace characters in the output Excel file. I am using Apache POI 3.7. A single space works fine.
For example:
if my input is "a  b" then the generated output is "a? b".
Here a and b have two spaces between them.
This code snippet works just fine.
Can you compare with your own code and post some code sample if you still have the problem ?
Workbook book = new HSSFWorkbook();
Sheet sheet = book.createSheet();
Row oRow = sheet.createRow(1);
Cell oCell = oRow.createCell(1);
oCell.setCellValue("a b");
OutputStream out = new FileOutputStream("c:\\temp\\test.xls");
book.write(out);
out.close();
Try to open your generated spreadsheet output in Microsoft Excel.
It is an encoding issue. If your input contains multiple whitespace characters, OpenOffice sometimes shows them as "?".
For future reference, this solved my problem. As Eric pointed out, one should first find out which character codes are causing trouble; in my particular case they were zeroes.
String s = getStringFromSource();
s = s.replace('\u0000', '\u0020'); // check values with dec to hexa first, u0020 means 32
cell.setValue(s);
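One way to find out which character codes are causing trouble is to list the control characters in the string before writing it to the cell. This is only a sketch with made-up names; it treats every ISO control character as suspicious, which is an assumption and may be too strict if tabs or line breaks are intended.

```java
import java.util.List;
import java.util.stream.Collectors;

class ControlCharCleaner {
    // Returns the code points of all ISO control characters (such as U+0000) in s.
    static List<Integer> controlCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isISOControl)
                .boxed()
                .collect(Collectors.toList());
    }

    // Replaces NUL characters with ordinary spaces, as in the snippet above.
    static String replaceNul(String s) {
        return s.replace('\u0000', '\u0020');
    }

    public static void main(String[] args) {
        String s = "a\u0000b";
        System.out.println(controlCodePoints(s)); // prints [0]
        System.out.println(replaceNul(s));        // prints a b
    }
}
```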
I have a problem of character encoding while using JExcel.
My app creates an Excel document from a template and fills it with data from a database (filled with user input from current and previous sessions) before sending it to the user.
In the final document, non-ASCII characters FROM THE TEMPLATE such as é, è, à, or ° are not rendered properly and are replaced by � (this happens in the generated document only; they appear properly in the template), while those from the database are properly encoded.
I use UTF-8 for user input (and output to the viewing layer) as well as database storage.
I use this code in the class that generates the file:
private void createFile(Arguments...)
throws IOException, BiffException, RowsExceededException, WriteException
{
File XLSFile = new File(MyPath);
WorkbookSettings XLSSettings = new WorkbookSettings();
XLSSettings.setEncoding(Constants.TEMPLATE_ENCODING);
// Constants.java is a class containing only app-wide constants declared as public static final
Workbook template = Workbook.getWorkbook(
new File(Constants.TEMPLATE_PATH));
WritableWorkbook userDocument =
Workbook.createWorkbook(XLSFile, template, XLSSettings);
template.close();
WritableSheet sheet = userDocument.getSheet(0);
...
Code that fills my workbook and sheet by creating new Labels and
adding them to my WritableSheet with sheet.add(Label)
...
userDocument.write();
userDocument.close();
}
Constants.TEMPLATE_ENCODING has been set to "Cp1252", as suggested in this question: Encoding problem in JExcel, but to no avail.
Trying to change it to "UTF-8" produced no visible change either.
The application works otherwise just fine at every level.
I figured it might be a problem of setting the proper encoding when opening and copying the template, and tried to change this line
Workbook template = Workbook.getWorkbook(new File(Constants.TEMPLATE_PATH));
to
Workbook template = Workbook.getWorkbook(new File(Constants.TEMPLATE_PATH), XLSSettings);
but it produces an ArrayIndexOutOfBoundsException in java.lang.System.arraycopy, propagating from the line userDocument.write(); via
java.lang.ArrayIndexOutOfBoundsException
java.lang.System.arraycopy(Native Method)
jxl.biff.StringHelper.getBytes(StringHelper.java:127)
jxl.write.biff.WriteAccessRecord.<init>(WriteAccessRecord.java:59)
jxl.write.biff.WritableWorkbookImpl.write(WritableWorkbookImpl.java:726)
com.mypackage.MyClass.createFile(MyClass.java:337)
Anyone ever encountered the problem and know how to fix it ?
I was facing this problem too. The solution, for me, was very easy. I had to apply my WorkbookSettings only to the TEMPLATE workbook and not to the new file.
//Load template workbook with settings
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("Cp1252");
Workbook templateWorkbook = Workbook.getWorkbook(this.context.getAssets().open("template.xls"), ws);
//Create new workbook from templateWorkbook without settings
this.workbook = Workbook.createWorkbook(new File(this.location), templateWorkbook);
Found at: Android and JXL : ArrayIndexOutOfBoundException when create WritableWorkbook
Regards
I am creating a CSV and writing content in UTF-8 to support German and English by specifying encoding as below
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
The above is working fine till I add the below separator indication (;) in the header of CSV
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter ; my CSV will be wrong, but when I include it the encoding fails and UTF-8 is no longer applied.
Is there any other keyword like "sep=" to specify in header of CSV to specify encoding?
I tried encoding="UTF-8" and it is not working.
Thanks.
You cannot open a UTF-8 CSV file with Excel 2007. Microsoft have no understanding of the word "standards". Because of this, it is notoriously difficult to generate a CSV file which opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest encoding with Microsoft's own "windows-1252", as it supports German characters. Don't use the header, and also look into using tab as a separator. Yes, I know the c stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.
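The suggestion above could look roughly like the following sketch: tab-separated fields, no "sep=" header line, encoded as windows-1252. The class and method names are hypothetical, the CRLF line ending is an assumption, and no quoting is done, so it only works while the fields contain no tabs or line breaks.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

class ExcelFriendlyCsv {
    // Joins each row with tabs, ends it with CRLF, and encodes everything
    // as windows-1252 so that German characters become single bytes.
    static byte[] toBytes(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            sb.append(String.join("\t", row)).append("\r\n");
        }
        return sb.toString().getBytes(Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"Aktion", "Anzahl"},
                new String[] {"L\u00F6schen", "1"});
        // These bytes would be written to the output stream unchanged
        System.out.println(toBytes(rows).length + " bytes");
    }
}
```

The resulting bytes can be written straight to the OutputStream, without wrapping it in a UTF-8 OutputStreamWriter.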
I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings gotten from the Excel file look pretty much like it already is some kind of unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions with different codepages etc given below, tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As reference I always printed an "ö" string hand coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answers above solves the problem, this might do the trick:
String myOutput = new String(myInput, "UTF-8");
This decodes the incoming bytes as UTF-8, so myInput has to be a byte[] that really contains UTF-8 data.
When Java parses a file it uses some encoding to read the bytes on the disk and create characters in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
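The writing snippet above can also be written with try-with-resources, so that the writer is flushed and closed even when an exception occurs. In this sketch a ByteArrayOutputStream stands in for the FileOutputStream so the resulting bytes can be inspected; the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetWriter {
    // Writes text through an OutputStreamWriter with an explicit charset
    // and returns the bytes that ended up in the stream.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The same String yields different bytes depending on the charset
        System.out.println(encode("\u00F6", StandardCharsets.UTF_8).length);  // 2 bytes
        System.out.println(encode("\u00F6", StandardCharsets.UTF_16).length); // 4 bytes, byte-order mark included
    }
}
```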
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
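This is easy to verify with a few lines of plain Java. The sketch below (class and method names are made up) prints the bytes a correctly decoded "ö" turns into under different charsets. Seeing C3 B6 in the redirected output therefore suggests that the String was fine and was simply written out as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Encodes s with the given charset and renders the bytes as hex, e.g. "C3 B6".
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00F6", StandardCharsets.UTF_8));      // prints C3 B6
        System.out.println(hex("\u00F6", StandardCharsets.ISO_8859_1)); // prints F6
    }
}
```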