I am using the Apache POI library to work with docx, pptx, xlsx and similar files. Strangely, MS Office stores all embedded files, such as .wma and .vsd, as .bin. Is there any way to extract them back to the original files with their proper extensions rather than .bin?
We have some legacy data in .xls (HSSF) format that we are converting to .xlsx (XSSF) format using the Apache POI library. It was all working very well until we started seeing many org.apache.poi.poifs.filesystem.NotOLE2FileException errors. On closer examination we realized that the files throwing this exception are not actually Excel files (despite the misleading .xls extension) but Single File Web Page files (web archive, X-Document-Type: Workbook).
Question: Is there any open-source Java library that converts "X-Document-Type: Workbook" files to Excel?
Addendum: Clarification, as sought by #kiwiwings
No, the files are not in "XML Workbook" format. They are MIME documents with the X-Document-Type: Workbook declaration. Each part is a standard HTML file with its own table.
The files are given the .xls extension and Excel is able to open them, albeit after issuing the following warning:
The file you are trying to open, 'blah-blah-blah.xls', is in a different format than specified by the file extension. Verify that the file is not corrupted and is from a trusted source before opening the file. Do you want to open the file now?
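Since the problem files only look like .xls on the outside, one way to keep the conversion batch from failing is to inspect the first bytes of each file before handing it to POI. The following is a minimal sketch using only the JDK (Java 11+ for `readNBytes`). The class and method names are hypothetical, and the MIME check is an assumption: it simply looks for the `MIME-Version:` or `X-Document-Type:` header text near the start of the file. It does not convert anything; it only tells real OLE2 workbooks apart from the web-archive ones.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpreadsheetSniffer {

    enum Kind { OLE2, ZIP, MIME_TEXT, UNKNOWN }

    // Signature of OLE2 compound documents (binary .xls, .doc, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of ZIP archives, which is what .xlsx files are.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Classifies a file by the bytes found at its start. */
    static Kind classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        // Assumption: web-archive files carry their MIME headers near the top.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.contains("MIME-Version:") || text.contains("X-Document-Type:")) {
            return Kind.MIME_TEXT;
        }
        return Kind.UNKNOWN;
    }

    /** Reads the first 512 bytes of the file and classifies them. */
    static Kind classify(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(512));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Files classified as `OLE2` can go through the existing HSSF conversion, while `MIME_TEXT` files can be routed to whatever handling is chosen for the web-archive format.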
I'm learning about data-driven testing using Selenium and Excel. I'm taking an online course that has asked us to add the Apache POI poi and poi-ooxml dependencies in Maven.
I'm struggling to understand what the differences between the two are. Are both required in order to retrieve data in Excel and pass these to our tests?
Thanks
Excel files have a long history.
Excel 97-2003 workbook:
This is a legacy Excel file that follows a binary file format. The file extension of the format is .xls.
In Apache POI terms, the Excel 97-2003 format is called HSSF (Horrible SpreadSheet Format), a nod to how complex and quirky the binary format is.
The poi jar contains the code that handles these files.
Excel 2007+ workbook:
This is the default XML-based file format for Excel 2007 and later versions. It follows the Office Open XML (OOXML) format, a zipped, XML-based file format developed by Microsoft for representing office documents. The file extension of the format is .xlsx (DOCX and PPTX are other OOXML-based examples).
In Apache POI terms, the Excel 2007+ format is called XSSF (XML SpreadSheet Format). It is the successor to the format handled by HSSF and supports additional features. The code that handles these files lives in the poi-ooxml jar.
More reading
Although .xls is almost dead, some applications still use it, so both dependencies are required for backward compatibility.
Here is what Apache has to say:
Component | Application type | Maven artifactId | Notes
HSSF | Excel XLS | poi | For HSSF only; if common SS is needed, see below
Common SS | Excel XLS and XLSX | poi-ooxml | WorkbookFactory and friends all require poi-ooxml, not just core poi
You can read more on their official website: http://poi.apache.org/components/index.html#components
When I create an xlsx file with Apache POI, sometimes (when the file is big) the result cannot be opened by Apache POI itself, while MS Excel and LibreOffice Calc open it without problems.
When I try to open this workbook with Apache POI it says that
Zip bomb detected
I can open it only if I call ZipSecureFile.setMinInflateRatio(0) or resave it in LibreOffice (MS Excel doesn't help here).
How can I fix this? Why does POI create a file that it can't open?
Simply do as the error message suggests and set the limits differently via
ZipSecureFile.setMinInflateRatio(0)
You seem to have a rather special use case that produces a file similar to the ones malicious users could use to make your server crash, burn CPU or run out of memory. Apache POI has this limit to guard against that, but lets you change it if needed. So if the file does not come from untrusted users, you can safely adjust the limit to avoid the error message.
Excel or LibreOffice might optimize the file-content more than Apache POI does and thus produce a file that does not reach these limits.
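To make the limit more concrete, here is a minimal sketch of the idea behind an inflate-ratio check, using only `java.util.zip`. It is not POI's implementation, and the class and method names are hypothetical. It compresses two buffers and reports compressed size divided by uncompressed size. As far as I know, POI's default minimum ratio is 0.01; treat that number as an assumption and check the `ZipSecureFile` documentation for your version.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class InflateRatioSketch {

    /** Returns compressed size divided by uncompressed size for the given data. */
    static double compressionRatio(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();

        byte[] buffer = new byte[8192];
        long compressedBytes = 0;
        while (!deflater.finished()) {
            // Only the byte count matters here, so the output is discarded.
            compressedBytes += deflater.deflate(buffer);
        }
        deflater.end();
        return (double) compressedBytes / raw.length;
    }

    public static void main(String[] args) {
        // Highly repetitive content, similar in spirit to a large sheet
        // full of near-identical cells.
        byte[] repetitive = new byte[1_000_000];

        // Random content, which barely compresses at all.
        byte[] random = new byte[1_000_000];
        new Random(42).nextBytes(random);

        System.out.printf("repetitive: %.4f%n", compressionRatio(repetitive));
        System.out.printf("random:     %.4f%n", compressionRatio(random));
    }
}
```

The repetitive buffer ends up with a ratio well below 0.01, which is the kind of value that trips a zip bomb check, while the random buffer stays close to 1. A large, very uniform workbook can land on the wrong side of the threshold in the same way, even though it is perfectly legitimate.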
My application requires a reporting facility in Excel/CSV format. For large reports, the generated CSV is corrupt, although I am able to e-mail the generated CSV using SMTP.
I tried the following without success; your help on this is appreciated:
Changed the library to POI
Changed the library to JXL
Monitored for memory leaks
This is a web based application and the code is written in JSP.
POI is mainly for MS Office formats like xls, xlsx and doc. JXL is also for xls files. For CSV you should use a library built for it, such as OpenCSV.
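To show what a dedicated CSV library takes care of, here is a minimal JDK-only sketch of quoting fields and writing rows one at a time. It is not OpenCSV's API; the class and method names are hypothetical. It also rests on an assumption: the question does not say why the CSV is corrupt, and unescaped commas, quotes or line breaks in field values are only one common cause.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

public class CsvSketch {

    /** Quotes a field if it contains a comma, a double quote or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        // Embedded double quotes are doubled, as in RFC 4180.
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Writes a single row straight to the writer, so rows are never held in memory. */
    static void writeRow(Writer out, List<String> fields) throws IOException {
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(escape(fields.get(i)));
        }
        out.write("\r\n");
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter keeps the demo self-contained; a real report would
        // write to a buffered file or response writer instead.
        StringWriter out = new StringWriter();
        writeRow(out, List.of("id", "comment"));
        writeRow(out, List.of("1", "said \"hi\", then left"));
        System.out.print(out);
    }
}
```

Writing each row as it is produced, rather than building the whole report first, also keeps memory use flat for large reports.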
We are converting a C++ project to Java in which we generate reports with a ".doc" extension. The problem is that we don't use any third-party library to generate a real MS Word document; we simply write a file with a .doc extension. Everything works fine except that we can't find a way to add a header at the beginning of every page. Using line numbers is not an option. Is there any other way it can be done?
Thank you.
The Apache POI library might be of some help.
It has facilities to read and modify Microsoft's proprietary file formats, such as MS Word .doc and MS Excel .xls.