Event Based POI reading Excel with hidden columns

I downloaded an Excel file that has a hidden first column (column A) alongside other columns that are visible to the user. When I try to read this file with the POI SAX event API, it never processes any cells. In the startElement method we check if(name.equals("c")) { ......}, but this condition is never met while the file has the hidden first column.
However, when I make the hidden column visible, the file is read. Another observation: once I make the hidden column visible and then hide it again, the file is also read.
Please suggest what could be wrong.

Try the code here (reproduced below).
I ran it against an xlsx with a hidden column and against the same xlsx with the column unhidden, and in both cases it printed every column of the spreadsheet, including the cells in hidden columns. The code is a bit long, but I am including it anyway in case the link dies someday.
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class ExampleEventUserModel {
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// rId2 found by processing the Workbook
// Seems to either be rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("Processing new sheet:\n");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
XMLReader parser =
XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser"
);
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println(lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) throws Exception {
ExampleEventUserModel howto = new ExampleEventUserModel();
howto.processOneSheet(args[0]);
howto.processAllSheets(args[0]);
}
}
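One thing worth ruling out, and this is a guess rather than a confirmed diagnosis: files written by tools other than Excel sometimes put a namespace prefix on the sheet elements (for example `<x:c>` instead of `<c>`). In that case the qualified name passed to startElement is "x:c", so name.equals("c") never matches, and re-saving the file in Excel would rewrite it without the prefix. A minimal sketch of matching on the local name instead, using only the JDK SAX parser on inline sample XML (the class and method names are made up for the illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PrefixedCellDemo {
    static final String NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // The same two cells, once without and once with a namespace prefix.
    static final String PLAIN = "<worksheet xmlns=\"" + NS + "\"><sheetData><row r=\"1\">"
            + "<c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2</v></c>"
            + "</row></sheetData></worksheet>";
    static final String PREFIXED = "<x:worksheet xmlns:x=\"" + NS + "\"><x:sheetData><x:row r=\"1\">"
            + "<x:c r=\"A1\"><x:v>1</x:v></x:c><x:c r=\"B1\"><x:v>2</x:v></x:c>"
            + "</x:row></x:sheetData></x:worksheet>";

    // Returns the "r" attribute of every cell element, whatever prefix the file uses.
    static List<String> cellRefs(String sheetXml) {
        final List<String> refs = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is filled in
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    // qName would be "x:c" for the prefixed file; localName is "c" in both cases
                    if ("c".equals(localName)) {
                        refs.add(attrs.getValue("r"));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(sheetXml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sheet XML", e);
        }
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(cellRefs(PLAIN));    // [A1, B1]
        System.out.println(cellRefs(PREFIXED)); // [A1, B1]
    }
}
```

To check whether this applies to your file, unzip the xlsx and look at the element names in xl/worksheets/sheet1.xml.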

Related

Apache POI reads wrong date format of Excel sheet by the XSSF reader

I want to read the displayed value of a cell in an Excel sheet. Currently my code returns the General (raw) value instead of the Custom-formatted value of the cell.
For example:
The input in the Excel sheet is a cell of type Custom --> 01/10/2250.
The output is converted to the General value of the cell --> 128110
(The cell is in format 'General'; this is an important point because it is this specific condition that causes the error we have.)
We are using XSSFReader and XMLReader and not Workbook, because of memory issues with Workbook. (We see a lot of answers on the web for Workbook but not for XSSFReader.)
Test cases
input --> 01/10/2250
Desired output -->01/10/2250
Actual output---->128110
input --> 12-25-2250
Desired output -->12-25-2250
Actual output---->128110
input --> 2250-25-12
Desired output -->2250-25-12
Actual output---->128195
Please note that the input patterns use both the hyphen (-) and the slash (/).
This is the code:
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class DateCellDemo {
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
InputStream sheet2 = r.getSheet("rId1");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("Processing new sheet:\n");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
XMLReader parser =
XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser"
);
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
//System.out.println("------ cellType :"+cellType);
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
System.out.println("Date Pattren : "+nextIsString);
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false; // reset so the closing </c> does not try to parse the text as an index
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println("Last_Content : "+lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) {
try {
DateCellDemo howto = new DateCellDemo();
howto.processOneSheet("C:\\Users\\Dell\\Downloads\\Source11_Data.xlsx");
} catch (Exception e) {
e.printStackTrace();
}
}
}

There is another Apache POI example (XLSX2CSV) that lets you stream the input xlsx but get access to formatted cell data. Your code is based on another sample which reads the raw XML from the xlsx file (and the raw XML has the dates in numeric format).
This is based on that sample:
https://github.com/pjfanning/poi-shared-strings-sample/blob/master/src/main/java/com/github/pjfanning/poi/sample/XLSX2CSV.java
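For background, the number in the raw XML is the date stored as a day count, and the cell's number format decides how it is displayed. A minimal sketch of the arithmetic with java.time, assuming the workbook uses the default 1900 date system and that dd/MM/yyyy is the pattern you want (the class and method names are made up for the illustration; XLSX2CSV gets the same result by applying the cell's own format):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class SerialDateDemo {
    // Day zero that lines up with Excel's 1900 date system for serials
    // from 61 (1900-03-01) upwards.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    // rawValue is the text of the <v> element; any time-of-day fraction is dropped.
    static LocalDate toLocalDate(String rawValue) {
        long wholeDays = (long) Math.floor(Double.parseDouble(rawValue));
        return EPOCH.plusDays(wholeDays);
    }

    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        System.out.println(toLocalDate("128110").format(fmt)); // 01/10/2250
        System.out.println(toLocalDate("128195").format(fmt)); // 25/12/2250
    }
}
```

This only helps if you already know which cells are dates; the raw value alone does not say so, which is why the formatted-value approach is usually the better fit.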

Getting an exception while reading an Excel file with duplicate column names using the spark-excel library. How to overcome this issue?

I am using the spark-excel (com.crealytics.spark.excel) library to read an Excel file. If the file has no duplicate column names, the library works fine. If any column name occurs more than once, it throws the exception below.
How can I overcome this error?
Is there any workaround?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `net territory`;
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)

I get the exception when using the spark-excel API:
StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("CGISAI", DataTypes.StringType, true), DataTypes.createStructField("SALES TERRITORY", DataTypes.StringType, true)});
SQLContext sqlcxt = new SQLContext(jsc);
Dataset<Row> df = sqlcxt.read()
.format("com.crealytics.spark.excel")
.option("path", "file:///"+siteinfofile)
.option("useHeader", "true")
.option("spark.read.simpleMode", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "false")
.option("addColorColumns", "False")
.option("sheetName", "sheet1")
.option("startColumn", 22)
.option("endColumn", 23)
//.schema(schema)
.load();
return df;
This is the code I am using, with the spark-excel library from com.crealytics.spark.excel.
I want a solution that identifies whether the Excel file has duplicate columns and, if it does, renames or eliminates them.
The workaround is as follows:
Convert the .xlsx file into a .csv file, then use Spark's default CSV API, which can handle duplicate column names by renaming them automatically.
Below is the code to convert an xlsx file to a csv file.
package com.huawei.java.tools;
/**
 * @author Nanaji Jonnadula
 */
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import static org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
public class ExcelXlsx2Csv {
private static class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
private StringBuffer lineBuffer = new StringBuffer();
/** * Destination for data */
private FileOutputStream outputStream;
public SheetToCSV(FileOutputStream outputStream) {
this.outputStream = outputStream;
}
@Override
public void startRow(int rowNum) {
/** * If there were gaps, output the missing rows: * outputMissingRows(rowNum - currentRow - 1); */
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
lineBuffer.delete(0, lineBuffer.length()); //clear lineBuffer
}
@Override
public void endRow(int rowNum) {
lineBuffer.append('\n');
try {
outputStream.write(lineBuffer.substring(0).getBytes());
} catch (IOException e) {
System.out.println("save data to file error at row number: " + currentRow);
throw new RuntimeException("save data to file error at row number: " + currentRow, e);
}
}
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
} else {
lineBuffer.append(',');
}
// gracefully handle missing CellRef here in a similar way as XSSFCell does
if (cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
int thisCol = (new CellReference(cellReference)).getCol();
int missedCols = thisCol - currentCol - 1;
// Emit one separator per skipped column so later values stay aligned
for (int i = 0; i < missedCols; i++) {
lineBuffer.append(',');
}
currentCol = thisCol;
if (formattedValue.contains("\n")) {
formattedValue = formattedValue.replace("\n", "");
}
// Double any embedded quotes so the quoted field stays parseable
formattedValue = "\"" + formattedValue.replace("\"", "\"\"") + "\"";
lineBuffer.append(formattedValue);
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
// Skip, no headers or footers in CSV
}
}
private static void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings,
SheetContentsHandler sheetHandler, InputStream sheetInputStream) throws Exception {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
public static void process(String srcFile, String destFile,String sheetname_) throws Exception {
File xlsxFile = new File(srcFile);
OPCPackage xlsxPackage = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ);
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(xlsxPackage);
XSSFReader xssfReader = new XSSFReader(xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
System.out.println(sheetName + " [index=" + index + "]");
if(sheetName.equals(sheetname_)){
FileOutputStream fileOutputStream = new FileOutputStream(destFile);
processSheet(styles, strings, new SheetToCSV(fileOutputStream), stream);
fileOutputStream.flush();
fileOutputStream.close();
}
stream.close();
++index;
}
xlsxPackage.close();
}
public static void main(String[] args) throws Exception {
ExcelXlsx2Csv.process("D:\\data\\latest.xlsx", "D:\\data\\latest.csv","sheet1"); //source , destination, sheetname
}
}
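If you would rather detect and rename the duplicates yourself, for instance on the header line of the converted CSV before Spark sees it, the renaming step could look roughly like this. It is a sketch only: the class name and the `_1`, `_2` suffix scheme are arbitrary choices made here, and names are compared case-insensitively on the assumption that Spark runs with its default case-insensitive column resolution.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class HeaderDeduper {
    // True if any name occurs more than once, ignoring case.
    static boolean hasDuplicates(List<String> headers) {
        Set<String> seen = new HashSet<>();
        for (String header : headers) {
            if (!seen.add(header.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Keeps the first occurrence as-is and appends _1, _2, ... to later ones,
    // skipping any suffix that would collide with a name already in use.
    static List<String> makeUnique(List<String> headers) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String header : headers) {
            String candidate = header;
            int suffix = 1;
            while (!used.add(candidate.toLowerCase(Locale.ROOT))) {
                candidate = header + "_" + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> headers = List.of("CGISAI", "net territory", "SALES TERRITORY", "net territory");
        System.out.println(hasDuplicates(headers)); // true
        System.out.println(makeUnique(headers));    // [CGISAI, net territory, SALES TERRITORY, net territory_1]
    }
}
```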

How to read specific columns using Apache POI Event API?

I want to read a large xls or xlsx file, e.g. a 70 MB file with 150,000+ rows.
I am able to load the entire Excel file using the code below, but I would like to fetch only specific columns, such as column C.
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class testclass {
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("Processing new sheet:\n");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
ContentHandler handler = (ContentHandler) new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
//System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println(lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) throws Exception {
String file_path = "D:/Data/Finance/Finance.xlsx";
testclass example = new testclass();
example.processAllSheets(file_path);
}
}
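One way to keep only column C is to look at the r attribute of each c element (for example "C12") and ignore the cell unless its column letters match. Below is a minimal sketch of that check, written against the JDK SAX parser and an inline sheet fragment so that it runs without POI; the class and method names are made up for the illustration. In the SheetHandler above the same test would go into startElement and endElement, and for cells with t="s" the collected value would still have to be looked up in the shared strings table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ColumnFilterDemo {
    static final String SAMPLE =
            "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"><sheetData>"
            + "<row r=\"1\"><c r=\"A1\"><v>1</v></c><c r=\"C1\"><v>30</v></c></row>"
            + "<row r=\"2\"><c r=\"B2\"><v>2</v></c><c r=\"C2\"><v>40</v></c></row>"
            + "</sheetData></worksheet>";

    // Strips the row digits from a cell reference: "C12" -> "C", "AB7" -> "AB".
    static String columnOf(String cellRef) {
        int i = 0;
        while (i < cellRef.length() && Character.isLetter(cellRef.charAt(i))) {
            i++;
        }
        return cellRef.substring(0, i);
    }

    // Collects the raw <v> text of every cell in the wanted column, in document order.
    static List<String> valuesInColumn(String sheetXml, final String wantedColumn) {
        final List<String> values = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                private boolean inWantedCell;
                private boolean inValue;
                private final StringBuilder text = new StringBuilder();

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("c".equals(localName)) {
                        String ref = attrs.getValue("r");
                        inWantedCell = ref != null && wantedColumn.equals(columnOf(ref));
                    } else if ("v".equals(localName) && inWantedCell) {
                        inValue = true;
                        text.setLength(0);
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inValue) {
                        text.append(ch, start, length); // may be called more than once per value
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("v".equals(localName) && inValue) {
                        values.add(text.toString());
                        inValue = false;
                    } else if ("c".equals(localName)) {
                        inWantedCell = false;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(sheetXml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sheet XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(valuesInColumn(SAMPLE, "C")); // [30, 40]
    }
}
```

The parser still has to read the whole sheet stream, so this saves memory and downstream work rather than parsing time.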

Reading multiple tabs of a huge Excel file in Java using XSSF and the Event API

I am using this code (by author: lchen), which reads the contents of an Excel file in batches, based on the number of rows I pass to the method 'readRows()'.
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.InputSource;
public class TestLargeFileRead {
private int rowNum = 0;
private OPCPackage opcPkg;
private ReadOnlySharedStringsTable stringsTable;
private XMLStreamReader xmlReader;
public void XExcelFileReader(String excelPath) throws Exception {
opcPkg = OPCPackage.open(excelPath, PackageAccess.READ);
this.stringsTable = new ReadOnlySharedStringsTable(opcPkg);
XSSFReader xssfReader = new XSSFReader(opcPkg);
XMLInputFactory factory = XMLInputFactory.newInstance();
InputStream inputStream = xssfReader.getSheetsData().next();
xmlReader = factory.createXMLStreamReader(inputStream);
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("sheetData"))
break;
}
}
}
public int rowNum() {
return rowNum;
}
public List<String[]> readRows(int batchSize) throws XMLStreamException {
String elementName = "row";
List<String[]> dataRows = new ArrayList<String[]>();
if (batchSize > 0) {
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals(elementName)) {
rowNum++;
dataRows.add(getDataRow());
if (dataRows.size() == batchSize)
break;
}
}
}
}
return dataRows;
}
private String[] getDataRow() throws XMLStreamException {
List<String> rowValues = new ArrayList<String>();
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("c")) {
CellReference cellReference = new CellReference(
xmlReader.getAttributeValue(null, "r"));
// Fill in the possible blank cells!
while (rowValues.size() < cellReference.getCol()) {
rowValues.add("");
}
String cellType = xmlReader.getAttributeValue(null, "t");
rowValues.add(getCellValue(cellType));
}
} else if (xmlReader.isEndElement()
&& xmlReader.getLocalName().equals("row")) {
break;
}
}
return rowValues.toArray(new String[rowValues.size()]);
}
private String getCellValue(String cellType) throws XMLStreamException {
String value = ""; // by default
while (xmlReader.hasNext()) {
xmlReader.next();
if (xmlReader.isStartElement()) {
if (xmlReader.getLocalName().equals("v")) {
if (cellType != null && cellType.equals("s")) {
int idx = Integer.parseInt(xmlReader.getElementText());
return new XSSFRichTextString(
stringsTable.getEntryAt(idx)).toString();
} else {
return xmlReader.getElementText();
}
}
} else if (xmlReader.isEndElement()
&& xmlReader.getLocalName().equals("c")) {
break;
}
}
return value;
}
@Override
protected void finalize() throws Throwable {
if (opcPkg != null)
opcPkg.close();
super.finalize();
}
public static void main(String[] args) {
try {
TestLargeFileRead howto = new TestLargeFileRead();
howto.XExcelFileReader("D:\\TEMP_CATALOG\\H1.xlsx");
} catch (Exception e) {
e.printStackTrace();
}
}
}
But it reads only the first sheet's contents and ignores all subsequent sheets. My requirement is to read each sheet's name and, based on that name, read the sheet's contents. Can anyone help me customize the code above to fetch the sheet names and their contents?

The key class you need to work with, and tweak your use of, is XSSFReader. If you take a look at the Javadocs for it, you'll see it provides an Iterator of InputStreams of all the sheets, and a way to get at the root Workbook stream.
If you want to access all the sheets, you need to change these lines:
InputStream inputStream = xssfReader.getSheetsData().next();
xmlReader = factory.createXMLStreamReader(inputStream);
Into something more like:
Iterator<InputStream> sheetsData = xssfReader.getSheetsData();
while (sheetsData.hasNext()) {
InputStream inputStream = sheetsData.next();
xmlReader = factory.createXMLStreamReader(inputStream);
....
}
If you also want the sheet name, you'll want to do something like what is shown in the Apache POI XLSX event-based text extractor:
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
InputStream inputStream = iter.next();
String sheetName = iter.getSheetName();
if (sheetName.equalsIgnoreCase("TheSheetIWant")) {
xmlReader = factory.createXMLStreamReader(inputStream);
....
}
}
If you want to know more about doing this stuff, then one of the best examples, that's easy to read and follow, is XSSFEventBasedExcelExtractor that comes with Apache POI - read the code for that and learn!

Writing a large resultset to an Excel file using POI

This is along the lines of "Writing a large ResultSet to a File", but the file in question is an Excel file.
I'm using the Apache POI library to write an Excel file with a large data set retrieved from a ResultSet object. The data could range from a few thousand records to about 1 million; I'm not sure how this translates into bytes on disk in Excel format.
The following is test code I wrote to check the time taken to write such a large result set, and also the performance implications w.r.t. CPU and memory.
protected void writeResultsetToExcelFile(ResultSet rs, int numSheets, String fileNameAndPath) throws Exception {
BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(fileNameAndPath));
int numColumns = rs.getMetaData().getColumnCount();
Workbook wb = ExcelFileUtil.createExcelWorkBook(true, numSheets);
Row heading = wb.getSheetAt(0).createRow(1);
ResultSetMetaData rsmd = rs.getMetaData();
for(int x = 0; x < numColumns; x++) {
Cell cell = heading.createCell(x+1);
cell.setCellValue(rsmd.getColumnLabel(x+1));
}
int rowNumber = 2;
int sheetNumber = 0;
while(rs.next()) {
if(rowNumber == 65001) {
log("Sheet " + sheetNumber + "written; moving onto to sheet " + (sheetNumber + 1));
sheetNumber++;
rowNumber = 2;
}
Row row = wb.getSheetAt(sheetNumber).createRow(rowNumber);
for(int y = 0; y < numColumns; y++) {
row.createCell(y+1).setCellValue(rs.getString(y+1));
wb.write(bos);
}
rowNumber++;
}
//wb.write(bos);
bos.close();
}
Not much luck with the above code. The file which is created seems to grow rapidly (~70Mb per sec). So I stopped the execution after about 10 minutes (killed the JVM when the file reaches 7Gb) and tried to open the file in Excel 2007. The moment I open it, the file size becomes 8k(!) and only the header and the first row are created. Not sure what I'm missing here.
Any ideas?

Use SXSSF (POI 3.8):
package example;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
public class SXSSFexample {
public static void main(String[] args) throws Throwable {
FileInputStream inputStream = new FileInputStream("mytemplate.xlsx");
XSSFWorkbook wb_template = new XSSFWorkbook(inputStream);
inputStream.close();
SXSSFWorkbook wb = new SXSSFWorkbook(wb_template);
wb.setCompressTempFiles(true);
SXSSFSheet sh = (SXSSFSheet) wb.getSheetAt(0);
sh.setRandomAccessWindowSize(100);// keep 100 rows in memory, exceeding rows will be flushed to disk
for(int rownum = 4; rownum < 100000; rownum++){
Row row = sh.createRow(rownum);
for(int cellnum = 0; cellnum < 10; cellnum++){
Cell cell = row.createCell(cellnum);
String address = new CellReference(cell).formatAsString();
cell.setCellValue(address);
}
}
FileOutputStream out = new FileOutputStream("tempsxssf.xlsx");
wb.write(out);
out.close();
}
}
It requires:
poi-ooxml-3.8.jar,
poi-3.8.jar,
poi-ooxml-schemas-3.8.jar,
stax-api-1.0.1.jar,
xml-apis-1.0.b2.jar,
xmlbeans-2.3.0.jar,
commons-codec-1.5.jar,
dom4j-1.6.1.jar

Oh. I think you're writing the workbook out 944,000 times. Your wb.write(bos) call is in the inner loop. I'm not sure this is quite consistent with the semantics of the Workbook class? From what I can tell in the Javadocs of that class, that method writes out the entire workbook to the output stream specified. And it's gonna write out every row you've added so far once for every row as the thing grows.
This explains why you're seeing exactly 1 row, too. The first workbook (with one row) to be written out to the file is all that is being displayed - and then 7GB of junk thereafter.

Unless you have to write formulas or formatting you should consider writing out a .csv file. Infinitely simpler, infinitely faster, and Excel will do the conversion to .xls or .xlsx automatically and correctly by definition.
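If you go the CSV route, the one part that is easy to get wrong is quoting. A minimal sketch, assuming RFC 4180-style rules (quote a field that contains a comma, a quote or a line break, and double any embedded quotes) and that each row has already been read from the ResultSet into a list of strings; the class and method names are made up for the illustration:

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

class CsvLines {
    // Quotes a field only when it needs it; null becomes an empty field.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String doubled = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + doubled + "\"" : doubled;
    }

    // Builds one CSV record, terminated with CRLF.
    static String toLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields.get(i)));
        }
        return line.append("\r\n").toString();
    }

    // Streams rows to the writer one at a time, so nothing accumulates in memory.
    static void writeAll(Writer out, Iterable<List<String>> rows) throws IOException {
        for (List<String> row : rows) {
            out.write(toLine(row));
        }
    }

    public static void main(String[] args) {
        System.out.print(toLine(List.of("id", "name", "note")));
        System.out.print(toLine(List.of("1", "Smith, Al", "said \"hi\"")));
    }
}
```

Wrapping the target file in a BufferedWriter and calling writeAll once keeps memory use flat regardless of the number of rows.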

You can use the SXSSFWorkbook implementation of Workbook. If you use styles in your Excel file, you can cache them using the Flyweight pattern to improve performance.
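The caching idea can be sketched independently of POI: create each distinct style once, keyed by whatever describes it, and hand the same instance to every cell that needs it. The StyleCache class below is hypothetical and generic; with POI, the factory function would be the place that calls workbook.createCellStyle() and configures the result.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class StyleCache<K, S> {
    private final Map<K, S> cache = new HashMap<>();
    private final Function<K, S> factory;
    private int created;

    StyleCache(Function<K, S> factory) {
        this.factory = factory;
    }

    // Returns the shared instance for this key, creating it on first use only.
    S get(K key) {
        return cache.computeIfAbsent(key, k -> {
            created++;
            return factory.apply(k);
        });
    }

    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        // Strings stand in for real style objects here.
        StyleCache<String, String> styles = new StyleCache<>(key -> "style:" + key);
        for (int row = 0; row < 100000; row++) {
            styles.get("date");
            styles.get("currency");
        }
        System.out.println(styles.createdCount()); // 2
    }
}
```

Without the cache, a loop like the one in main would ask for 200,000 style objects instead of 2.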

You can increase the performance of excel export by following these steps:
1) When you fetch data from the database, avoid mapping the result set to a list of entity classes. Instead, assign it directly to a List<Object[]>:
List<Object[]> resultList =session.createSQLQuery("SELECT t1.employee_name, t1.employee_id ... from t_employee t1 ").list();
instead of
List<Employee> employeeList =session.createSQLQuery("SELECT t1.employee_name, t1.employee_id ... from t_employee t1 ").list();
2) Create excel workbook object using SXSSFWorkbook instead of XSSFWorkbook and create new row using SXSSFRow when the data is not empty.
3) Use java.util.Iterator to iterate the data list.
Iterator itr = resultList.iterator();
4) Write data into excel using column++.
int rowCount = 0;
while(itr.hasNext()){
SXSSFRow row = xssfSheet.createRow(rowCount++);
Object[] object = (Object[]) itr.next();
int column = 0; // restart at the first column for every row
//column 1
row.createCell(column).setCellValue(String.valueOf(object[column])); // apply the required cell style here
column++;
//column 2
row.createCell(column).setCellValue(String.valueOf(object[column]));
column++;
itr.remove();
}
5) While iterating over the list, write each row into the Excel sheet and then remove it from the list using the remove method. This avoids holding on to data that is no longer needed and frees up Java heap space.
itr.remove();

For now I took @Gian's advice and limited the number of records per Workbook to 500k, rolling the rest over to the next Workbook. It seems to be working decently. With the above configuration, it took about 10 minutes per workbook.

I updated BigGridDemo to support multiple sheets.
BigExcelWriterImpl.java
package com.gdais.common.apache.poi.bigexcelwriter;
import static com.google.common.base.Preconditions.*;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import org.apache.commons.io.FilenameUtils;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import com.google.common.base.Function;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Iterables;
public class BigExcelWriterImpl implements BigExcelWriter {
private static final String XML_ENCODING = "UTF-8";
@Nonnull
private final File outputFile;
@Nullable
private final File tempFileOutputDir;
@Nullable
private File templateFile = null;
@Nullable
private XSSFWorkbook workbook = null;
@Nonnull
private LinkedHashMap<String, XSSFSheet> addedSheets = new LinkedHashMap<String, XSSFSheet>();
@Nonnull
private Map<XSSFSheet, File> sheetTempFiles = new HashMap<XSSFSheet, File>();
BigExcelWriterImpl(@Nonnull File outputFile) {
this.outputFile = outputFile;
this.tempFileOutputDir = outputFile.getParentFile();
}
@Override
public BigExcelWriter createWorkbook() {
workbook = new XSSFWorkbook();
return this;
}
@Override
public BigExcelWriter addSheets(String... sheetNames) {
checkState(workbook != null, "workbook must be created before adding sheets");
for (String sheetName : sheetNames) {
XSSFSheet sheet = workbook.createSheet(sheetName);
addedSheets.put(sheetName, sheet);
}
return this;
}
@Override
public BigExcelWriter writeWorkbookTemplate() throws IOException {
checkState(workbook != null, "workbook must be created before writing template");
checkState(templateFile == null, "template file already written");
templateFile = File.createTempFile(FilenameUtils.removeExtension(outputFile.getName())
+ "-template", ".xlsx", tempFileOutputDir);
System.out.println(templateFile);
FileOutputStream os = new FileOutputStream(templateFile);
try {
workbook.write(os);
} finally {
// Close the stream even if writing the template fails.
os.close();
}
return this;
}
@Override
public SpreadsheetWriter createSpreadsheetWriter(String sheetName) throws IOException {
if (!addedSheets.containsKey(sheetName)) {
addSheets(sheetName);
}
return createSpreadsheetWriter(addedSheets.get(sheetName));
}
@Override
public SpreadsheetWriter createSpreadsheetWriter(XSSFSheet sheet) throws IOException {
checkState(!sheetTempFiles.containsKey(sheet), "writer already created for this sheet");
File tempSheetFile = File.createTempFile(
FilenameUtils.removeExtension(outputFile.getName())
+ "-sheet" + sheet.getSheetName(), ".xml", tempFileOutputDir);
Writer out = null;
try {
out = new OutputStreamWriter(new FileOutputStream(tempSheetFile), XML_ENCODING);
SpreadsheetWriter sw = new SpreadsheetWriterImpl(out);
sheetTempFiles.put(sheet, tempSheetFile);
return sw;
} catch (RuntimeException e) {
if (out != null) {
out.close();
}
throw e;
}
}
private static Function<XSSFSheet, String> getSheetName = new Function<XSSFSheet, String>() {
@Override
public String apply(XSSFSheet sheet) {
return sheet.getPackagePart().getPartName().getName().substring(1);
}
};
@Override
public File completeWorkbook() throws IOException {
ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(outputFile));
try {
Iterable<String> sheetEntries = Iterables.transform(sheetTempFiles.keySet(),
getSheetName);
System.out.println("Sheet Entries: " + sheetEntries);
// Copy everything from the template except the placeholder sheets,
// then stream each generated sheet XML into its place.
copyTemplateMinusEntries(templateFile, zos, sheetEntries);
for (Map.Entry<XSSFSheet, File> entry : sheetTempFiles.entrySet()) {
XSSFSheet sheet = entry.getKey();
substituteSheet(entry.getValue(), getSheetName.apply(sheet), zos);
}
} finally {
// Closing the zip stream also closes the underlying file stream.
zos.close();
}
return outputFile;
}
private static void copyTemplateMinusEntries(File templateFile,
ZipOutputStream zos, Iterable<String> entries) throws IOException {
ZipFile templateZip = new ZipFile(templateFile);
try {
Enumeration<? extends ZipEntry> en = templateZip.entries();
while (en.hasMoreElements()) {
ZipEntry ze = en.nextElement();
if (!Iterables.contains(entries, ze.getName())) {
System.out.println("Adding template entry: " + ze.getName());
zos.putNextEntry(new ZipEntry(ze.getName()));
InputStream is = templateZip.getInputStream(ze);
copyStream(is, zos);
is.close();
}
}
} finally {
// Release the template file handle once all entries are copied.
templateZip.close();
}
}
private static void substituteSheet(File tmpfile, String entry,
ZipOutputStream zos)
throws IOException {
System.out.println("Adding sheet entry: " + entry);
zos.putNextEntry(new ZipEntry(entry));
InputStream is = new FileInputStream(tmpfile);
copyStream(is, zos);
is.close();
}
private static void copyStream(InputStream in, OutputStream out) throws IOException {
byte[] chunk = new byte[1024];
int count;
while ((count = in.read(chunk)) >= 0) {
out.write(chunk, 0, count);
}
}
@Override
public Workbook getWorkbook() {
return workbook;
}
@Override
public ImmutableList<XSSFSheet> getSheets() {
return ImmutableList.copyOf(addedSheets.values());
}
}
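The core trick in `completeWorkbook` is zip substitution: copy every entry of the template except the sheet parts, then write the generated sheet XML under the same entry names. Below is a small sketch of that idea using only `java.util.zip` and in-memory byte arrays instead of temp files. The class `ZipSubstitutionSketch` and its helper methods are hypothetical names for this illustration, and the entry contents are dummy strings, not real workbook parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSubstitutionSketch {

    /** Copies every template entry not named in replacements, then appends the replacements. */
    public static byte[] substitute(byte[] templateZip, Map<String, String> replacements) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(templateZip));
             ZipOutputStream zos = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (replacements.containsKey(entry.getName())) {
                    continue; // placeholder part: dropped here, rewritten below
                }
                zos.putNextEntry(new ZipEntry(entry.getName()));
                copy(zin, zos);
                zos.closeEntry();
            }
            for (Map.Entry<String, String> r : replacements.entrySet()) {
                zos.putNextEntry(new ZipEntry(r.getKey()));
                zos.write(r.getValue().getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    /** Builds an in-memory zip from name -> text pairs (stands in for the template file). */
    public static byte[] zipOf(Map<String, String> entries) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(result)) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                zos.putNextEntry(new ZipEntry(e.getKey()));
                zos.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    /** Reads a zip back into an ordered name -> text map, for inspection. */
    public static Map<String, String> read(byte[] zip) {
        Map<String, String> out = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                copy(zin, buf);
                out.put(entry.getName(), new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[1024];
        int count;
        while ((count = in.read(chunk)) >= 0) {
            out.write(chunk, 0, count);
        }
    }

    public static void main(String[] args) {
        Map<String, String> template = new LinkedHashMap<>();
        template.put("xl/workbook.xml", "<workbook/>");
        template.put("xl/worksheets/sheet1.xml", "<placeholder/>");
        Map<String, String> generated = new LinkedHashMap<>();
        generated.put("xl/worksheets/sheet1.xml", "<worksheet/>");
        System.out.println(read(substitute(zipOf(template), generated)));
    }
}
```

The real writer does the same thing with files on disk, which is what keeps memory flat: the large sheet XML is only ever streamed, never held as an object model.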
SpreadsheetWriterImpl.java
package com.gdais.common.apache.poi.bigexcelwriter;
import java.io.IOException;
import java.io.Writer;
import java.util.Calendar;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.util.CellReference;
class SpreadsheetWriterImpl implements SpreadsheetWriter {
private static final String XML_ENCODING = "UTF-8";
private final Writer _out;
private int _rownum;
SpreadsheetWriterImpl(Writer out) {
_out = out;
}
@Override
public SpreadsheetWriter closeFile() throws IOException {
_out.close();
return this;
}
@Override
public SpreadsheetWriter beginSheet() throws IOException {
_out.write("<?xml version=\"1.0\" encoding=\""
+ XML_ENCODING
+ "\"?>"
+
"<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">");
_out.write("<sheetData>\n");
return this;
}
@Override
public SpreadsheetWriter endSheet() throws IOException {
_out.write("</sheetData>");
_out.write("</worksheet>");
closeFile();
return this;
}
/**
* Insert a new row
*
* @param rownum
* 0-based row number
*/
@Override
public SpreadsheetWriter insertRow(int rownum) throws IOException {
_out.write("<row r=\"" + (rownum + 1) + "\">\n");
this._rownum = rownum;
return this;
}
/**
* Insert row end marker
*/
@Override
public SpreadsheetWriter endRow() throws IOException {
_out.write("</row>\n");
return this;
}
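@Override
public SpreadsheetWriter createCell(int columnIndex, String value, int styleIndex)
throws IOException {
String ref = new CellReference(_rownum, columnIndex).formatAsString();
_out.write("<c r=\"" + ref + "\" t=\"inlineStr\"");
if (styleIndex != -1) {
_out.write(" s=\"" + styleIndex + "\"");
}
_out.write(">");
// Escape XML special characters so values such as "R&D" or "a < b" keep the sheet well-formed.
String escaped = String.valueOf(value).replace("&", "&amp;").replace("<", "&lt;")
.replace(">", "&gt;");
_out.write("<is><t>" + escaped + "</t></is>");
_out.write("</c>");
return this;
}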
@Override
public SpreadsheetWriter createCell(int columnIndex, String value) throws IOException {
createCell(columnIndex, value, -1);
return this;
}
@Override
public SpreadsheetWriter createCell(int columnIndex, double value, int styleIndex)
throws IOException {
String ref = new CellReference(_rownum, columnIndex).formatAsString();
_out.write("<c r=\"" + ref + "\" t=\"n\"");
if (styleIndex != -1) {
_out.write(" s=\"" + styleIndex + "\"");
}
_out.write(">");
_out.write("<v>" + value + "</v>");
_out.write("</c>");
return this;
}
@Override
public SpreadsheetWriter createCell(int columnIndex, double value) throws IOException {
createCell(columnIndex, value, -1);
return this;
}
@Override
public SpreadsheetWriter createCell(int columnIndex, Calendar value, int styleIndex)
throws IOException {
createCell(columnIndex, DateUtil.getExcelDate(value, false), styleIndex);
return this;
}
@Override
public SpreadsheetWriter createCell(int columnIndex, Calendar value)
throws IOException {
createCell(columnIndex, value, -1);
return this;
}
}
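To see what the writer emits for one row, here is a POI-free sketch that builds the same `<row>`/`<c>` markup as a string. It assumes inline-string cells only and computes the A1-style reference by hand instead of using `CellReference`. The class `InlineStringRowSketch` and its methods are hypothetical helpers for illustration.

```java
public class InlineStringRowSketch {

    /** Converts a 0-based column index to Excel letters: 0 -> A, 25 -> Z, 26 -> AA. */
    public static String columnLetters(int columnIndex) {
        StringBuilder sb = new StringBuilder();
        for (int n = columnIndex + 1; n > 0; n = (n - 1) / 26) {
            sb.insert(0, (char) ('A' + (n - 1) % 26));
        }
        return sb.toString();
    }

    /** Escapes the characters that are significant in XML text content. */
    public static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one row of inline-string cells; rowIndex is 0-based, the r attribute is 1-based. */
    public static String row(int rowIndex, String... values) {
        StringBuilder xml = new StringBuilder("<row r=\"" + (rowIndex + 1) + "\">");
        for (int col = 0; col < values.length; col++) {
            xml.append("<c r=\"").append(columnLetters(col)).append(rowIndex + 1)
               .append("\" t=\"inlineStr\"><is><t>")
               .append(escapeXml(values[col]))
               .append("</t></is></c>");
        }
        return xml.append("</row>").toString();
    }

    public static void main(String[] args) {
        System.out.println(row(0, "R&D", "a < b"));
    }
}
```

Inline strings avoid the shared strings table entirely, which is why the writer never needs to touch `sharedStrings.xml` while streaming.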

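The `Calendar` overloads store dates as plain numbers, so a date only shows up as a date in Excel if the cell's style index points at a date format. As a rough sketch of the number involved, assuming the 1900 date system, no time zone handling, and dates on or after 1 March 1900, the serial value is the day count from 30 December 1899 plus the fraction of the day. The class `ExcelSerialDateSketch` is a hypothetical stand-in and not POI's `DateUtil`.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class ExcelSerialDateSketch {

    // Day zero that lines up with Excel serial numbers from 1900-03-01 onward (assumption stated above).
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Whole days since the epoch plus the elapsed fraction of the current day. */
    public static double toSerial(LocalDateTime value) {
        long days = ChronoUnit.DAYS.between(EPOCH, value.toLocalDate());
        double fraction = value.toLocalTime().toSecondOfDay() / 86_400.0;
        return days + fraction;
    }

    public static void main(String[] args) {
        // Noon on 1 January 2000 prints 36526.5
        System.out.println(toSerial(LocalDateTime.of(2000, 1, 1, 12, 0)));
    }
}
```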