Java: Read specific columns too memory consuming

I need to read a large Excel file (50,000 rows and 20 columns) using the Apache POI library. There is another question that asks exactly the same thing. My attempted approach is as follows:
public static ArrayList<Double> readColumn(String excelFile,String sheetName, int columnNumber)
{
ArrayList<Double> excelData = new ArrayList<>();
XSSFWorkbook workbook = null;
try
{
workbook = new XSSFWorkbook(excelFile);
} catch (IOException e)
{
e.printStackTrace();
}
Sheet sheet = workbook.getSheet(sheetName);
for (int i = 0; i <= sheet.getLastRowNum(); i++)
{
Row row = sheet.getRow(i);
if (row != null) {
Cell cell = row.getCell(columnNumber);
if (cell != null)
{
// Skip cells that are not numeric
if (cell.getCellTypeEnum() == CellType.NUMERIC)
{
excelData.add(cell.getNumericCellValue());
System.out.println(cell.getNumericCellValue());
}
}
}
}
return excelData;
}
Unfortunately, while this method seems to work for a low column index (e.g. columnNumber = 1), I get an OutOfMemoryError for a large columnNumber. The file itself is not large enough to make my computer run out of memory, and I can achieve the same outcome in Python with very little memory. Is there a better way to solve this? Or is there a Java library that would let me do this?
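One way to keep memory flat, sketched here with JDK classes only rather than POI: an .xlsx file is a zip archive whose sheets are XML parts, so a single column can be pulled out with a streaming parser without building the whole workbook in memory. POI's own event API (XSSFReader) follows the same idea. The class name ColumnStreamReader, its method names, and the part name supplied by the caller (for example xl/worksheets/sheet1.xml) are assumptions made for illustration. The sketch only handles plain numeric cells that carry an r attribute; strings, formulas, and the mapping from a sheet name to its part are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Apache POI.
final class ColumnStreamReader {

    // Converts the letter part of a cell reference such as "T12" to a 0-based column index.
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = cellRef.charAt(i);
            if (ch < 'A' || ch > 'Z') {
                break; // stop at the first non-letter (the row number)
            }
            col = col * 26 + (ch - 'A' + 1);
        }
        return col - 1;
    }

    // Streams one worksheet part and collects the numeric values of a single column.
    static List<Double> readNumericColumn(String xlsxPath, String sheetEntry, int columnNumber)
            throws IOException, XMLStreamException {
        List<Double> values = new ArrayList<>();
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            ZipEntry entry = zip.getEntry(sheetEntry);
            if (entry == null) {
                throw new IOException("No such entry: " + sheetEntry);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
                boolean wanted = false; // true while inside a numeric cell of the requested column
                while (xml.hasNext()) {
                    int event = xml.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = xml.getLocalName();
                        if (name.equals("c")) {
                            String ref = xml.getAttributeValue(null, "r");
                            String type = xml.getAttributeValue(null, "t");
                            // A missing t attribute (or t="n") marks a numeric cell.
                            wanted = ref != null
                                    && columnIndex(ref) == columnNumber
                                    && (type == null || type.equals("n"));
                        } else if (name.equals("f")) {
                            wanted = false; // formula cell: skipped in this sketch
                        } else if (name.equals("v") && wanted) {
                            values.add(Double.parseDouble(xml.getElementText()));
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && xml.getLocalName().equals("c")) {
                        wanted = false;
                    }
                }
                xml.close();
            }
        }
        return values;
    }
}
```

A call such as ColumnStreamReader.readNumericColumn(excelFile, "xl/worksheets/sheet1.xml", 19) would collect column T of that part; only one cell is held in memory at a time, so the column index should not affect memory use.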

Related

What is causing my program to bog down when writing to XSSF Workbook?

Maybe "writing" wasn't the correct word since in this function, I am just setting the cells and then writing afterwards.
I have a function that I have pinpointed as the cause of the slowdown. When execution reaches this function, it spends over 10 minutes there before I terminate it.
This is the function that I am passing an output_wb to:
private static void buildRowsByListOfRows(int sheetNumber, ArrayList<Row> sheet, Workbook wb) {
Sheet worksheet = wb.getSheetAt(sheetNumber);
int lastRow;
Row row;
String cell_value;
Cell cell;
int x = 0;
System.out.println("Size of array list: " + sheet.size());
for (Row my_row : sheet) {
try {
lastRow = worksheet.getLastRowNum();
row = worksheet.createRow(++lastRow);
for (int i = 0; i < my_row.getLastCellNum(); i++) {
cell_value = getCellContentAsString(my_row.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK));
cell = row.createCell(i);
cell.setCellValue(cell_value);
System.out.println("setting row #: " + x + " with value => " + cell_value);
}
x++;
} catch (Exception e) {
System.out.println("SOMETHING WENT WRONG");
System.out.println(e);
}
}
}
The size of the ArrayList is 73,835. It starts off running pretty fast, but around row 20,000 you can see the print statements in the loop getting spread further and further apart. Each row has 70 columns.
Is this function really written that poorly or is something else going on?
What can I do to optimize this?
I create the output workbook like this, in case it matters:
// Create output file with the required sheets
createOutputXLSFile(output_filename_path);
XSSFWorkbook output_wb = new XSSFWorkbook(new FileInputStream(output_filename_path));
And the createOutputXLSFile() looks like this:
private static void createOutputXLSFile(String output_filename_path) throws FileNotFoundException {
try {
// Directory path where the xls file will be created
// Create object of FileOutputStream
FileOutputStream fout = new FileOutputStream(output_filename_path);
XSSFWorkbook wb = new XSSFWorkbook();
wb.createSheet("Removed records");
wb.createSheet("Added records");
wb.createSheet("Updated records");
// Build the Excel File
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
wb.write(outputStream);
outputStream.writeTo(fout);
outputStream.close();
fout.close();
wb.close();
} catch (IOException e) {
e.printStackTrace();
}
}
private static String getCellContentAsString(Cell cell) {
DataFormatter fmt = new DataFormatter();
String data = null;
if (cell.getCellType() == CellType.STRING) {
data = String.valueOf(cell.getStringCellValue());
} else if (cell.getCellType() == CellType.NUMERIC) {
data = String.valueOf(fmt.formatCellValue(cell));
} else if (cell.getCellType() == CellType.BOOLEAN) {
data = String.valueOf(fmt.formatCellValue(cell));
} else if (cell.getCellType() == CellType.ERROR) {
data = String.valueOf(cell.getErrorCellValue());
} else if (cell.getCellType() == CellType.BLANK) {
data = String.valueOf(cell.getStringCellValue());
} else if (cell.getCellType() == CellType._NONE) {
data = String.valueOf(cell.getStringCellValue());
}
return data;
}
Update #1 - It seems to be happening here. If I comment out all three lines, it finishes:
cell_value = getCellContentAsString(my_row.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK));
cell = row.createCell(i);
cell.setCellValue(cell_value);
Update #2 - If I comment out these two lines, then the loop finishes as expected:
cell = row.createCell(i); // The problem
cell.setCellValue(cell_value);
So now I know the problem is row.createCell(i), but why? How can I optimize this?

I finally managed to resolve this issue. It turns out that using XSSF to write is just too slow when the files are large, so I converted the XSSF output workbook to an SXSSFWorkbook. To do that, I passed my existing XSSFWorkbook into the SXSSFWorkbook constructor like this:
// Create output file with the required sheets
createOutputXLSFile(output_filename_path);
XSSFWorkbook output_wb_temp = new XSSFWorkbook(new FileInputStream(output_filename_path));
SXSSFWorkbook output_wb = new SXSSFWorkbook(output_wb_temp);
The rest of the code works as is.

Writing data to an excel template with many formula and styles and color takes too much time

I'm trying to write data to an Excel template that contains many formulas and styles. If I have the formulas and formatting copied to 100 rows * 601 cells, the Excel file downloads within a second after the data is written. But as soon as I stretch the template to, say, 3000 rows * 601 cells, even if I write only 70 rows, it takes forever and hangs the server. This happens because workbook.write(outputstream) takes too much time.
If I increase the heap size, it works for 1500 rows, which takes around 15 to 20 seconds to download.
I have tried SXSSFWorkbook, but it doesn't allow me to write to existing rows. To get around this, I keep one XSSFWorkbook to read data and one SXSSFWorkbook to write. The problem is that I was unable to carry the colors and styles over from my actual template to the other workbook, and the data also gets messed up.
The following works, but I have to increase the heap size as the number of rows grows. I have tested it for 1500 rows; the resulting file size is 3.5 MB.
private Workbook dumpDataToMaster(List<Map<String, Object>> excelData,Map<String, String> fileType) throws IOException {
Workbook workbook = new XSSFWorkbook(this.getClass().getResourceAsStream("/excel_templates/"+fileType.get("file")));
Sheet sheet = workbook.getSheet(fileType.get("sheetName"));
Iterator<Cell> headerItr = sheet.getRow(3).iterator();
List<String> headers = new ArrayList<>();
while(headerItr.hasNext()) {
headers.add(headerItr.next().getStringCellValue());
}
int headersSize = headers.size();
int excelDataSize = excelData.size();
IntStream.range(0,excelDataSize).forEach(eIndex->{
Row dataRow = sheet.getRow(eIndex+4);
IntStream.range(0,headersSize).forEach(index->{
Object value = excelData.get(eIndex).get(headers.get(index));
if(value != null) {
if (value instanceof String) {
dataRow.getCell(index).setCellValue((String)value);
} else if(value instanceof Number) {
dataRow.getCell(index).setCellValue(((Number)value).doubleValue());
} else if(value instanceof Date) {
dataRow.getCell(index).setCellValue((Date)value);
}
}
});
});
sheet.getRow(1).getCell(1).setCellValue(Calendar.getInstance().getTime());
workbook.setForceFormulaRecalculation(true);
return workbook;
}
The controller for the above method looks like this:
public void downloadDDTemplate(@RequestParam String values, @OrgModelAttribute("sessionElements") SessionElements sessionElements, HttpServletResponse response) throws IOException {
OrderTemplateType orderTemplateType = objectMapper.readValue(values, OrderTemplateType.class);
response.setHeader(Constant.CONTENT_DISPOSITION,
"attachment; filename=\"" + orderTemplateType.getDisplay()+" - Template - "
+ DateTimeFormatConfiguration.getFileNameDateFormat().format(Calendar.getInstance().getTime())
+ ".xlsx\"");
response.setContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
orderMasterService.downloadDDTemplate(orderTemplateType).write(response.getOutputStream());
response.flushBuffer();
}
The second method I have tried is:
private Workbook dumpDataToMaster(List<Map<String, Object>> excelData,Map<String, String> fileType, HttpServletResponse response) throws IOException, InvalidFormatException {
Workbook workbook = new XSSFWorkbook(this.getClass().getResourceAsStream("/excel_templates/"+fileType.get("file")));
XSSFWorkbook readWb = new XSSFWorkbook(workbook);
Sheet readSheet = readWb.getSheet(fileType.get("sheetName"));
//inputStream.close();
SXSSFWorkbook writeWb = new SXSSFWorkbook();
writeWb.setCompressTempFiles(true);
SXSSFSheet writeSheet = writeWb.createSheet(fileType.get("sheetName"));//(SXSSFSheet) wb.getSheet(fileType.get("sheetName"));
writeSheet.setRandomAccessWindowSize(1600);
Iterator<Cell> headerItr = readSheet.getRow(3).iterator();
List<String> headers = new ArrayList<>();
CellStyle dateStyle = writeWb.createCellStyle();
CreationHelper createHelper = writeWb.getCreationHelper();
dateStyle.setDataFormat(
createHelper.createDataFormat().getFormat("dd-mm-yyyy"));
while(headerItr.hasNext()) {
String value = headerItr.next().getStringCellValue();
headers.add(value);
Row auxWriteRow = writeSheet.createRow(3);
Cell auxWriteCell = auxWriteRow.createCell(headers.size()-1);
auxWriteCell.setCellValue(value);
}
int headersSize = headers.size();
int excelDataSize = excelData.size();
IntStream.range(0,excelDataSize).forEach(eIndex->{
Row dataRow = readSheet.getRow(eIndex+4);
Row writeRow = writeSheet.createRow(eIndex+4);
IntStream.range(0,headersSize).forEach(index->{
Cell writeCell = writeRow.createCell(index);
Cell readCell = dataRow.getCell(index);
Object value = excelData.get(eIndex).get(headers.get(index));
if(readCell != null && CellType.FORMULA.equals(readCell.getCellType())) {
writeCell.setCellFormula(readCell.getCellFormula());
}
writeCell.getCellStyle();
if(value != null) {
if (value instanceof String) {
XSSFCellStyle newCellStyle = (XSSFCellStyle) writeCell.getSheet().getWorkbook().createCellStyle();
writeCell.setCellStyle(copyCellStyle((XSSFCell)readCell, newCellStyle));
writeCell.setCellType(readCell.getCellType());
writeCell.setCellValue((String)value);
} else if(value instanceof Number) {
XSSFCellStyle newCellStyle = (XSSFCellStyle) writeCell.getSheet().getWorkbook().createCellStyle();
writeCell.setCellStyle(copyCellStyle((XSSFCell)readCell, newCellStyle));
writeCell.setCellStyle(readCell.getCellStyle());
writeCell.setCellType(readCell.getCellType());
writeCell.setCellValue(((Number)value).doubleValue());
} else if(value instanceof Date) {
writeCell.setCellStyle(dateStyle);
writeCell.setCellType(readCell.getCellType());
writeCell.setCellValue((Date)value);
}
}
});
});
int lastRow = sheet.getLastRowNum();
for (int eIndex=0; eIndex<=lastRow; eIndex++) {
Row dataRow = sheet.getRow(eIndex+excelDataSize+4);
if(dataRow == null)
break;
sheet.removeRow(dataRow);
}
Cell cell = writeSheet.createRow(1).createCell(1);
cell.setCellStyle(dateStyle);
cell.setCellValue(Calendar.getInstance().getTime());
XSSFFormulaEvaluator.evaluateAllFormulaCells(workbook);
writeWb.setForceFormulaRecalculation(true);
writeWb.write(response.getOutputStream());
writeWb.close();
readWb.close();
return writeWb;
}
Is there any way to achieve this faster and with a smaller memory footprint?

Apache POI copy one row from Excel file to another new Excel File

I am using Java 8 and Apache POI 3.17. I have an Excel file and I want to keep only a few rows and delete the others. But my Excel file has 40K rows, and deleting them one by one takes a long time (nearly 30 minutes).
So I tried to change my approach. Now I think it's better to take only the rows I need from the source Excel file and copy them to a new one. But what I have tried so far is not working.
I collect all the rows I want to keep in a List. But this does not work and produces a blank Excel file:
public void createExcelFileFromLog (Path logRejetFilePath, Path fichierInterdits) throws IOException {
Map<Integer, Integer> mapLigneColonne = getRowAndColumnInError(logRejetFilePath);
Workbook sourceWorkbook = WorkbookFactory.create(new File(fichierInterdits.toAbsolutePath().toString()));
Sheet sourceSheet = sourceWorkbook.getSheetAt(0);
List<Row> listLignes = new ArrayList<Row>();
// get Rows from source Excel
for (Map.Entry<Integer, Integer> entry : mapLigneColonne.entrySet()) {
listLignes.add(sourceSheet.getRow(entry.getKey()-1));
}
// The new Excel
Workbook workbookToWrite = new XSSFWorkbook();
Sheet sheetToWrite = workbookToWrite.createSheet("Interdits en erreur");
// Copy Rows
Integer i = 0;
for (Row row : listLignes) {
copyRow(sheetToWrite, row, i);
i++;
}
FileOutputStream fos = new FileOutputStream(config.getDossierTemporaire() + "Interdits_en_erreur.xlsx");
workbookToWrite.write(fos);
workbookToWrite.close();
sourceWorkbook.close();
}
private static void copyRow(Sheet newSheet, Row sourceRow, int newRowNum) {
Row newRow = newSheet.createRow(newRowNum);
newRow = sourceRow;
}
EDIT: Changing the copyRow method works better, but the dates have a weird format and blank cells from the original row are gone.
private static void copyRow(Sheet newSheet, Row sourceRow, int newRowNum) {
Row newRow = newSheet.createRow(newRowNum);
Integer i = 0;
for (Cell cell : sourceRow) {
if(cell.getCellTypeEnum() == CellType.NUMERIC) {
newRow.createCell(i).setCellValue(cell.getDateCellValue());
} else {
newRow.createCell(i).setCellValue(cell.getStringCellValue());
}
i++;
}
}
EDIT 2: To keep blank cells:
private static void copyRow(Sheet newSheet, Row sourceRow, Integer newRowNum, Integer cellToColor) {
Row newRow = newSheet.createRow(newRowNum);
//Integer i = 0;
int lastColumn = Math.max(sourceRow.getLastCellNum(), 0);
for(int i = 0; i < lastColumn; i++) {
Cell oldCell = sourceRow.getCell(i, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
if(oldCell == null) {
newRow.createCell(i).setCellValue("");
} else if (oldCell.getCellTypeEnum() == CellType.NUMERIC) {
newRow.createCell(i).setCellValue(oldCell.getDateCellValue());
} else {
newRow.createCell(i).setCellValue(oldCell.getStringCellValue());
}
}
}

Inserting data from arraylist in chunks in excel file using apache poi

I have an ArrayList of data in the following format:
ArrayList<ArrayList<String>> listResultData. The collection contains around 11k+ rows to be inserted into the Excel file.
When I inserted these 11,490 rows into Excel, it took 6 hours, which is very bad performance. Is there any way to insert the data into Excel in chunks of 1000 rows at a time (something like executeBatch() in SQL for inserting records)? Each row contains 4-5 columns.
Following is the code I have been using:
public boolean setArrayListData(String sheetName, ArrayList<ArrayList<String>> listResultData) {
try {
fis = new FileInputStream(path);
workbook = new XSSFWorkbook(fis);
int index = workbook.getSheetIndex(sheetName);
if (index == -1)
return false;
sheet = workbook.getSheetAt(index);
int colNum = 0;
int rowNum = this.getRowCount(sheetName);
rowNum++;
for (ArrayList<String> al : listResultData) {
for (String s : al) {
sheet.autoSizeColumn(colNum);
row = sheet.getRow(rowNum - 1);
if (row == null)
row = sheet.createRow(rowNum - 1);
cell = row.getCell(colNum);
if (cell == null)
cell = row.createCell(colNum);
// cell style
// CellStyle cs = workbook.createCellStyle();
// cs.setWrapText(true);
// cell.setCellStyle(cs);
cell.setCellValue(s);
//System.out.print("Cell Value :: "+s);
colNum++;
}
rowNum++;
colNum = 0;
//System.out.println("");
}
fileOut = new FileOutputStream(path);
workbook.write(fileOut);
fileOut.close();
workbook.close();
fis.close();
} catch (Exception e) {
e.printStackTrace();
return false;
}
return true;
}
Please suggest a solution.
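The "chunks of 1000 rows" idea from the question can be sketched independently of POI as a plain list partition. RowBatches and partition are hypothetical names, and this only shows the batching itself; splitting the input does not by itself change how a workbook keeps rows in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class RowBatches {

    // Splits rows into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // subList returns a view of the original list, so no row data is copied.
            batches.add(rows.subList(start, end));
        }
        return batches;
    }
}
```

With the 11,490 rows mentioned above and a batch size of 1000, this yields 12 batches, the last one holding 490 rows.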

Instead of XSSF, you may want to try SXSSF, the streaming extension of XSSF. In contrast to XSSF, where you have access to all rows in the document, which can lead to performance or heap space issues, SXSSF lets you define a sliding window and limits access to the rows in that window. You can specify the window size at construction time of your workbook using new SXSSFWorkbook(int windowSize). As you then create rows and the number of rows exceeds the specified window size, the row with the lowest index is flushed and is no longer in memory.
Find further information at SXSSF (Streaming Usermodel API).
Example:
// keep 100 rows in memory, exceeding rows will be flushed to disk
SXSSFWorkbook wb = new SXSSFWorkbook(100);
Sheet sh = wb.createSheet();
for(int rownum = 0; rownum < 1000; rownum++){
//When the row count reaches 101, the row with rownum=0 is flushed to disk and removed from memory,
//when rownum reaches 102 then the row with rownum=1 is flushed, etc.
Row row = sh.createRow(rownum);
for(int cellnum = 0; cellnum < 10; cellnum++){
Cell cell = row.createCell(cellnum);
String address = new CellReference(cell).formatAsString();
cell.setCellValue(address);
}
}
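To make the window behaviour described above concrete, here is a toy model using only JDK collections. It is not POI's implementation: RowWindowDemo and its methods are hypothetical, rows are represented by their index only, and "flushing" just moves the index to a list instead of writing to disk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a sliding row window; hypothetical, not part of Apache POI.
final class RowWindowDemo {
    private final int windowSize;
    private final Deque<Integer> inMemory = new ArrayDeque<>(); // rows still accessible
    private final List<Integer> flushed = new ArrayList<>();    // rows no longer accessible

    RowWindowDemo(int windowSize) {
        this.windowSize = windowSize;
    }

    // Adds a row; once the window overflows, the row with the lowest index is flushed.
    void createRow(int rowNum) {
        inMemory.addLast(rowNum);
        if (inMemory.size() > windowSize) {
            flushed.add(inMemory.removeFirst());
        }
    }

    boolean isAccessible(int rowNum) {
        return inMemory.contains(rowNum);
    }

    int inMemoryCount() {
        return inMemory.size();
    }

    int flushedCount() {
        return flushed.size();
    }
}
```

With a window of 100, creating rows 0 to 999 in order leaves rows 900 to 999 accessible and 900 rows flushed, which mirrors the comments in the example above.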

Removing a row from an Excel sheet with Apache POI HSSF

I'm using the Apache POI HSSF library to import info into my application. The problem is that the files have some extra/empty rows that need to be removed before parsing.
There is no HSSFSheet.removeRow( int rowNum ) method, only removeRow( HSSFRow row ). The problem with this is that empty rows can't be removed. For example:
sheet.removeRow( sheet.getRow(rowNum) );
gives a NullPointerException on empty rows because getRow() returns null.
Also, as I read on forums, removeRow() only erases the cell contents but the row is still there as an empty row.
Is there a way of removing rows (empty or not) without creating a whole new sheet without the rows that I want to remove?

/**
* Remove a row by its index
* @param sheet an Excel sheet
* @param rowIndex 0-based index of the row to remove
*/
public static void removeRow(HSSFSheet sheet, int rowIndex) {
int lastRowNum=sheet.getLastRowNum();
if(rowIndex>=0&&rowIndex<lastRowNum){
sheet.shiftRows(rowIndex+1,lastRowNum, -1);
}
if(rowIndex==lastRowNum){
HSSFRow removingRow=sheet.getRow(rowIndex);
if(removingRow!=null){
sheet.removeRow(removingRow);
}
}
}

I know this is a 3-year-old question, but I had to solve the same problem recently, and I had to do it in C#. Here is the function I'm using with NPOI on .NET 4.0:
public static void DeleteRow(this ISheet sheet, IRow row)
{
sheet.RemoveRow(row); // this only deletes all the cell values
int rowIndex = row.RowNum;
int lastRowNum = sheet.LastRowNum;
if (rowIndex >= 0 && rowIndex < lastRowNum)
{
sheet.ShiftRows(rowIndex + 1, lastRowNum, -1);
}
}

Something along the lines of
int newrownum=0;
for (int i=0; i<=sheet.getLastRowNum(); i++) {
HSSFRow row=sheet.getRow(i);
if (row != null) row.setRowNum(newrownum++);
}
should do the trick.

The HSSFRow has a method called setRowNum(int rowIndex).
When you have to "delete" a row, you put its index in a List. Then, when you get to the next non-empty row, you take an index from that list, set it by calling setRowNum(), and remove the index from the list. (Or you can use a queue.)
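The bookkeeping described above can be sketched on a plain array, where null stands for a row that is empty or marked for deletion. RowCompactor and compact are hypothetical names, and the array assignment stands in for the setRowNum() call the answer mentions; whether that call alone relocates a row inside an HSSF sheet is not verified here. One detail the description leaves implicit is that the slot a row moves out of becomes free as well, so it is queued too.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the queue-based compaction; not part of Apache POI.
final class RowCompactor {

    // Returns a copy of rows with all non-null entries moved to the front, order preserved.
    static String[] compact(String[] rows) {
        String[] result = rows.clone();
        Deque<Integer> freeIndexes = new ArrayDeque<>(); // indexes of empty slots, ascending
        for (int i = 0; i < result.length; i++) {
            if (result[i] == null) {
                freeIndexes.addLast(i);              // "deleted" row: remember its index
            } else if (!freeIndexes.isEmpty()) {
                int target = freeIndexes.removeFirst();
                result[target] = result[i];          // stand-in for row.setRowNum(target)
                result[i] = null;
                freeIndexes.addLast(i);              // the vacated slot is free now
            }
        }
        return result;
    }
}
```

Because the vacated index is always larger than every index already queued, the queue stays sorted and each row moves to the lowest free slot in a single pass.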

My special case (it worked for me):
//Repeat several times so that all the rows without units get deleted
for (int j=0;j<7;j++) {
//Walk through all the rows to delete lines without units (and look for the TOTAL row)
for (int i=1;i<sheet.getLastRowNum();i++) {
//Start on the 2nd row, ignoring the first one
row = sheet.getRow(i);
cell = row.getCell(garMACode);
if (cell != null)
{
//Ignore empty rows (they have a "." on first column)
if (cell.getStringCellValue().compareTo(".") != 0) {
if (cell.getStringCellValue().compareTo("TOTAL") == 0) {
cell = row.getCell(garMAUnits+1);
cell.setCellType(HSSFCell.CELL_TYPE_FORMULA);
cell.setCellFormula("SUM(BB1" + ":BB" + (i - 1) + ")");
} else {
cell = row.getCell(garMAUnits);
if (cell != null) {
int valor = (int)(cell.getNumericCellValue());
if (valor == 0 ) {
//sheet.removeRow(row);
removeRow(sheet,i);
}
}
}
}
}
}
}

This answer is an extension of AndreAY's answer, giving you a complete function for deleting a row.
public boolean deleteRow(String sheetName, String excelPath, int rowNo) throws IOException {
XSSFWorkbook workbook = null;
XSSFSheet sheet = null;
try {
FileInputStream file = new FileInputStream(new File(excelPath));
workbook = new XSSFWorkbook(file);
sheet = workbook.getSheet(sheetName);
if (sheet == null) {
return false;
}
int lastRowNum = sheet.getLastRowNum();
if (rowNo >= 0 && rowNo < lastRowNum) {
sheet.shiftRows(rowNo + 1, lastRowNum, -1);
}
if (rowNo == lastRowNum) {
XSSFRow removingRow=sheet.getRow(rowNo);
if(removingRow != null) {
sheet.removeRow(removingRow);
}
}
file.close();
FileOutputStream outFile = new FileOutputStream(new File(excelPath));
workbook.write(outFile);
outFile.close();
} catch(Exception e) {
throw e;
} finally {
if(workbook != null)
workbook.close();
}
return true;
}

I'm trying to reach back into the depths of my brain for my POI-related experience from a year or two ago, but my first question would be: why do the rows need to be removed before parsing? Why don't you just catch the null result from the sheet.getRow(rowNum) call and move on?
