How to read a large Excel file (xlsx) using Java

This code can read small Excel files, but it fails on large ones with the error shown after the code. How should I modify it?
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
/**
 * @author Administrator
 */
public class ReadExcelNdArray {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        File myFile = new File("D://Raghulpr/Transaction Data.xlsx");
        FileInputStream fis = new FileInputStream(myFile);
        // Finds the workbook instance for XLSX file (loads the whole file into memory)
        XSSFWorkbook myWorkBook = new XSSFWorkbook(fis);
        // Return first sheet from the XLSX workbook
        XSSFSheet mySheet = myWorkBook.getSheetAt(0);
        // Get iterator to all the rows in current sheet
        Iterator<Row> rowIterator = mySheet.iterator();
        // Traversing over each row of XLSX file
        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            // For each row, iterate through each column
            Iterator<Cell> cellIterator = row.cellIterator();
            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                switch (cell.getCellType()) {
                    case Cell.CELL_TYPE_STRING:
                        System.out.print(cell.getStringCellValue() + "\t");
                        break;
                    case Cell.CELL_TYPE_NUMERIC:
                        System.out.print(cell.getNumericCellValue() + "\t");
                        break;
                    case Cell.CELL_TYPE_BOOLEAN:
                        System.out.print(cell.getBooleanCellValue() + "\t");
                        break;
                    default:
                }
            }
            System.out.println("");
        }
        fis.close();
        // Report the elapsed time only after the sheet has been read
        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }
}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:121)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:88)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:272)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:37)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:254)
at readexcelndarray.ReadExcelNdArray.main(ReadExcelNdArray.java:36)

I don't know if you still need an answer to this, but I was looking for the same thing and struggled to read a large file. After a lot of searching I found one solution, the Excel Streaming Reader library:
import com.monitorjbl.xlsx.StreamingReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Workbook;
// rowCacheSize: rows kept in memory at a time; bufferSize: bytes buffered while reading the stream
InputStream is = new FileInputStream(new File("G:\\Book1.xlsx"));
Workbook workbook = StreamingReader.builder()
    .rowCacheSize(100)
    .bufferSize(4096)
    .open(is);
You can then use workbook to process the file further.
With this I was able to process an xlsx file with more than 400,000 (4 lakh) records.

First, close every input/output stream object (FileInputStream and so on) in your code. Second, you can increase the JVM heap size, as described in the linked question "Increase heap size in Java".
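Both suggestions can be sketched with the standard library alone. The example below is only an illustration, not the asker's code: HeapAndStreams, maxHeapMiB and countBytes are hypothetical names. It reports the heap ceiling the JVM was started with (raised with the standard -Xmx option, for example java -Xmx2g ...) and uses try-with-resources so that the stream is closed even when reading throws.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HeapAndStreams {

    // Maximum heap the JVM will attempt to use, in MiB (controlled by -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // Reads a file in fixed-size chunks and returns its length in bytes.
    // try-with-resources closes the stream on both normal and exceptional exit.
    static long countBytes(Path file) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (args.length > 0) {
            System.out.println("Bytes read: " + countBytes(Paths.get(args[0])));
        }
    }
}
```

Closing streams releases file handles, but it probably will not shrink what XSSFWorkbook keeps in memory, so for very large files a bigger heap or a streaming reader is still likely to be needed.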

The JXL API can also read and write Excel files. Its limitation is that it handles at most 65,536 rows (row indices 0 to 65,535), which is the limit of the older .xls format, but it is really flexible.
Since your file has more rows than that, I would suggest Apache POI, which supports .xlsx sheets of up to 1,048,576 rows.

You need to increase the heap size in order to read large files. I suggest using a 64-bit machine.

I had the same problem. If you switch to the much lower-level SAX parsing instead, you will save a lot of memory: http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
I think I reduced memory usage from about 4.5 GB(!) for a file of about 11 MB with a lot of formulas down to something far more manageable (I don't remember exactly, but it was reduced by at least a factor of 10, low enough that it no longer mattered).
It is harder to implement, but worth the time if you need to reduce the memory footprint.
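To illustrate why event-driven parsing needs so little memory, here is a standard-library-only sketch. It is not POI's XSSF SAX API (that lives in XSSFReader and is described at the link above); it only relies on the fact that an .xlsx file is a ZIP container of XML parts. SheetRowCounter and countRows are hypothetical names, and the part name xl/worksheets/sheet1.xml is an assumption that holds for many files but is not guaranteed. The handler also assumes the row elements carry no namespace prefix.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SheetRowCounter {

    // Handler that counts <row> elements and keeps nothing else in memory.
    private static final class RowHandler extends DefaultHandler {
        long rows;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("row".equals(qName)) {
                rows++;
            }
        }
    }

    // Streams the given worksheet part out of the .xlsx ZIP container and
    // returns the number of rows, or -1 if the part is not present.
    static long countRows(InputStream xlsx, String sheetPart)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(sheetPart)) {
                    RowHandler handler = new RowHandler();
                    SAXParserFactory.newInstance().newSAXParser().parse(zip, handler);
                    return handler.rows;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            // Assumed part name of the first sheet; real files may differ.
            System.out.println(countRows(in, "xl/worksheets/sheet1.xml"));
        }
    }
}
```

Because each row is seen once and then discarded, memory use stays roughly constant regardless of sheet size. Reading actual cell values additionally requires resolving the shared strings table, which POI's event API helps with.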

Related

PackagePropertiesMarshaller$NamespaceImpl not found using Apache poi with Java Servlet

I've been trying to build my first web application using IntelliJ and Tomcat, and one of the tasks is to upload and process an Excel file. I looked online and found the Apache POI library, which can parse Excel files. But after I downloaded all the required jars, pasted in some test code, and started the server, the web page showed an HTTP status 500 error with the root cause: java.lang.ClassNotFoundException: org.apache.poi.openxml4j.opc.internal.marshallers.PackagePropertiesMarshaller$NamespaceImpl.
I've run into the same problem with other jars, and each time it was solved by putting the corresponding jar inside Tomcat's lib folder, except for this one.
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
public class ExcelParser {
    private String pathname;
    public ExcelParser(String pathname) {
        this.pathname = pathname;
    }
    public void parse() {
        try {
            FileInputStream file = new FileInputStream(new File("/Users/JohnDoe/Desktop/test.xlsx"));
            // Create Workbook instance holding reference to .xlsx file
            XSSFWorkbook workbook = new XSSFWorkbook(file);
            // Get first/desired sheet from the workbook
            XSSFSheet sheet = workbook.getSheetAt(0);
            // Iterate through the rows one by one
            Iterator<Row> rowIterator = sheet.iterator();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                // For each row, iterate through all the columns
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    // Check the cell type and format accordingly
                    switch (cell.getCellType()) {
                        case NUMERIC:
                            System.out.print(cell.getNumericCellValue() + "\t");
                            break;
                        case STRING:
                            System.out.print(cell.getStringCellValue() + "\t");
                            break;
                    }
                }
                System.out.println();
            }
            file.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I'm just testing Excel parsing, so don't worry about the unused pathname.
By the way, I can see that this (inner) class is declared in poi-ooxml-4.1.0.jar, which is also in my Tomcat lib folder.
Any ideas about why this is happening and how to fix it are appreciated.

To use Apache POI, you need the following jar files.
poi-ooxml-4.1.0.jar
poi-ooxml-schemas-4.1.0.jar
xmlbeans-3.1.0.jar
commons-compress-1.18.jar
curvesapi-1.06.jar
poi-4.1.0.jar
commons-codec-1.12.jar
commons-collections4-4.3.jar
commons-math3-3.6.1.jar
You can also refer to my answer to a related question: "Unable to read Excel using Apache POI".

I think I missed something when moving the jars to the lib directory: after I removed the original files and reran the cp command, everything works. I'm closing the question with this answer. Thanks for the help!

Apache poi - Remove first row in excel and save it to the same file

Hello, I need help with this. I have tried about 30 tutorials over the last few hours and still don't know how to solve it:
Open the Excel file
Delete the first row (row "A"), so that the following rows (B, C, D, ...) move up to replace it
Write the result back to the same Excel file (if the program crashes under heavy load, I need the last state stored so that I can restart without searching for and deleting the rows already used)
OPCPackage fileInputStream = OPCPackage.open(new File("input.xlsx"));
XSSFWorkbook workbook = new XSSFWorkbook(fileInputStream);
XSSFSheet worksheet = workbook.getSheetAt(0);
worksheet.shiftRows(0, 0, 1);
workbook.write(new FileOutputStream("input.xlsx"));
This code doesn't remove row A and doesn't save the file to the same location.
Could anybody help me, please?
Thank you, FJ

First problem in your code:
worksheet.shiftRows(0, 0, 1); shifts the first row one row downwards. To remove the first row, the second through last rows must be shifted one row upwards instead: worksheet.shiftRows(1, worksheet.getLastRowNum(), -1);
Second problem in your code:
If a File (or an OPCPackage opened from a File, as in the question) is used to create a Workbook, the workbook cannot be written back into the same file, because that file stays open until the workbook is closed. So do not open the workbook from a File here; use a FileInputStream instead.
Working example:
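import org.apache.poi.ss.usermodel.*;
import java.io.FileInputStream;
import java.io.FileOutputStream;
class ReadExcelRemoveRowAndWrite {
    public static void main(String[] args) throws Exception {
        Workbook workbook;
        // Read the file through a stream and close it, so no handle on input.xlsx stays open
        try (FileInputStream in = new FileInputStream("input.xlsx")) {
            workbook = WorkbookFactory.create(in);
        }
        Sheet worksheet = workbook.getSheetAt(0);
        // Shift the second through last rows up by one, overwriting the first row
        worksheet.shiftRows(1, worksheet.getLastRowNum(), -1);
        try (FileOutputStream out = new FileOutputStream("input.xlsx")) {
            workbook.write(out);
        }
        workbook.close();
    }
}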
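The question also asks that the file survive a crash. Writing straight over input.xlsx can leave a truncated file if the process dies mid-write. One common way to reduce that risk, which is not part of the answer above, is to write to a temporary file in the same directory and then move it over the original. The sketch below uses only the standard library; SafeRewrite, ContentWriter and replace are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SafeRewrite {

    // Callback that writes the new file content to the given stream.
    interface ContentWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    // Writes the content to a temp file next to the target, then replaces the target.
    // If writing fails, the original file is left untouched.
    static void replace(Path target, ContentWriter writer) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite-", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                writer.writeTo(out);
            }
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failed write.
            Files.deleteIfExists(temp);
        }
    }
}
```

With POI this might be called as SafeRewrite.replace(Paths.get("input.xlsx"), workbook::write), assuming the workbook was loaded from a stream that has already been closed.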

How to read xlsx files sequentially

I have a large xlsx file (74 MB). I have found a way to read it in. Here is my source code so far.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
private static void readXLSX(String path) throws IOException {
    File myFile = new File(path);
    FileInputStream fis = new FileInputStream(myFile);
    // Finds the workbook instance for XLSX file
    XSSFWorkbook myWorkBook = new XSSFWorkbook(fis);
    // Return first sheet from the XLSX workbook
    XSSFSheet mySheet = myWorkBook.getSheetAt(0);
    // Get iterator to all the rows in current sheet
    Iterator<Row> rowIterator = mySheet.iterator();
    // Traversing over each row of XLSX file
    while (rowIterator.hasNext()) {
        Row row = rowIterator.next();
        // For each row, iterate through each column
        Iterator<Cell> cellIterator = row.cellIterator();
        while (cellIterator.hasNext()) {
            Cell cell = cellIterator.next();
            switch (cell.getCellType()) {
                case Cell.CELL_TYPE_STRING:
                    System.out.print(cell.getStringCellValue() + "\t");
                    break;
                case Cell.CELL_TYPE_NUMERIC:
                    System.out.print(cell.getNumericCellValue() + "\t");
                    break;
                case Cell.CELL_TYPE_BOOLEAN:
                    System.out.print(cell.getBooleanCellValue() + "\t");
                    break;
                default:
            }
        }
        System.out.println("");
    }
}
The problem is that my 8 GB of RAM doesn't seem to be sufficient, even with swapping and an increased JVM heap.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Do you have any idea why this code is so inefficient? Or do you know how to read this file sequentially and buffer the temporary rows in a less memory-consuming way?
Thanks in advance

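The XSSF usermodel of POI loads the whole workbook into memory, which is known to cause memory issues with large files. A streaming alternative avoids running out of memory.
In short: SXSSFWorkbook is the streaming counterpart of XSSFWorkbook, but it only covers writing. For reading a large file sequentially, POI's streaming option is the SAX-based XSSF event API (XSSFReader).
See the POI API documentation for details.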

Avoid formula injection, keeping cell value (quote prefix in HSSF/*.xls)

The application I am working on creates Excel exports using Apache POI. It was brought to our attention, through a security audit, that cells containing malicious values can spawn arbitrary processes if the user is not careful enough.
To reproduce, run the following:
import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
public class BadWorkbookCreator {
    public static void main(String[] args) throws Exception {
        try (
            Workbook wb = new HSSFWorkbook();
            FileOutputStream fos = new FileOutputStream("C:/workbook-bad.xls")
        ) {
            Sheet sheet = wb.createSheet("Sheet");
            Row row = sheet.createRow(0);
            row.createCell(0).setCellValue("Aaaaaaaaaa");
            row.createCell(1).setCellValue("-2+3 +cmd|'/C calc'!G20");
            wb.write(fos);
        }
    }
}
Then open the resulting file:
And follow these steps:
Click on (A) to select the cell with malicious content
Click on (B) so that the cursor is in the formula editor
Press ENTER
You will be asked if you allow Excel to run an external application; if you answer yes, Calc is launched (or any malicious code)
One may say that the user is responsible for letting Excel run arbitrary things and that the user was warned. But the Excel file is downloaded from a trusted source, and someone may still fall into the trap.
Using Excel, you can place a single quote in front of the text in the formula editor to escape it. Placing the single quote in the cell content programmatically (e.g. code as below) makes the single quote visible!
String cellValue = cell.getStringCellValue();
// The isEmpty() check keeps charAt(0) from throwing on an empty string
if (cellValue != null && !cellValue.isEmpty() && "=-+#".indexOf(cellValue.charAt(0)) >= 0) {
    cell.setCellValue("'" + cellValue);
}
The question: Is there a way to keep the value escaped in the formula editor, but show the correct value, without the leading single quote, in the cell?

Thanks to the investigative work of Axel Richter and Nikos Paraskevopoulos:
From Apache POI 3.16 beta 1 onwards (or for those who live dangerously, any nightly build after 20161105), there are handy methods on CellStyle for getQuotePrefixed and setQuotePrefixed(boolean)
Your code could then become:
// Do this once for the workbook
CellStyle safeFormulaStyle = workbook.createCellStyle();
safeFormulaStyle.setQuotePrefixed(true);
// Per cell; the isEmpty() check keeps charAt(0) from throwing on an empty string
String cellValue = cell.getStringCellValue();
if (cellValue != null && !cellValue.isEmpty() && "=-+#".indexOf(cellValue.charAt(0)) >= 0) {
    cell.setCellStyle(safeFormulaStyle);
}
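The prefix check can be pulled into a small helper that is easy to test on its own and that guards against empty strings, for which charAt(0) would throw StringIndexOutOfBoundsException. This is a sketch only: FormulaGuard and needsQuotePrefix are hypothetical names, and the trigger characters are simply the ones used in the snippet above, not an authoritative list.

```java
final class FormulaGuard {

    // Trigger characters taken from the snippet above; adjust to your own policy.
    private static final String TRIGGERS = "=-+#";

    // True when the value is non-empty and starts with a trigger character.
    static boolean needsQuotePrefix(String value) {
        return value != null
                && !value.isEmpty()
                && TRIGGERS.indexOf(value.charAt(0)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(needsQuotePrefix("-2+3 +cmd|'/C calc'!G20")); // true
        System.out.println(needsQuotePrefix("Aaaaaaaaaa"));              // false
        System.out.println(needsQuotePrefix(""));                        // false
    }
}
```

In the per-cell code, the condition would then read if (FormulaGuard.needsQuotePrefix(cellValue)) before applying the quote-prefixed style.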

Thanks to the instant (kudos) response from the POI team (see the accepted answer), this solution should be obsolete. I am keeping it as a reference, since it could be useful where an upgrade to POI >= 3.16 is not possible.
Thanks to the comment of Axel Richter (for which I am very thankful), I managed to work out a solution. It is definitely NOT as straightforward as for XLSX files (XSSFWorkbook), because it involves creating the org.apache.poi.hssf.model.InternalWorkbook by hand; this class is marked as @Internal by the POI project, but it is public as far as Java is concerned. Additionally, the field that is set to correct the problem, via ExtendedFormatRecord.set123Prefix(true), is not documented!
Here is the solution, for what it's worth. Compare it with the code in the question:
import java.io.FileOutputStream;
import org.apache.poi.hssf.model.InternalWorkbook;
import org.apache.poi.hssf.record.ExtendedFormatRecord;
import org.apache.poi.hssf.usermodel.HSSFCellStyle;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
public class GoodWorkbookCreator {
    public static void main(String[] args) throws Exception {
        InternalWorkbook internalWorkbook = InternalWorkbook.createWorkbook();
        try (
            HSSFWorkbook wb = HSSFWorkbook.create(internalWorkbook);
            FileOutputStream fos = new FileOutputStream("C:/workbook-good.xls")
        ) {
            HSSFCellStyle style = (HSSFCellStyle) wb.createCellStyle();
            ExtendedFormatRecord xfr = internalWorkbook.getExFormatAt(internalWorkbook.getNumExFormats() - 1);
            xfr.set123Prefix(true); // THIS IS WHAT IT IS ALL ABOUT
            Sheet sheet = wb.createSheet("Sheet");
            Row row = sheet.createRow(0);
            row.createCell(0).setCellValue("Aaaaaaaaaa");
            row.createCell(1).setCellValue("-2+3 +cmd|'/C calc'!G20");
            Cell cell = row.createCell(2);
            cell.setCellValue("-2+3 +cmd|'/C calc'!G20");
            cell.setCellStyle(style);
            wb.write(fos);
        }
    }
}

How to write a very large number in xlsx via apache poi

I need to write very large numbers (>91430000000000000000) to an Excel cell.
The issue is that the maximum value for a cell appears to be 9143018315613270000, and every larger value is replaced by that maximum.
This is easily solved by hand by adding an apostrophe in front of the number, for example '9143018315313276189
But how do I do the same trick via Apache POI? I have the following code:
attrId.setCellValue(new XSSFRichTextString('\'' + value.getId().toString()));
But it doesn't work:
In the resulting sheet, the first row has no apostrophe at all, the second was typed by hand and is the result I'm looking for, and the third is the result of my code. I also tried the setCellValue overloads that take a double and a String; neither helped.
So here is the question: how do I write very large numbers to Excel via Apache POI?

Set the cell style first
DataFormat format = workbook.createDataFormat();
CellStyle testStyle = workbook.createCellStyle();
testStyle.setDataFormat(format.getFormat("#"));
String bigNumber = "9143018315313276189";
row.createCell(40).setCellStyle(testStyle);
row.getCell(40).setCellValue(bigNumber);
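The behaviour described in the question is consistent with how numeric cells work: Excel stores numbers as IEEE 754 doubles and shows at most 15 significant digits, and POI's numeric setCellValue overload takes a double. A 19-digit identifier therefore cannot be kept exactly as a number, which is why the value has to be written as text. The standard-library sketch below (BigNumberPrecision and fitsInDouble are hypothetical names) checks whether a long survives conversion to double.

```java
import java.math.BigDecimal;

final class BigNumberPrecision {

    // True if converting the long to a double does not change its value.
    static boolean fitsInDouble(long value) {
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }

    public static void main(String[] args) {
        long id = 9143018315313276189L;
        // false: doubles of this magnitude are spaced 1024 apart
        System.out.println(fitsInDouble(id));
        // Shows the nearest double that would actually be stored
        System.out.println(new BigDecimal((double) id).toPlainString());
        // true: every integer up to 2^53 is exactly representable
        System.out.println(fitsInDouble(1L << 53));
    }
}
```

A check like this could decide per value whether to write a numeric cell or fall back to a string cell.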

Can you set the cell type and see what happens? If you have already done that, please post your code so that others can look at it.
cell.setCellType(Cell.CELL_TYPE_STRING);
For details on setting a string value on a cell, please refer to the question "How can I read numeric strings in Excel cells as string (not numbers) with Apache POI?"
The following sample worked for me (poi-3.1.3):
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
public class WriteToExcel {
    public static void main(String[] args) throws IOException {
        HSSFWorkbook workbook = new HSSFWorkbook();
        HSSFSheet sheet = workbook.createSheet("Sample sheet");
        Row row = sheet.createRow(0);
        Cell cell = row.createCell(0);
        // Store the digits as text so that they are not converted to a double
        cell.setCellType(Cell.CELL_TYPE_STRING);
        cell.setCellValue("91430183153132761893333");
        // try-with-resources closes the stream even if write() throws
        try (FileOutputStream out =
                new FileOutputStream(new File("C:\\test_stackoverflow\\new.xls"))) {
            workbook.write(out);
            System.out.println("Excel written successfully..");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
