java read part of large xlsx file

What is the fastest and least memory-intensive way to read a portion of a very large xlsx file?
Currently I have this code:
FileInputStream fis = new FileInputStream("D:/verylargefile.xlsx");
XSSFWorkbook workbook = new XSSFWorkbook(fis);
XSSFSheet sheet = workbook.getSheetAt(0);
int r = sheet.getPhysicalNumberOfRows();
int c = sheet.getRow(1).getLastCellNum();
for (int row = 1; row < r; row++) {
    for (int cell = 1; cell < c; cell++) {
        int cellvalue = (int) sheet.getRow(row).getCell(cell).getNumericCellValue();
        // do some simple math op with that cell or several cells
    }
}
So I need to do a very large number of those simple math operations (for example, the average of every 5 cells in every row, or something similar), and very fast, with a small part of a very large xlsx file at once. With the code above I am getting a heap space error with a 10 MB xlsx file and 1 GB of RAM dedicated to the Java VM (-Xms1000M).
Thank you
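For what it's worth, the math part is cheap once the values are in memory; the cost is entirely in how the sheet is read. A minimal sketch of the per-row averaging, assuming the slice of cell values you care about has already been pulled into a double[][] (all names here are illustrative):

```java
// Sketch of the "average of every 5 cells in every row" step, assuming the
// relevant slice of the sheet has already been read into a double[][].
public class RowAverages {
    // Returns, per row, the average of each complete group of 5 cells.
    static double[][] averagesOfFive(double[][] rows) {
        double[][] out = new double[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            int groups = rows[r].length / 5;
            out[r] = new double[groups];
            for (int g = 0; g < groups; g++) {
                double sum = 0;
                for (int k = 0; k < 5; k++) {
                    sum += rows[r][5 * g + k];
                }
                out[r][g] = sum / 5.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] rows = { { 1, 2, 3, 4, 5, 10, 10, 10, 10, 10 } };
        double[][] avg = averagesOfFive(rows);
        System.out.println(avg[0][0]); // 3.0
        System.out.println(avg[0][1]); // 10.0
    }
}
```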

Related

Apache POI only read the first row of a large Excel file

Can I read only the first row of an Excel file with Apache POI?
I don't want to read the whole file because it has 50,000 rows and reading takes up to 10 minutes (the performance is a disaster). I am getting the bytes via file upload, so my options are a byte array or an InputStream. Right now I am doing this:
Workbook workbook = new XSSFWorkbook(excelByteStream); //This line takes a lot of time, while debugging up to 10 minutes
Sheet firstSheet = workbook.getSheetAt(0);
DataFormatter df = new DataFormatter();
List<ColumnPanel> columnPanels = new ArrayList<>();
int i = 0;
for (Cell cell : firstSheet.getRow(0))
{
columnPanels.add(new ColumnPanel(df.formatCellValue(cell), i++));
}
You can use this lib:
https://github.com/monitorjbl/excel-streaming-reader
InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)  // number of rows to keep in memory at any time
        .bufferSize(4096)   // buffer size used when reading the InputStream to a file
        .open(is);
Sheet sheet = workbook.getSheetAt(0);
Row firstRow = sheet.rowIterator().next();
Good guide:
Apache POI Streaming (SXSSF) for Reading

Repeatedly writing in same Excel file with Apache POI

I am using the Java Apache POI library. I have to write data to an Excel file in chunks: it is in my application's scope that I cannot write the whole data at once, so a batch size is fixed and data is written in batches (chunks). I am using the following code.
XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.createSheet("sheet");
int rowNum = startIndex;
Row excelRow = sheet.createRow(rowNum++);
int colNum = 1;
// Placing matrix results in rest of rows. Also keywords in first column.
for(int rowIndex = 0; rowIndex < keywords.size(); rowIndex++) {
excelRow = sheet.createRow(rowNum++);
colNum = 0;
Cell cell = excelRow.createCell(colNum);
cell.setCellValue(keywords.get(rowIndex));
colNum++;
for(int colIndex = 0; colIndex < scoreResults[rowIndex].length; colIndex++) {
cell = excelRow.createCell(colNum);
cell.setCellValue(scoreResults[rowIndex][colIndex]);
colNum++;
}
}
FileOutputStream outputStream = new FileOutputStream(outputExcelFileName,true);
workbook.write(outputStream);
outputStream.close();
workbook.close();
This is written in a function and I have to call that function again and again. If my element count equals my batch size, there is no issue: the file is created and opens successfully. The problem comes when, say, the batch size is 10 and I have 15 elements. Then the 2nd iteration does not happen successfully. I am not getting any error at run time, but when I open the Excel file, MS Excel (2010) reports this error:
Excel found unreadable content in 'file_name'. Do you want to recover the contents of this workbook?
If I click "Yes", it recovers the contents of the first iteration only; if the batch size is 10, then only 10 elements are recovered. So the issue appears after the 1st iteration.
I have spent a lot of time figuring out this issue but am still unable to resolve it. If someone can help, I will be thankful.
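The corruption here is predictable: an .xlsx file is a ZIP archive, and new FileOutputStream(outputExcelFileName, true) makes each workbook.write append a second complete archive after the first instead of merging them. A stdlib-only sketch of the effect (no POI involved; file names are illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

// Demonstrates why appending to an existing .xlsx corrupts it: an xlsx is a
// ZIP archive, and append mode just concatenates a second archive after the
// first. A sequential reader only ever sees the first archive's entries.
public class ZipAppendDemo {
    static void writeZip(File f, boolean append, String entryName) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f, append))) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("data".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
    }

    static List<String> listEntries(File f) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(f))) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("batch", ".zip");
        writeZip(f, false, "batch1.txt"); // first "iteration"
        writeZip(f, true, "batch2.txt");  // second "iteration", append mode
        System.out.println(listEntries(f)); // only batch1.txt is reachable
        f.delete();
    }
}
```

The usual fixes are to keep a single workbook in memory and write it once after the last batch, or to re-open the existing file each batch (e.g. with WorkbookFactory.create), append the new rows, and rewrite the whole file, always with append set to false.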

Reading excel files .xlsx via Java

So my Excel file is relatively small in size. It contains 8 sheets. Each sheet has "records" of data which I need to read. Each sheet also has its first row reserved for headers, which I skip, so my data begins from the 2nd row (1st index) of each sheet and ends on the last record.
Below is my code to iterate through the sheets and read each row; however, it fails to read each sheet, and I can't seem to figure out why. Please have a look; any suggestions will be appreciated.
Thanks!
FileInputStream fis = new FileInputStream(new File(filePath));
XSSFWorkbook wb = new XSSFWorkbook(fis);
DataFormatter formatter = new DataFormatter();
//iterate over sheets
for (int i=0; i<NUM_OF_SHEETS; i++) {
sheet = wb.getSheetAt(i);
sheetName = sheet.getSheetName();
//iterate over rows
for (int j=1; j<=lastRow; j++) { //1st row or 0-index of each sheet is reserved for the headings which i do not need.
row = sheet.getRow(j);
if (row!=null) {
data[j-1][0] = sheetName; //1st column or 0th-index of each record in my 2d array is reserved for the sheet's name.
//iterate over cells
for (int k=0; k<NUM_OF_COLUMNS; k++) {
cell = row.getCell(k, XSSFRow.RETURN_BLANK_AS_NULL);
cellValue = formatter.formatCellValue(cell); //convert cell to type String
data[j-1][k+1] = cellValue;
}//end of cell iteration
}
}//end of row iteration
}//end of sheet iteration
wb.close();
fis.close();
At least there is one big logical error. Since you are putting the data of all sheets in one array, this array must be dimensioned like:
String[][] data = new String[lastRow*NUM_OF_SHEETS][NUM_OF_COLUMNS+1];
And then the allocations must be like:
...
data[(j-1)+(i*lastRow)][0] = sheetName; //1st column or 0th-index of each record in my 2d array is reserved for the sheet's name.
...
and
...
data[(j-1)+(i*lastRow)][k+1] = cellValue;
...
With your code, the allocations from second sheet will overwrite the ones from the first sheet, since j starts with 1 for every sheet.
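The remapping can be checked in isolation; a small sketch with illustrative dimensions (NUM_OF_SHEETS = 2, lastRow = 3):

```java
// Checks that (j - 1) + (i * lastRow) gives every (sheet, row) pair its own
// slot in the combined array, so later sheets no longer overwrite earlier ones.
public class IndexDemo {
    static int slot(int sheet, int row, int lastRow) {
        return (row - 1) + (sheet * lastRow); // row is 1-based, as in the question
    }

    public static void main(String[] args) {
        int numSheets = 2, lastRow = 3;
        String[] data = new String[numSheets * lastRow];
        for (int i = 0; i < numSheets; i++) {
            for (int j = 1; j <= lastRow; j++) {
                data[slot(i, j, lastRow)] = "sheet " + i + ", row " + j;
            }
        }
        for (String s : data) {
            System.out.println(s); // all six slots filled, none overwritten
        }
    }
}
```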

SXSSF: to where does it flush rows not in the window prior to output to file?

According to the SXSSF (Streaming Usermodel API) documentation:
SXSSF (package: org.apache.poi.xssf.streaming) is an API-compatible streaming extension of XSSF to be used when very large spreadsheets have to be produced, and heap space is limited. SXSSF achieves its low memory footprint by limiting access to the rows that are within a sliding window, while XSSF gives access to all rows in the document. Older rows that are no longer in the window become inaccessible, as they are written to the disk.
However, in the provided example the flush happens before the workbook is given the file location at which to write the file.
public static void main(String[] args) throws Throwable {
Workbook wb = new SXSSFWorkbook(100); // keep 100 rows in memory, exceeding rows will be flushed to disk
Sheet sh = wb.createSheet();
for(int rownum = 0; rownum < 1000; rownum++){
Row row = sh.createRow(rownum);
for(int cellnum = 0; cellnum < 10; cellnum++){
Cell cell = row.createCell(cellnum);
String address = new CellReference(cell).formatAsString();
cell.setCellValue(address);
}
}
// Rows with rownum < 900 are flushed and not accessible
for(int rownum = 0; rownum < 900; rownum++){
Assert.assertNull(sh.getRow(rownum));
}
// the last 100 rows are still in memory
for(int rownum = 900; rownum < 1000; rownum++){
Assert.assertNotNull(sh.getRow(rownum));
}
FileOutputStream out = new FileOutputStream("/temp/sxssf.xlsx");
wb.write(out);
out.close();
}
So this raises the questions:
Where on the file system is it storing the data?
Is it just creating a temp file in the default temp directory?
Is this safe for all / most implementations?
The class that does the buffering is SheetDataWriter in org.apache.poi.xssf.streaming.SXSSFSheet
The magic line you're probably interested in is:
_fd = File.createTempFile("poi-sxxsf-sheet", ".xml");
In terms of whether that is safe: probably, but not certainly... It's likely worth opening a bug in the POI Bugzilla and requesting it be switched to using org.apache.poi.util.TempFile, which allows a bit more control. In general though, as long as you specify a valid property for java.io.tmpdir (or the default is sensible for you), you should be fine.
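A quick way to see where those buffered sheet parts end up is to reproduce the createTempFile call yourself; it lands under whatever java.io.tmpdir points to (the prefix below just mirrors the one in the POI source):

```java
import java.io.File;
import java.io.IOException;

// Shows where File.createTempFile places its files: under the directory
// named by the java.io.tmpdir system property.
public class TempFileDemo {
    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        File f = File.createTempFile("poi-sxxsf-sheet", ".xml");
        System.out.println("buffer file looks like: " + f.getAbsolutePath());
        f.delete(); // SXSSF normally removes its buffer files when the workbook is disposed
    }
}
```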

Is there a limitation of the opencsv API or Apache POI API?

I am creating an Excel file on the basis of a CSV file. For reading the CSV file I am using the Opencsv API together with Apache POI. My CSV contains 65537 rows.
class Test {
public static void main(String[] args) throws IOException {
Workbook wb = new HSSFWorkbook();
CreationHelper helper = wb.getCreationHelper();
Sheet sheet = wb.createSheet("new sheet");
CSVReader reader = new CSVReader(new FileReader("SampleData.csv"));
String[] line;
int r = 0;int count=0;
while ((line = reader.readNext()) != null) {
Row row = sheet.createRow((short) r++);
count=count+1;
System.out.println("count-"+count);
for (int i = 0; i < line.length; i++)
row.createCell(i)
.setCellValue(helper.createRichTextString(line[i]));
}
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
}}
When I run this program it gives me the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid row number (-32768) outside allowable range (0..65535)
at org.apache.poi.hssf.usermodel.HSSFRow.setRowNum(HSSFRow.java:232)
at org.apache.poi.hssf.usermodel.HSSFRow.<init>(HSSFRow.java:86)
at org.apache.poi.hssf.usermodel.HSSFRow.<init>(HSSFRow.java:70)
at org.apache.poi.hssf.usermodel.HSSFSheet.createRow(HSSFSheet.java:205)
at org.apache.poi.hssf.usermodel.HSSFSheet.createRow(HSSFSheet.java:71)
at com.arosys.utilityclasses.Test.main(Test.java:23)
Java Result: 1
I tried to trace how many rows it supports, and found it only supports 32768; with fewer rows it works nicely and creates the Excel file.
Please help me sort out this problem: if my CSV contains 65536 rows, how am I able to write the Excel file (xls)?
Thanks
Why are you casting the row num to short in the following line
Row row = sheet.createRow((short) r++);
Leave it as int
Looks like it's your short value: the max value of a short is 32767, and you're trying to go one past that, so the cast wraps around to -32768.
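The wraparound is easy to reproduce; row numbers above Short.MAX_VALUE flip negative when cast:

```java
// Demonstrates the overflow behind the "Invalid row number (-32768)" error:
// row 32768 does not fit in a short, so the cast wraps to -32768.
public class ShortOverflowDemo {
    public static void main(String[] args) {
        int r = 32768;                       // first row past Short.MAX_VALUE
        short cast = (short) r;
        System.out.println(cast);            // -32768, the value in the exception
        System.out.println(Short.MAX_VALUE); // 32767
    }
}
```

Note that dropping the cast only gets you to HSSF's own ceiling of 65,536 rows (the 0..65535 range in the exception message); for more rows than that you would need the xlsx format (XSSFWorkbook) instead.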
