I am looking to write to an Excel (.xls, MS Excel 2003 format) file programmatically using Java. The output files may contain ~200,000 rows, which I plan to split across a number of sheets (64k rows per sheet, due to the Excel limit).
I have tried using the Apache POI APIs, but they seem to be a memory hog due to the API object model: I am forced to add cells and sheets to the workbook object in memory, and only once all the data is added can I write the workbook to a file! Here is a sample of how Apache recommends I write Excel files using their API:
import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;

Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");
// Create a row and put some cells in it
Row row = sheet.createRow(0);
// Create a cell and put a value in it
Cell cell = row.createCell(0);
cell.setCellValue(1);
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
Clearly, writing ~20k rows (with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space".
I have tried increasing the JVM initial and maximum heap sizes using the -Xms and -Xmx parameters, as -Xms512m and -Xmx1024m. I still can't write more than 150k rows to the file.
I am looking for a way to stream to an Excel file instead of building the entire file in memory before writing it to disk, which should save a lot of memory. Any alternative API or solution would be appreciated, but I am restricted to Java. Thanks! :)
Try the SXSSF workbook. It's a great thing for huge documents: it builds the document without eating RAM, because it keeps only a sliding window of rows in memory and flushes the rest to temporary files as it writes. Note that it produces the newer .xlsx format rather than .xls.
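For illustration, a minimal sketch of a streaming write with SXSSF (the class names are standard POI; the row and column counts are just placeholders). Since an .xlsx sheet holds up to 1,048,576 rows, the 64k-per-sheet splitting also becomes unnecessary:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingWriteDemo {
    public static void main(String[] args) throws Exception {
        // keep only 100 rows in memory; older rows are flushed to a temp file
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < 200000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 20; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue("r" + r + "c" + c);
            }
        }
        FileOutputStream out = new FileOutputStream("workbook.xlsx");
        wb.write(out);
        out.close();
        wb.dispose(); // delete the temporary files backing the stream
    }
}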
Most existing Java APIs try to build the whole document in RAM at once. Try writing an XML file which conforms to the new .xlsx file format instead. To get started, I suggest building a small file of the desired form in Excel and saving it. Then open it, examine the structure, and replace the parts you want.
Wikipedia has a good article about the overall format.
I had to split my output into several Excel files in order to overcome the heap space exception. I figured that around 5k rows with 22 columns was about the limit, so I made my logic end the file every 5k rows, start a new one, and number the files accordingly; see the sketch below.
In cases where I had 20k+ rows to write, I would end up with 4+ different files representing the data.
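A rough sketch of that splitting logic (HSSF here, with a made-up records list standing in for the real data source):

import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public class SplitWriter {
    // records is whatever your data source provides; 5000 rows per file as described above
    static void writeInChunks(List<String[]> records) throws Exception {
        int rowsPerFile = 5000;
        int fileIndex = 0;
        int rowInFile = 0;
        HSSFWorkbook wb = new HSSFWorkbook();
        HSSFSheet sheet = wb.createSheet("data");
        for (String[] record : records) {
            if (rowInFile == rowsPerFile) {
                save(wb, fileIndex++);          // flush the full workbook to disk
                wb = new HSSFWorkbook();        // and start a fresh one
                sheet = wb.createSheet("data");
                rowInFile = 0;
            }
            HSSFRow row = sheet.createRow(rowInFile++);
            for (int c = 0; c < record.length; c++) {
                row.createCell(c).setCellValue(record[c]);
            }
        }
        save(wb, fileIndex); // write the last, partially filled file
    }

    static void save(HSSFWorkbook wb, int index) throws Exception {
        FileOutputStream out = new FileOutputStream("report_" + index + ".xls");
        wb.write(out);
        out.close();
    }
}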
Have a look at the HSSF serializer from the Cocoon project. The HSSF serializer catches SAX events and creates a spreadsheet in the XLS format used by Microsoft Excel.
There is also JExcelApi, but it uses more memory. I think you should create a .csv file and open it in Excel. That lets you pass a lot of data, but you won't be able to do any "Excel magic".
Consider using the CSV format. That way you aren't limited by memory anymore; at most you need memory while preparing the data for the CSV, and even that can be done efficiently, for example by querying subsets of rows from the DB using LIMIT/OFFSET and immediately writing them to the file, instead of hauling the entire DB table contents into Java's memory before writing the first line. The Excel limit on the number of rows in one sheet also increases to about one million.
That said, if the data is actually coming from a DB, I would seriously reconsider whether Java is the right tool for this. Most decent DBs have an export-to-CSV function which can undoubtedly do the task much more efficiently. In the case of MySQL, for example, you can use the SELECT ... INTO OUTFILE statement for this.
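A sketch of the paged-query idea; the connection URL, table, and column names are made up, and the CSV quoting is deliberately naive:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
        PrintWriter csv = new PrintWriter(new FileWriter("export.csv"));
        PreparedStatement ps = con.prepareStatement("SELECT col1, col2 FROM mytable LIMIT ? OFFSET ?");
        int pageSize = 10000;
        int offset = 0;
        boolean more = true;
        while (more) {
            // fetch one page of rows and write it out immediately,
            // so only one page is ever held in memory
            ps.setInt(1, pageSize);
            ps.setInt(2, offset);
            ResultSet rs = ps.executeQuery();
            more = false;
            while (rs.next()) {
                more = true;
                csv.println(rs.getString(1) + "," + rs.getString(2)); // naive quoting
            }
            rs.close();
            offset += pageSize;
        }
        ps.close();
        con.close();
        csv.close();
    }
}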
We developed a Java library for this purpose, and it is currently available as an open source project: https://github.com/jbaliuka/x4j-analytic . We use it for operational reporting.
We generate huge Excel files; ~200,000 rows should work without problems, and Excel manages to open such files too.
Our code uses POI to load the template, but the generated content is streamed directly to the file, without an XML or object model layer in memory.
Does this memory issue happen when you insert data into cells, or when you perform data computation/generation?
If you are loading data into Excel files that follow a predefined static template format, it is better to save a template and reuse it multiple times. Template cases normally arise when you are generating something like a daily sales report.
Otherwise, you have to create every new row, border, column, etc. from scratch each time.
So far, Apache POI is the only choice I found.
"Clearly, writing ~20k rows(with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space"."
"Enterprise IT"
What you CAN do is perform batch data insertion. Create a queue-task table; every time you generate one page, rest for a few seconds, then continue with the next portion. If you are worried about the data changing while your queued task runs, you can first write the primary keys into the Excel file (hiding and locking that column from the user's view). The first run inserts the primary keys; from the second queued run onwards, read them back from the file and process the task portion by portion.
We did something quite similar, with the same amount of data, and we had to switch to JExcelApi because POI is so heavy on resources. Try JExcelApi; you won't regret it when you have to manipulate big Excel files!
Related
I'm working on a large Excel spreadsheet for a data mining project related to housing costs. There are multiple sheets in this file, each with 20-50 columns and around 20,000 rows.
For each sheet, I need to create two more sheets: one will contain a random sample of 10% of the rows, and the other will contain the 90% of rows not included in the sample. Are there any Excel commands or plugins to easily achieve this?
VBA Excel: You can start recording a macro in Excel, then perform the major steps manually while recording is on. Then stop recording and look at the macro.
You can then run the macro on the original file (the file as it was before your manual steps).
This macro may even work for you as-is if only some values in the tables change.
Usually, though, you have to look at the code of the macro and edit it to adapt things like the file name or the number of columns. If you are not sure about something, record just one manual step and see which VBA command is invoked. Or, as I said, really just change values in the sheet; the originally recorded macro might already be fine to run on the new values. You will find a lot of information on VBA Excel, and the start is: just record a macro.
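If you would rather stay in Java with POI, as elsewhere in this thread, a rough sketch of the 10%/90% split could look like the following; copyRowValues is a simplified helper that copies cell values only, not styles:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

public class SampleSplit {
    static void split(Workbook wb, Sheet source) {
        Sheet sample = wb.createSheet(source.getSheetName() + "_10pct");
        Sheet rest = wb.createSheet(source.getSheetName() + "_90pct");
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i <= source.getLastRowNum(); i++) indices.add(i);
        Collections.shuffle(indices);      // random order
        int cut = indices.size() / 10;     // first 10% go to the sample sheet
        for (int k = 0; k < indices.size(); k++) {
            Sheet target = (k < cut) ? sample : rest;
            copyRowValues(source.getRow(indices.get(k)), target);
        }
    }

    // appends the source row's values to the target sheet (values only, no styles)
    static void copyRowValues(Row src, Sheet target) {
        if (src == null) return;
        Row dst = target.createRow(target.getPhysicalNumberOfRows());
        for (Cell c : src) {
            dst.createCell(c.getColumnIndex()).setCellValue(c.toString());
        }
    }
}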
I'm working with Apache POI XSSFWorkbook to manipulate .xlsx files. My program works fine on small Excel files (60,000 rows), but when I started to test my code on a big file (700,000 rows) I ran into a memory problem, even on a computer with 16 GB of RAM.
Can anyone help with this issue? I read about the SAX parser, but I don't want to modify my code; moreover, I don't find it intuitive to use. It's not as simple as XSSF, which has straightforward methods to get cells, rows, etc.
Is there a way to keep my code as it is and solve the memory problem, or any solution apart from the SAX parser? Any help is appreciated, thanks.
From experience, SAX really helps a lot with memory usage: I went from 4 GB+ to around 300 MB.
Some useful links and other tips:
From https://poi.apache.org/spreadsheet/limitations.html
File sizes/Memory usage
There are some inherent limits in the Excel file formats. These are
defined in class SpreadsheetVersion. As long as you have enough
main-memory, you should be able to handle files up to these limits.
For huge files using the default POI classes you will likely need a
very large amount of memory.
There are ways to overcome the main-memory limitations if needed: For
writing very huge files, there is SXSSFWorkbook which allows to do a
streaming write of data out to files (with certain limitations on what
you can do as only parts of the file are held in memory). For reading
very huge files, take a look at the sample XLSX2CSV which shows how
you can read a file in streaming fashion (again with some limitations
on what information you can read out of the file, but there are ways
to get at most of it if necessary).
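For the reading side, here is a minimal sketch modeled on the XLSX2CSV example (it assumes POI 3.15 or later; POI 5 moved SAXHelper's role to XMLHelper, and older releases keep these classes in slightly different places):

import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open("big.xlsx");
        XSSFReader reader = new XSSFReader(pkg);
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
        SheetContentsHandler contents = new SheetContentsHandler() {
            public void startRow(int rowNum) { }
            public void endRow(int rowNum) { }
            public void cell(String ref, String value, XSSFComment comment) {
                // receives one cell at a time; the whole sheet is never in memory
                System.out.println(ref + " = " + value);
            }
            public void headerFooter(String text, boolean isHeader, String tagName) { }
        };
        XMLReader parser = SAXHelper.newXMLReader();
        parser.setContentHandler(new XSSFSheetXMLHandler(
                reader.getStylesTable(), strings, contents, false));
        Iterator<InputStream> sheets = reader.getSheetsData();
        while (sheets.hasNext()) {
            InputStream sheet = sheets.next();
            parser.parse(new InputSource(sheet)); // streams the sheet XML through SAX
            sheet.close();
        }
        pkg.close();
    }
}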
Also
https://poi.apache.org/faq.html#faq-N10165
I think POI is using too much memory! What can I do?
This one comes up quite a lot, but often the reason isn't what you might
initially think. So, the first thing to check is - what's the source
of the problem? Your file? Your code? Your environment? Or Apache POI?
(If you're here, you probably think it's Apache POI. However, it often
isn't! A moderate laptop, with a decent but not excessive heap size,
from a standing start, can normally read or write a file with 100
columns and 100,000 rows in under a couple of seconds, including the
time to start the JVM).
Apache POI ships with a few programs and a few example programs, which
can be used to do some basic performance checks. For testing file
generation, the class to use is in the examples package,
SSPerformanceTest (viewvc). Run SSPerformanceTest with arguments of
the writing type (HSSF, XSSF or SXSSF), the number of rows, the number of
columns, and if the file should be saved. If you can't run that with
50,000 rows and 50 columns in HSSF and SXSSF in under 3 seconds, and
XSSF in under 10 seconds (and ideally all 3 in less than that!), then
the problem is with your environment.
Next, use the example program ToCSV (viewvc) to try reading a file
in with HSSF or XSSF. Related is XLSX2CSV (viewvc), which uses SAX
parsing for .xlsx. Run this against both your problem file, and a
simple one generated by SSPerformanceTest of the same size. If this is
slow, then there could be an Apache POI problem with how the file is
being processed (POI makes some assumptions that might not always be
right on all files). If these tests are fast, then any performance
problems are in your code!
And
Files vs InputStreams http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file.
If using WorkbookFactory, it's very easy to use one or the other:
// Use a file
Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));
// Use an InputStream, needs more memory
Workbook wb = WorkbookFactory.create(new FileInputStream("MyExcel.xlsx"));
If using HSSFWorkbook or XSSFWorkbook directly, you should generally
go through NPOIFSFileSystem or OPCPackage, to have full control of the
lifecycle (including closing the file when done):
// HSSFWorkbook, File
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("file.xls"));
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
....
fs.close();
// HSSFWorkbook, InputStream, needs more memory
NPOIFSFileSystem fs = new NPOIFSFileSystem(myInputStream);
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
// XSSFWorkbook, File
OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
XSSFWorkbook wb = new XSSFWorkbook(pkg);
....
pkg.close();
// XSSFWorkbook, InputStream, needs more memory
OPCPackage pkg = OPCPackage.open(myInputStream);
XSSFWorkbook wb = new XSSFWorkbook(pkg);
....
pkg.close();
I have the following two requirements:
To read a CSV file and put the rows line by line into a database (RDBMS) without any data manipulation.
To read a CSV file and put the data into a database (RDBMS) where row Z might depend on row B, so I need a staging DB (in-memory, or another staging RDBMS).
I am analyzing multiple ways to accomplish this:
Using core Java, reading the file in a producer-consumer way.
Using Apache Camel and BeanIO to read the CSV file.
Using SQL to read the file.
I wanted to know: is there an industry-preferred way to do this kind of task?
I found a few links on Stack Overflow, but I am looking for more options:
How to read a large text file line by line using Java?
Read a huge file of numbers in Java in a memory-efficient way?
Read large CSV in java
I am using Java 6 for the implementation.
You should use the NIO package to do this kind of work on files in the GB range. NIO lets you read the file in chunks; you can then insert into the DB using batch commands rather than single-row insertions. Single insertions take a lot of CPU cycles and may cause OOM errors.
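A sketch of the chunk-plus-batch idea (plain java.io reading here, since Java 6 lacks the NIO.2 convenience methods; the connection URL, table, and columns are made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvToDb {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
        con.setAutoCommit(false);
        PreparedStatement ps = con.prepareStatement("INSERT INTO mytable (col1, col2) VALUES (?, ?)");
        BufferedReader in = new BufferedReader(new FileReader("data.csv"));
        int batchSize = 1000;
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split(","); // naive split; use a CSV library for quoted fields
            ps.setString(1, cols[0]);
            ps.setString(2, cols[1]);
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch(); // send 1000 rows in one round trip
                con.commit();
            }
        }
        ps.executeBatch(); // flush the remainder
        con.commit();
        in.close();
        ps.close();
        con.close();
    }
}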
We are using Apache Camel's file: component to read the file and process the data.
You can use RandomAccessFile for reading the CSV file; it gives you reasonable read speed and does not require any extra jar files. Here is the code:

import java.io.File;
import java.io.RandomAccessFile;

File f = new File(System.getProperty("user.home") + "/Desktop/CSVDOC1.csv");
RandomAccessFile ra = new RandomAccessFile(f, "r"); // open read-only
ra.seek(0); // start reading from the beginning of the file
String d = ra.readLine();
while (d != null) {
    // d holds one line, e.g. "col1","col2","col2","col3"
    // split the line on the separator "," and insert the row values into the database
    d = ra.readLine();
}
// release the file handle
ra.close();
I have an Excel file which contains multiple sheets, and I want to relate the sheets to each other. For example:
Here Master is the root table. I read its last column, and if a name there matches any tab name, I read that tab,
and at last I dump the data into some Java classes which represent these Excel sheets.
So when I use the data in code, it will be accessed like Master.Polygon.Cord, etc.
Please suggest a way to do this using POI.
You can use a combination of getSheetIndex(name) and getSheetAt(index) on the HSSFWorkbook class:
http://poi.apache.org/apidocs/index.html
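For example (assuming wb is an already-loaded HSSFWorkbook and "Polygon" is one of the tab names read from the master sheet's last column):

int idx = wb.getSheetIndex("Polygon"); // returns -1 if no sheet has that name
if (idx != -1) {
    HSSFSheet sheet = wb.getSheetAt(idx);
    // read the sheet's rows into your Java classes here
}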
You can use:
if (wb.getSheet(cell.toString()) != null) {
    Sheet sheet = wb.getSheet(cell.toString());
    // code after getting the sheet
}
I think you gave up too early. Whenever you are doing development, the documentation is your best friend, and you can always look at a few examples. It's a straightforward task: you just have to load the Excel file, extract it sheet by sheet, and perform your task. You can also look at some tutorials for JXL; they are very similar and good for beginners.
Can anyone tell me where I can find some useful documentation on copying rows, cells, and columns from one Excel file to another using POI?
I need to insert two or more templates from other files into one blank Excel file, dynamically.
I also need to keep all the styles applied to the group of cells that I copy. How can I do that? I found nothing about this in the Apache POI tutorial.
I am using POI 3.0.1.
Thank you!
I assume the problem is data types and merged cells? It's easy enough to get and set styles and set values.
Depending on your use case, you might be able to take entire sheets from the original document, assemble the new document from those and tweak it to your liking. Even if you have to combine multiple source sheets into one target sheet, you might still be able to retrieve source rows and assemble the target document from those rows.
...that was me some time ago...
I never could copy from one Excel file to another with the exact style, but I found a solution: I used multiple worksheets instead of multiple Excel files, because a style has no problem being copied from one sheet to another as long as both sheets are in the same workbook.
I also migrated from POI 3.0.1 to POI 3.6. Far better.
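A sketch of that within-one-workbook copy (values plus styles), using the POI 3.6-era API; it works because both sheets belong to the same workbook, so the CellStyle objects can be shared directly:

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

public class SheetCopy {
    // copies one row's values and styles; safe because src and dst
    // belong to the same workbook, so styles can be reused as-is
    static void copyRow(Sheet srcSheet, int srcRowNum, Sheet dstSheet, int dstRowNum) {
        Row src = srcSheet.getRow(srcRowNum);
        if (src == null) return;
        Row dst = dstSheet.createRow(dstRowNum);
        for (Cell c : src) {
            Cell nc = dst.createCell(c.getColumnIndex());
            nc.setCellStyle(c.getCellStyle()); // same workbook: style can be shared
            switch (c.getCellType()) {
                case Cell.CELL_TYPE_NUMERIC: nc.setCellValue(c.getNumericCellValue()); break;
                case Cell.CELL_TYPE_STRING:  nc.setCellValue(c.getStringCellValue());  break;
                case Cell.CELL_TYPE_BOOLEAN: nc.setCellValue(c.getBooleanCellValue()); break;
                case Cell.CELL_TYPE_FORMULA: nc.setCellFormula(c.getCellFormula());    break;
                default: break;
            }
        }
    }
}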