Cutting Rows to Create Sample in an Excel File - java

I'm working on a large Excel spreadsheet for a data-mining project related to housing costs. There are multiple sheets in this file, each with 20-50 columns and around 20,000 rows.
For each sheet, I need to create two more sheets. One will contain a random sample of 10% of the rows. The other will contain the other 90% of rows not included in the sample. Are there any Excel commands or plugins to easily achieve this?

VBA Excel: You can start recording a macro in Excel, perform the major steps manually while recording is on, then stop recording and look at the macro.
You can then run the macro on the original file (the file as it was before you made the manual changes).
This macro may also work for you later if only some values in the tables change.
Usually, though, you have to look at the macro's code and edit it to adapt things like the file name or the number of columns. If you are not sure about something, record just one manual step and see which VBA command is invoked. Or, as said above, just change values in the sheet; the originally recorded macro might already be fine to run on the new values. You will find a lot of information on VBA for Excel; the starting point is simply to record a macro.
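Alternatively, since the question is tagged java: here is a minimal Apache POI sketch of the split. The input and output file names are placeholders, the whole workbook is held in memory, and each row joins the sample with probability 0.10 (so the sample holds roughly, not exactly, 10% of rows); cell values are copied crudely, without styles or formulas.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Random;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;

public class SampleSplit {
    public static void main(String[] args) throws Exception {
        // "housing.xls" is a placeholder file name
        Workbook wb = new HSSFWorkbook(new FileInputStream("housing.xls"));
        Random rnd = new Random();
        int originalSheets = wb.getNumberOfSheets();
        for (int s = 0; s < originalSheets; s++) {
            Sheet src = wb.getSheetAt(s);
            Sheet sample = wb.createSheet(src.getSheetName() + "_sample");
            Sheet rest = wb.createSheet(src.getSheetName() + "_rest");
            int sampleRow = 0, restRow = 0;
            for (Row row : src) {
                boolean inSample = rnd.nextDouble() < 0.10;
                Row out = inSample ? sample.createRow(sampleRow++) : rest.createRow(restRow++);
                for (Cell cell : row) {
                    Cell c = out.createCell(cell.getColumnIndex());
                    if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                        c.setCellValue(cell.getNumericCellValue());
                    } else {
                        c.setCellValue(cell.toString()); // crude copy: text only
                    }
                }
            }
        }
        FileOutputStream out = new FileOutputStream("housing_split.xls");
        wb.write(out);
        out.close();
    }
}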

Related

Read from 1 excel and write those values in another excel using vlookups in java

I have an Excel document with names populated. I want to read the characteristics of those names from multiple other documents using VLOOKUP. I am using Java POI (just started, so I'm open to other solutions). I also need to do some calculations on some of the populated columns.
So far, I am able to read an Excel file. Any other pointers would be helpful.
Thanks.
Update: More information
I have 2 files:
File 1 columns - Name, detail1, detail2
File 2 columns - Name, detail3, detail4 & so on
I need to read both files and copy the File 2 columns into File 1 using VLOOKUP-style lookups. I need to automate this in Java.
I have the code to read and write the files, so I just need to implement the lookups.
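Rather than writing actual VLOOKUP formulas into cells, one common approach is to do the lookup in Java: index File 2 by name in a HashMap, then walk File 1 and append the matching columns. A minimal sketch, assuming .xls files, names in column 0, File 2's details in columns 1 and 2, and no blank name cells (all of these are assumptions about your layout):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;

public class VlookupJoin {
    public static void main(String[] args) throws Exception {
        // Index File 2 rows by the name in column 0
        Workbook wb2 = new HSSFWorkbook(new FileInputStream("file2.xls"));
        Map<String, Row> byName = new HashMap<String, Row>();
        for (Row row : wb2.getSheetAt(0)) {
            byName.put(row.getCell(0).toString(), row);
        }
        // Walk File 1 and append detail3/detail4 after its existing columns
        Workbook wb1 = new HSSFWorkbook(new FileInputStream("file1.xls"));
        for (Row row : wb1.getSheetAt(0)) {
            Row match = byName.get(row.getCell(0).toString());
            if (match == null) continue; // no hit, like VLOOKUP's #N/A
            row.createCell(3).setCellValue(match.getCell(1).toString());
            row.createCell(4).setCellValue(match.getCell(2).toString());
        }
        FileOutputStream out = new FileOutputStream("file1_joined.xls");
        wb1.write(out);
        out.close();
    }
}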

Reading excel data from multiple related excel sheets and dumping in java classes

I have an Excel file which contains multiple sheets. I want to relate these sheets to each other.
Here master is the root sheet: I will read its last column, and if a name in it matches any tab name, I will read that tab.
Finally, I will dump the data into Java classes which will represent these Excel sheets,
so that when I use the data in code it can be accessed as Master.Polygon.Cord etc.
Please suggest a way to do this using POI.
You can use a combination of getSheetIndex(name) and getSheetAt(index) using the HSSFWorkbook class
http://poi.apache.org/apidocs/index.html
You can use
if (workbook.getSheet(cell.toString()) != null) {
    Sheet sheet = workbook.getSheet(cell.toString());
    // code after getting the sheet
}
I think you gave up too early. Whenever you are doing development, the documentation is your best friend, and you can always look at a few examples. It's a straightforward task: you just have to load the Excel file, extract it sheet by sheet, and perform your task. You could also look at some tutorials for JXL; they are very similar for beginners.

read/write to a large size file in java

I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you see, I have records of different lengths. In each record I have N fixed bytes which contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file in an application and let the user browse and edit the records
(insert / update / delete records).
My initial plan is to create an index file from the original file and, for each record, keep the next and previous record addresses to navigate forward and backward easily (some sort of linked list, but in a file, not in memory).
Is there a library (Java library) to help me implement this requirement?
Any recommendation or experience that you think is useful?
----------------- EDIT ----------------------------------------------
Thanks for the guides and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file) and I can't change the file format, but I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and in the end save it back in the original file format.
Do you still recommend a database instead of a normal index file?
----------------- SECOND EDIT ----------------------------------------------
Record size in update mode is fixed: an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many thanks.
Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and/or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   - An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
   - An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
   - A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   4.1. Create a new file.
   4.2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
   4.3. Repeat step 4.2 for the second file.
5. If we completed all of the above successfully, delete the old file and the second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
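As a rough illustration of steps 2 and 3, here is a sketch of the in-memory index plus journal ("second file") idea. The 4-byte id / 4-byte length header is an assumption purely for illustration; the real layout comes from the third-party format.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {
    // id -> offset in the original file (records not yet touched)
    private final Map<Integer, Long> originalOffsets = new HashMap<Integer, Long>();
    // id -> offset in the journal file (added or updated records)
    private final Map<Integer, Long> journalOffsets = new HashMap<Integer, Long>();
    private final RandomAccessFile journal;

    public RecordStore(File journalFile) throws IOException {
        this.journal = new RandomAccessFile(journalFile, "rw");
    }

    // Add or update: append to the journal and repoint the index entry.
    public void put(int id, byte[] data) throws IOException {
        long pos = journal.length();
        journal.seek(pos);
        journal.writeInt(id);          // assumed 4-byte id
        journal.writeInt(data.length); // assumed 4-byte length
        journal.write(data);
        originalOffsets.remove(id);
        journalOffsets.put(id, pos);
    }

    // Delete: just drop the index entry; compaction reclaims the space later.
    public void delete(int id) {
        originalOffsets.remove(id);
        journalOffsets.remove(id);
    }
}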
Having a data file and an index file would be the general basis for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated updates, deletions, etc. This kind of work should, in itself, be a separate project and not part of your main application. Essentially, though, a database is what you need, as it is specifically designed for such operations and use cases, and it will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run-time). It will not only be faster than anything you'd write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license in case any legal issue applies to your app). There is no need for a database server or third-party software; it's all pure Java.
The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, and so on.
For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made and tested solutions is not only smart, but will cut down your development time (and maintenance effort).
Cheers.
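For what it's worth, a minimal embedded-Derby sketch might look like this; the database and table names are placeholders, and on pre-JDBC-4 JVMs you also need the Class.forName line to load the driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DerbyDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver"); // not needed on JDBC 4+
        // ";create=true" makes Derby create the embedded database on first use
        Connection con = DriverManager.getConnection("jdbc:derby:recordsDB;create=true");
        Statement st = con.createStatement();
        st.executeUpdate("CREATE TABLE records (id INT PRIMARY KEY, data BLOB)");
        st.close();
        con.close();
    }
}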
Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much, and I would suggest you keep as much in memory as possible.
The problems you are likely to have are ensuring the data is consistent and recovering it when corruption occurs; dealing with fragmentation efficiently (something the brightest minds working on garbage collectors deal with); and maintaining the index transactionally with the source data so there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.
(Note: My answer is about the problem in general, not considering any Java libraries or, as the other answers also proposed, using a database (library), which might be better than reinventing the wheel.)
The idea of creating an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip over the data with a file seek.
You should also think about the edit functionality. Inserting and deleting in particular can be very slow on such a big file if done wrong (e.g. deleting and then moving all the following entries up to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those slots or append to the end of the file.
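To make the skip-with-a-seek idea concrete, here is a sketch of the index pass. It assumes, purely for illustration, that the fixed header is a 4-byte int id followed by a 4-byte int data length; substitute the real N-byte layout of your format.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordIndexer {
    public static Map<Integer, Long> buildIndex(File f) throws IOException {
        Map<Integer, Long> index = new LinkedHashMap<Integer, Long>();
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        try {
            long pos = 0, len = raf.length();
            while (pos < len) {
                raf.seek(pos);                // jump straight to the next header
                int id = raf.readInt();       // assumed 4-byte id
                int dataLen = raf.readInt();  // assumed 4-byte data length
                index.put(id, pos);           // remember where this record starts
                pos += 8 + dataLen;           // skip the data without reading it
            }
        } finally {
            raf.close();
        }
        return index;
    }
}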
Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records in a file is expensive because you have to move all the following content of the file to create space for the new record or to reclaim the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
As others have stated, a database seems a better solution. The following are Java SQL databases that could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a plain file, look at JRecord. It has:
several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files); any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job;
an editor for editing JRecord files; the latest version of the editor can handle large files (it uses compression / a spill file), but it suffers from having to load the whole file, and only one user can edit the file at a time.
The JRecord solution will only work if there is a limited number of users (preferably one), all in the one location, and the infrastructure is fast.

POI dynamic templates

Can anyone tell me where to find some useful documentation on copying rows, cells and columns from one Excel file to another, using POI?
I need to insert into one blank Excel file two or more templates from other files, dynamically.
I also need to keep all the styles made for the group of cells that I copy. How can I do that? I found nothing on this point in the Apache POI tutorial.
I am using POI 3.0.1.
Thank you!
I assume the problem is data types and merged cells? It's easy enough to get and set styles and set values.
Depending on your use case, you might be able to take entire sheets from the original document, assemble the new document from those and tweak it to your liking. Even if you have to combine multiple source sheets into one target sheet, you might still be able to retrieve source rows and assemble the target document from those rows.
...that was me some time ago...
I never could copy from one Excel file to another with the exact style, but I found a solution: I used multiple worksheets instead of multiple Excel files, because a style has no problem being copied from one sheet to another as long as it stays in the same workbook.
I also migrated from POI 3.0.1 to POI 3.6. Much better.
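That workbook scoping is the key detail: CellStyle objects belong to a workbook, so they can be shared between sheets of the same workbook but not handed to another file. A sketch of a style-preserving row copy within one workbook (using the older int cell-type constants, since the thread mentions POI 3.x):

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

public class RowCopy {
    // Copy one row to another sheet of the SAME workbook, keeping styles.
    static void copyRow(Row src, Sheet targetSheet, int targetRowNum) {
        Row dst = targetSheet.createRow(targetRowNum);
        dst.setHeight(src.getHeight());
        for (Cell cell : src) {
            Cell c = dst.createCell(cell.getColumnIndex());
            c.setCellStyle(cell.getCellStyle()); // same workbook: styles can be shared
            switch (cell.getCellType()) {
                case Cell.CELL_TYPE_NUMERIC: c.setCellValue(cell.getNumericCellValue()); break;
                case Cell.CELL_TYPE_STRING:  c.setCellValue(cell.getStringCellValue()); break;
                case Cell.CELL_TYPE_FORMULA: c.setCellFormula(cell.getCellFormula()); break;
                case Cell.CELL_TYPE_BOOLEAN: c.setCellValue(cell.getBooleanCellValue()); break;
                default: break; // blank and error cells left empty
            }
        }
    }
}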

API to write huge excel files using java

I am looking to write to an Excel (.xls MS Excel 2003 format) file programmatically using Java. The Excel output files may contain ~200,000 rows, which I plan to split over a number of sheets (64k rows per sheet, due to the Excel limit).
I have tried using the Apache POI APIs, but it seems to be a memory hog due to the API object model. I am forced to add cells/sheets to the workbook object in memory, and only once all data is added can I write the workbook to a file! Here is a sample of how Apache recommends I write Excel files using their API:
import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.*;

Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");
// Create a row and put some cells in it
Row row = sheet.createRow(0);
// Create a cell and put a value in it.
Cell cell = row.createCell(0);
cell.setCellValue(1);
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
Clearly, writing ~20k rows (with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space".
I have tried increasing the JVM's initial and maximum heap sizes with the -Xms and -Xmx parameters, as -Xms512m and -Xmx1024m. I still can't write more than 150k rows to the file.
I am looking for a way to stream to an Excel file instead of building the entire file in memory before writing it to disk, which will hopefully save a lot of memory. Any alternative API or solutions would be appreciated, but I am restricted to Java. Thanks! :)
Try the SXSSF workbook; it's the right tool for huge documents. It builds the document without eating RAM, because it keeps only a window of rows in memory and flushes the rest to a temporary file as you write (note that SXSSF produces the newer .xlsx format, not .xls).
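A minimal streaming-write sketch; the window size, sheet name and file name are arbitrary choices for illustration:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingWrite {
    public static void main(String[] args) throws Exception {
        // Keep at most 100 rows in memory; older rows are flushed to a temp file
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < 200000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 20; c++) {
                row.createCell(c).setCellValue("r" + r + "c" + c);
            }
        }
        FileOutputStream out = new FileOutputStream("huge.xlsx");
        wb.write(out);
        out.close();
        wb.dispose(); // delete the temporary backing files
    }
}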
All existing Java APIs try to build the whole document in RAM at once. Try writing an XML file which conforms to the new xlsx file format instead. To get started, I suggest building a small file of the desired form in Excel and saving it; then open it, examine the structure, and replace the parts you want.
Wikipedia has a good article about the overall format.
I had to split my output into several Excel files in order to overcome the heap space exception. I figured that around 5k rows with 22 columns was about the limit, so I made my logic end the file at every 5k rows, start a new one, and number the files accordingly.
In cases where I had 20k+ rows to be written, I would have 4+ different files representing the data.
Have a look at the HSSF serializer from the Cocoon project.
The HSSF serializer catches SAX events and creates a spreadsheet in the XLS format used by Microsoft Excel
There is also JExcelApi, but it uses more memory. I think you should create a .csv file and open it in Excel. That allows you to pass a lot of data, but you won't be able to do any "excel magic".
Consider using the CSV format. This way you aren't limited by memory anymore (well, perhaps only while prepopulating the data for the CSV, but this can be done efficiently as well, for example by querying subsets of rows from the DB using LIMIT/OFFSET and immediately writing them to the file, instead of hauling the entire DB table contents into Java's memory before writing any line). Also, in the newer Excel format the row limit per sheet rises to about one million.
That said, if the data is actually coming from a DB, then I would seriously reconsider whether Java is the right tool for this. Most decent DBs have an export-to-CSV function which can do this task undoubtedly much more efficiently. In the case of MySQL, for example, you can use SELECT ... INTO OUTFILE for this.
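A sketch of the paged export idea; the JDBC URL, credentials, table and column names are all placeholders, and proper CSV quoting is omitted:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/housing", "user", "pass");
        PreparedStatement ps = con.prepareStatement(
                "SELECT a, b, c FROM data LIMIT ? OFFSET ?");
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("export.csv")));
        int pageSize = 10000;
        for (int offset = 0; ; offset += pageSize) {
            ps.setInt(1, pageSize);
            ps.setInt(2, offset);
            ResultSet rs = ps.executeQuery();
            int rows = 0;
            while (rs.next()) {
                // naive CSV line; real code should quote/escape values
                out.println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
                rows++;
            }
            rs.close();
            if (rows < pageSize) break; // last page reached
        }
        out.close();
        ps.close();
        con.close();
    }
}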
We developed a Java library for this purpose, and it is currently available as an open-source project: https://github.com/jbaliuka/x4j-analytic . We use it for operational reporting.
We generate huge Excel files; ~200,000 rows should work without problems, and Excel manages to open such files too.
Our code uses POI to load a template, but the generated content is streamed directly to a file without an XML or object-model layer in memory.
Does this memory issue happen when you insert data into cells, or when you perform data computation/generation?
If you are going to load files into Excel that follow a predefined static template format, it is better to save a template and reuse it multiple times. Template cases typically arise when you are generating something like a daily sales report.
Otherwise, you need to create every new row, border, column etc. from scratch.
So far, Apache POI is the only choice I have found.
"Clearly, writing ~20k rows (with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space"."
What you CAN do is perform batch data insertion: create a queue-task table, and every time one page is generated, rest for a few seconds, then continue with the next portion. If you are worried about the data changing while your queued task runs, you can first write the primary keys into the Excel file (hiding and locking that column from the user's view). The first run inserts the primary keys; from the second queued run onwards, read the keys back from a scratch file and process the data portion by portion.
We did something quite similar, with the same amount of data, and we had to switch to JExcelApi because POI is so heavy on resources. Try JExcelApi; you won't regret it when you have to manipulate big Excel files!

Categories

Resources