Read .csv file in GB's using Java

Read .csv file in GB's using Java - java

I have following two requirements:
To read a CSV file and put rows line by line into the database (RDSMS) without any data manipulation.
To read a CSV file and put this data into the database (RDBMS). In this case, row Z might be dependent on row B. So need to have a staging DB (in-memory or another a staging RDBMS)
I am analyzing multiple ways to accomplish this:
Using Core java, and read file in Producer-consumer way.
Using Apache Camel and BeanIO to read the csv file.
Using SQL to read the file.
Wanted to know, if is there an already industry defined preferred way to do such kind of tasks?
I found few links on stackoverflow, but I am looking for more options:
How to read a large text file line by line using Java?
Read a huge file of numbers in Java in a memory-efficient way?
Read large CSV in java
I am using Java6 for implementation.

you should use NIO package to do such stuff in GBs. NIO is asynchronous, fastest till date and most reliable. you can simple read files in chunks via NIO packaging and then insert into db using bulk commands rather than single insertion. Single insertion take lot of your CPU cycles and may cause OOM errors.

We are using Apache Camel's "File:" protocol to read file and process the data.

You can use RandomAccessFile for reading csv file, it gives you fast enough read speed, it does not requires any extra jar file, here is code,
File f=new File(System.getProperty("user.home")+"/Desktop/CSVDOC1.csv");
RandomAccessFile ra = new RandomAccessFile(f,"rw");
ra.seek(0);//Read from start
long p=ra.getFilePointer();
String d= ra.readLine();
ra.seek(p);
while(d!=null){
//Each line data stored in variable d
d=ra.readLine();
//d="col1","col2","col2","col3"
//Separate line data by separator ","
//insert row values into database
}
//Release file lock
ra.close();

Related

How to delete a record and keep on reading a file?

I have to read a sequential file which has over a million of records. I have to read each line/record and have to delete that record/line from the file and keep on reading.
Not finding any example on how to do that without using temporary file or creating/recreating a new file of the same name.
These are text files. Each file is about .5 GB big and we have over a million lines/records in each file.
Currently we are copying all the records to memory as we do not want to re-process any record if any thing happens in the middle of the processing of a file.

Assuming that the file in question is a simple sequential file - you can't. In the Java file model, deleting part of a file implies deleting all of it after the deletion point.
Some alternative approaches are:
In your process copy the file, omitting the parts you want deleted. This is the normal way of doing this.
Overwrite the parts of the file you want deleted with some value that you know never occurs in the file, and then at a later date copy the file, removing the marked parts.
Store the entire file in memory, edit it as required, and write it again. Just because you have a million records doesn't make that impossible. If your files are 0.5GB, as you say, then this approach is almost certainly viable.
Each time you delete some record, copy all of the contents of the file after the deletion to its new position. This will be incredibly inefficient and error-prone.
Unless you can store the file in memory, using a temporary file is the most efficient. That's why everyone does it.
If this is some kind of database, then that's an entirely different question.
EDIT: Since I answered this. comments have indicated that what the user wants to do is use deletion to keep track of which records have already been processed. If that is the case, there are much simpler ways of doing this. One good way is to write a file which just contains a count of how many bytes (or records) of the file have been processed. If the processor crashes, update the file by deleting the records that have been processed and start again.

Files are unstructured streams of bytes; there is no record structure. You can not "delete" a "line" from an unstructured stream of bytes.
The basic algorithm you need to use is this:
create temporary file.
open input file
if at the end of the file, goto line 7
read a line from the input file
if the line is not to be deleted, write it to the temporary file
goto line 3
close the input file.
close the temporary file.
delete (or just rename) the input file.
rename (or move) the temporary file to have the original name of the input file.

There is a similar question asked, "Java - Find a line in a file and remove".
Basically they all use a temp file, there is no harm doing so. So why not just do it? It will not affect your performance much and can avoid some errors.

Why not a simple sed -si '/line I want to delete/d' big_file?

Read big blob (excel type) data sequentially

I hav a big excel file that is stored as a blob in the database (Oracle).
If I read the whole file into the memory I get OutOfMemoryException.
My solution looks like this: read one page of data from the database and parse it line by line (and here I mean Excel line - so the framework should be able so convert only parts of a spreadsheet). Then check the last whole line that has been read (the size of the lines can vary, so I can not predict the number of bytes in a line, so the chunks would have the last line cut in two parts) and get the next chunk from there.
I need an open source java Excel library that can accomplish this.
Questions:
do you have a better solution?
if not which libraries can read an excel sequentially?

Read and process large one line JSON file

I have a large json file(200 MB), but all are in one single line.
I need to do some processing with the data in the file and write the data in to a relational database.
What is the best way we can do this using java.
Note: Most of the available methods are using line by line reading. Also We can use thing like MappedByteBuffer to read by characters but it is not an efficient solution
Non java solutions are also welcome

I recommend you the library from Douglas Crackford https://github.com/douglascrockford/JSON-java, use the following command to load a json array.
org.json.JSONArray mediaArray = new org.json.JSONArray(filecontent);
Check the following article for read a file content.
http://www.javapractices.com/topic/TopicAction.do?Id=42

Spring Batch: Reading database and writing into multi record Flat files

Hi I am doing POC/base for design on reading database and writing into flat files. I am struggling on couple of issues here but first I will tell you the output format of flat file
Please let me know how do design the input writer where I need to read the transactions from different tables, process records , figure out the summary fields and then how should I design the Item Writer which has such a complex design. Please advice. I am successfully able to read from single table and write to file but the above task looks complex.

Extend the FlatFileItemWriter to only open a file once and append to it instead of overwriting it. Then pass that same filewriter into multiple readers in the order you would like them to appear. (Make sure that each object read by the readers are extensible by something that the writer understands! Maybe interface BatchWriteable would be a good name.)
Some back-of-the-envelope pseudocode:
Before everything starts:
Open file.
Write file headers.
Start Batch step
implement as many times as necessary
Read Batch section
Process Batch section
A Write Batch section
when done:
Write file footer
Close file

API to write huge excel files using java

I am looking to write to an excel (.xls MS Excel 2003 format) file programatically using Java. The excel output files may contain ~200,000 rows which I plan to split over number of sheets (64k rows per sheet, due to the excel limit).
I have tried using the apache POI APIs but it seems to be a memory hog due to the API object model. I am forced to add cells/sheets to the workbook object in memory and only once all data is added, I can write the workbook to a file! Here is a sample of how the apache recommends i write excel files using their API:
Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");
//Create a row and put some cells in it
Row row = sheet.createRow((short)0);
// Create a cell and put a value in it.
Cell cell = row.createCell(0);
cell.setCellValue(1);
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
Clearly, writing ~20k rows(with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space".
I have tried increasing JVM initial heapsize and max heap size using Xms and Xmx parameters as Xms512m and Xmx1024. Still cant write more than 150k rows to the file.
I am looking for a way to stream to an excel file instead of building the entire file in memory before writing it to disk which will hopefully save a lot of memory usage. Any alternative API or solutions would be appreciated, but I am restricted to usage of java. Thanks! :)

Try to use SXSSF workbook, thats great thing for huge xls documents, its build document and don't eat RAM at all, becase using nio

All existing Java APIs try to build the whole document in RAM at once. Try to write an XML file which conforms to the new xslx file format instead. To get you started, I suggest to build a small file in the desired form in Excel and save it. Then open it and examine the structure and replace the parts you want.
Wikipedia has a good article about the overall format.

I had to split my files into several excel files in order to overcome the heap space exception. I figured that around 5k rows with 22 columns was about it, so I just made my logic so that every 5k row I would end the file, start a new one and just numerate the files accordingly.
In the cases where I had 20k + rows to be written I would have 4+ different files representing the data.

Have a look at the HSSF serializer from the cocoon project.
The HSSF serializer catches SAX events and creates a spreadsheet in the XLS format used by Microsoft Excel

There also is JExcelApi, but its uses more memory. i think you should create .csv file and open it in excel. it allows you to pass a lot of data, but you wont be able to do any "excel magic".

Consider using CSV format. This way you aren't limited by memory anymore --well, maybe only during prepopulating the data for CSV, but this can be done efficiently as well, for example querying subsets of rows from DB using for example LIMIT/OFFSET and immediately write it to file instead of hauling the entire DB table contents into Java's memory before writing any line. The Excel limitation of the amount rows in one "sheet" will increase to about one million.
That said, if the data is actually coming from a DB, then I would highly reconsider if Java is the right tool for this. Most decent DB's have an export-to-CSV function which can do this task undoubtely much more efficient. In case of for example MySQL, you can use the LOAD DATA INFILE command for this.

We developed a java library for this purpose and currently it is available as open source project https://github.com/jbaliuka/x4j-analytic . We use it for operational reporting.
We generate huge Excel files, ~200,000 should work without problems, Excel manages to open such files too.
Our code uses POI to load template but generated content is streamed directly to file without XML or Object model layer in memory.

Is this memory issue happen when you insert data into cell, or when you perform data computation/generation?
If you are going to load files into an excel that consist of predefined static template format, then better to save a template and reuse multiple time. Normally template cases happen when you are going to generate daily sales report or etc...
Else, every time you need to create new row, border, column etc from scratch.
So far, Apache POI is the only choice I found.
"Clearly, writing ~20k rows(with some 10-20 columns in each row) gives me the dreaded "java.lang.OutOfMemoryError: Java heap space"."
"Enterprise IT"
What YOU CAN DO is- perform batch data insertion. Create a queuetask table, everytime after generate 1 page, rest for seconds, then continue second portion. If you are worry about the dynamic data changes during your queue task, you can first get the primary key into the excel (by hiding and lock the column from user view). First run will be insert primary key, then second queue run onwards will read out from notepad and do the task portion by portion.

We did something quite similar, same amount of data, and we had to switch to JExcelapi because POI is so heavy on resources. Try JexcelApi, you won't regret it when you have to manipulate big Excel-files!

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.