I have an XML file with 60k entities. I want to process it in batches of 20k. I am using a SAX parser to parse the entities and store them in a list.
I parsed all 60k entities, stored them in a file/array/list, and then processed each one separately. I don't think that is the best solution.
Is there any way to read only 20k entities from the XML file, process them, and then read the XML file again for the next batch?
I think you can use multithreading. Create three threads and have each thread read and process 20k entities, so one thread handles one batch of 20k while another thread picks up the next 20k.
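If it helps, here is a rough, untested sketch of how the batching could look with SAX plus a small thread pool. It assumes the repeated element is literally named entity and that its text content is what you want to keep; the handler flushes every 20k entities to a worker, so all 60k are never held in memory at once.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BatchingSaxHandler extends DefaultHandler {

    private static final int BATCH_SIZE = 20000;

    private final ExecutorService pool = Executors.newFixedThreadPool(3);
    private List<String> batch = new ArrayList<String>(BATCH_SIZE);
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the current element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("entity".equals(qName)) {              // assumed element name
            batch.add(text.toString());
            if (batch.size() == BATCH_SIZE) {
                flush();                           // hand off a full batch of 20k
            }
        }
    }

    @Override
    public void endDocument() {
        flush();                                   // process the final partial batch
        pool.shutdown();
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        final List<String> toProcess = batch;      // give this batch to a worker thread
        batch = new ArrayList<String>(BATCH_SIZE);
        pool.submit(new Runnable() {
            public void run() {
                for (String entity : toProcess) {
                    // process one entity here
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse("entities.xml", new BatchingSaxHandler()); // file name is an assumption
    }
}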
I am very new to Java and have been tasked to use Spring Batch to read in some text files. So far, Spring Batch resources online have helped me get to the point where I am reading, processing and writing some simple test .csv files into Mongo.
The problem I have now is that the actual file I would like to read from has over 600 columns, meaning that with the current way I am reading the file into Java, I would need 600+ fields in my @Document Mongo model.
I have been thinking of a couple of ways to get around this.
First, I was thinking maybe I could read in each line as a String and then, in my processor, deal with splitting everything up and formatting the data so I could return a list for my MongoTemplate, but returning a List is not viable from the overridden process method.
So my question to you guys is:
What is the best way to handle reading in files with hundreds of columns in Spring Batch? Or what would be the best resource to start reading to help point me in the right direction?
Thanks!
I had the same problem. I used
http://opencsv.sourceforge.net/apidocs/com/opencsv/CSVReader.html
for reading the CSVs.
I suggest you use a Map instead of 600 Java fields.
Besides, 600x600 Java Strings is not a big deal for Java, and neither is it for Mongo.
To work with Mongo, use http://jongo.org/
If you really need batch processing of the data, your flow should be (a rough sketch in code follows below):
Loop: divide the file into batches (say 300 rows per loop).
Read the next 300 rows (as Java objects, or as Maps) from the file into memory.
Sanitize or process them if needed.
Store them in MongoDB.
Return when you hit EOF.
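Here is a rough sketch of that flow with opencsv, assuming the file is named wide.csv, the first row is the header, and saveToMongo is a placeholder standing in for Jongo or a MongoTemplate call:
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.opencsv.CSVReader;

public class WideCsvLoader {

    private static final int BATCH_SIZE = 300;

    public static void main(String[] args) throws Exception {
        CSVReader reader = new CSVReader(new FileReader("wide.csv"));
        try {
            String[] header = reader.readNext();               // the 600+ column names
            List<Map<String, String>> batch = new ArrayList<Map<String, String>>();
            String[] row;
            while ((row = reader.readNext()) != null) {
                Map<String, String> record = new HashMap<String, String>();
                for (int i = 0; i < header.length && i < row.length; i++) {
                    record.put(header[i], row[i]);             // column name -> value
                }
                batch.add(record);
                if (batch.size() == BATCH_SIZE) {
                    saveToMongo(batch);                        // sanitize/process, then store
                    batch.clear();
                }
            }
            saveToMongo(batch);                                // the last partial batch
        } finally {
            reader.close();
        }
    }

    private static void saveToMongo(List<Map<String, String>> batch) {
        // placeholder: insert the maps with Jongo or MongoTemplate
    }
}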
I ended up just reading in each line as a String object. Then, in the processor, I loop over the String with a delimiter, creating my Mongo repository objects and storing them. So I am basically doing all of the writing inside the processor method, which I would say is definitely not best practice, but it gives me the desired end result.
I have the following two requirements:
To read a CSV file and put the rows line by line into the database (RDBMS) without any data manipulation.
To read a CSV file and put the data into the database (RDBMS). In this case, row Z might be dependent on row B, so I need a staging DB (in-memory, or another staging RDBMS).
I am analyzing multiple ways to accomplish this:
Using core Java, reading the file in a producer-consumer way.
Using Apache Camel and BeanIO to read the CSV file.
Using SQL to read the file.
I wanted to know: is there an already established, industry-preferred way to do this kind of task?
I found a few links on Stack Overflow, but I am looking for more options:
How to read a large text file line by line using Java?
Read a huge file of numbers in Java in a memory-efficient way?
Read large CSV in java
I am using Java 6 for the implementation.
You should use the NIO package for files of that size (GBs). NIO supports asynchronous I/O and is fast and reliable. You can simply read the file in chunks via NIO and then insert into the database using bulk/batch commands rather than single-row inserts. Single-row inserts eat a lot of CPU cycles and may cause OOM errors.
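A minimal sketch of that combination, with some assumptions: a simple two-column CSV named input.csv, a table STAGING(col1, col2), and a placeholder JDBC URL you would substitute. Channels.newReader keeps the NIO channel usable from Java 6, and the inserts go through JDBC batching:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvBulkLoader {

    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        FileChannel channel = new FileInputStream("input.csv").getChannel();
        BufferedReader reader = new BufferedReader(Channels.newReader(channel, "UTF-8"));
        Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
        PreparedStatement ps = con.prepareStatement("INSERT INTO STAGING (col1, col2) VALUES (?, ?)");
        try {
            int pending = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                ps.setString(1, cols[0]);
                ps.setString(2, cols[1]);
                ps.addBatch();                       // queue the row instead of inserting one by one
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();               // one round trip for the whole chunk
                    pending = 0;
                }
            }
            ps.executeBatch();                       // flush the last partial chunk
        } finally {
            ps.close();
            con.close();
            reader.close();
        }
    }
}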
We are using Apache Camel's "file:" component to read the file and process the data.
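For illustration only (not our actual route; the folder path and endpoint options are assumptions), a minimal route that streams a file line by line could look like this:
import org.apache.camel.CamelContext;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CsvFileRoute {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("file:data/inbox?noop=true")             // poll the folder for files
                    .split(body().tokenize("\n")).streaming() // one line at a time, without loading the whole file
                    .process(new Processor() {
                        public void process(Exchange exchange) {
                            String line = exchange.getIn().getBody(String.class);
                            // parse the line and write it to the database here
                        }
                    });
            }
        });
        context.start();
        Thread.sleep(10000);                                  // let the route run briefly for the demo
        context.stop();
    }
}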
You can use RandomAccessFile for reading the CSV file; it gives you a fast enough read speed and does not require any extra jar. Here is the code:
File f = new File(System.getProperty("user.home") + "/Desktop/CSVDOC1.csv");
RandomAccessFile ra = new RandomAccessFile(f, "r"); // "r" is enough, we only read
ra.seek(0); // read from the start of the file
String d;
while ((d = ra.readLine()) != null) {
    // each line is stored in d, e.g. "col1","col2","col2","col3"
    // separate the line data by the "," separator
    String[] values = d.split(",");
    // insert the row values into the database
}
// release the file
ra.close();
When processing a step with chunk processing (specifying a commit-interval) in Spring Batch, is there a way to know inside the Writer when all the records in a file have been read and processed? My idea was to pass the collection of records read from the file to the ExecutionContext once all the records have been read.
Please help.
I don't know if there is a pre-built CompletionPolicy that does what you want, but if not you can write a custom CompletionPolicy that marks a chunk as completed when the reader returns null; this way you hold all items read from the file in a single chunk.
That said, are you sure this is exactly what you want? Storing all items in the ExecutionContext is not good practice; you also lose chunk processing, restartability, and all the other Spring Batch features...
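If you do go the custom-policy route, one possible shape is a reader decorator that flags when the real reader is drained, plus a policy that reports the chunk complete at that point. This is only a sketch; the class names and wiring are hypothetical:
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.repeat.RepeatContext;
import org.springframework.batch.repeat.policy.CompletionPolicySupport;

// Decorates the real reader and remembers when it has returned null (end of input).
class ExhaustionAwareReader<T> implements ItemReader<T> {
    private final ItemReader<T> delegate;
    private volatile boolean exhausted = false;

    ExhaustionAwareReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    public T read() throws Exception {
        T item = delegate.read();
        if (item == null) {
            exhausted = true;
        }
        return item;
    }

    boolean isExhausted() {
        return exhausted;
    }
}

// Reports the chunk complete only once the decorated reader has been drained,
// so all items from the file end up in a single chunk.
class ReaderExhaustedCompletionPolicy extends CompletionPolicySupport {
    private final ExhaustionAwareReader<?> reader;

    ReaderExhaustedCompletionPolicy(ExhaustionAwareReader<?> reader) {
        this.reader = reader;
    }

    @Override
    public boolean isComplete(RepeatContext context) {
        return reader.isExhausted();
    }
}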
I have some logic already implemented that allows me to query for a set of records which contain an XML column. For each record I have to save the XML column data as a separate XML file, as well as produce a CSV file of the same data. The problem is I can only do this for a limited number of records, as I start to run out of memory if too many records are being processed.
Currently the approach is:
To write the CSV, I take the InputStream for each XML column returned by the query (using SQLXML objects for this) and parse it with a DOM parser plus some XPath in order to identify just the elements that contain any sort of text (the child nodes; I don't really care about the parents). I then use the element name as the header and the text as the value in the CSV file (sketched in code below). The data is written to a file using a BufferedWriter, along with StringBuilders to hold the text (one for the header, one for the values).
Saving the data to an XML file is accomplished by taking the same InputStream mentioned above, converting it to a String, and finally writing it out to a file using a BufferedWriter. This really isn't the nicest-looking result, since the entire XML document ends up on a single line in the file, but it works for now.
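To make the above concrete, a condensed version of the CSV side looks roughly like this (a simplified sketch, not my production code; the XPath keeps only leaf elements that actually contain text, and the method signature is just for illustration):
import java.io.BufferedWriter;
import java.io.Writer;
import java.sql.SQLXML;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlColumnToCsv {

    public static void appendRow(SQLXML xmlColumn, BufferedWriter csv, boolean writeHeader) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlColumn.getBinaryStream());

        // leaf elements (no child elements) whose text is not just whitespace
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList leaves = (NodeList) xpath.evaluate(
                "//*[not(*)][normalize-space(.) != '']", doc, XPathConstants.NODESET);

        StringBuilder header = new StringBuilder();
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < leaves.getLength(); i++) {
            Node leaf = leaves.item(i);
            if (i > 0) {
                header.append(',');
                values.append(',');
            }
            header.append(leaf.getNodeName());         // element name becomes the column header
            values.append(leaf.getTextContent().trim()); // element text becomes the value
        }
        if (writeHeader) {
            csv.write(header.toString());
            csv.newLine();
        }
        csv.write(values.toString());
        csv.newLine();
    }
}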
I am quite new to working with XML in Java, so I am just looking for any advice or input on possible alternatives or more efficient practices for this same process. I would like to be able to process at least 70 records at once.
Thanks in advance.
Hi, I am doing a POC / design baseline for reading from a database and writing to flat files. I am struggling with a couple of issues here, but first I will describe the output format of the flat file.
Please let me know how to design the reader, where I need to read the transactions from different tables, process the records, and figure out the summary fields, and then how to design the ItemWriter, which has such a complex layout. Please advise. I am able to read from a single table and write to a file successfully, but the above task looks complex.
Extend the FlatFileItemWriter so that it only opens the file once and appends to it instead of overwriting it. Then reuse that same writer across your multiple readers, in the order you would like their output to appear. (Make sure that each object read by the readers implements something the writer understands! Maybe an interface named BatchWriteable would be a good name.) A configuration sketch follows the pseudocode below.
Some back-of-the-envelope pseudocode:
Before everything starts:
    Open the file.
    Write the file headers.
Batch step (repeat as many times as necessary):
    Read a batch section.
    Process the batch section.
    Write the batch section.
When done:
    Write the file footer.
    Close the file.
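And a rough configuration sketch of the shared writer, not a drop-in solution: instead of subclassing, this leans on FlatFileItemWriter's built-in append/header/footer hooks, the file name is a placeholder, and note that the header/footer callbacks only fire when this writer instance creates and closes the file, so check that against your step layout.
import java.io.IOException;
import java.io.Writer;
import org.springframework.batch.item.file.FlatFileFooterCallback;
import org.springframework.batch.item.file.FlatFileHeaderCallback;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;

public class ReportWriterFactory {

    public static FlatFileItemWriter<Object> reportWriter() {
        FlatFileItemWriter<Object> writer = new FlatFileItemWriter<Object>();
        writer.setResource(new FileSystemResource("output/report.txt"));
        writer.setAppendAllowed(true); // append across steps instead of overwriting
        writer.setHeaderCallback(new FlatFileHeaderCallback() {
            public void writeHeader(Writer w) throws IOException {
                w.write("FILE HEADER"); // written when the file is first created
            }
        });
        writer.setFooterCallback(new FlatFileFooterCallback() {
            public void writeFooter(Writer w) throws IOException {
                w.write("FILE FOOTER"); // written when the writer is closed
            }
        });
        // each item's toString() becomes one line; swap in your own LineAggregator
        // (or a shared interface like the suggested BatchWriteable) as needed
        writer.setLineAggregator(new PassThroughLineAggregator<Object>());
        return writer;
    }
}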