I am very new to Java and have been tasked with using Spring Batch to read in some text files. So far, online Spring Batch resources have helped me get to the point where I am reading, processing and writing some simple test .csv files into Mongo.
The problem I have now is that the actual file I would like to read from has over 600 columns, meaning that with the current way I am reading my file into Java, I would need 600+ fields in my @Document Mongo model.
I have been thinking of a couple of ways to get around this. First, I thought I could read in each line as a String and then, in my processor, deal with splitting everything up and formatting the data so that I could return a list of my MongoTemplate objects, but returning a List is not viable from the overridden process method.
So my question is: what is the best way to handle reading in files with hundreds of columns in Spring Batch? Or what would be the best resource to start reading to point me in the right direction?
Thanks!
I had the same problem. I used OpenCSV's CSVReader (http://opencsv.sourceforge.net/apidocs/com/opencsv/CSVReader.html) for reading the CSVs.
I suggest you use a Map instead of 600 Java fields (a small sketch follows below).
Besides, 600 x 600 Java Strings is not a big deal for Java, and not for Mongo either.
To work with Mongo, use Jongo (http://jongo.org/).
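A minimal sketch of the Map idea; the two String[] arrays are what CSVReader.readNext() gives you for the header line and a data line (the class and method names are just placeholders):

import java.util.LinkedHashMap;
import java.util.Map;

public class RowMapper {

    // Turns one parsed CSV line into a Map keyed by the header names,
    // so the Mongo document does not need 600+ individual Java fields.
    public static Map<String, String> toMap(String[] header, String[] values) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < values.length; i++) {
            row.put(header[i], values[i]);
        }
        return row;
    }
}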
If you really need batch processing of the data, your flow should be (a rough sketch follows the list):
Loop here: divide the work into batches (say 300 rows per loop)
Read the 300 rows (as Java objects, or into Maps) from the file into memory.
Sanitize or process them if needed.
Store them in MongoDB.
Return at EOF.
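A rough sketch of that flow, assuming the Jongo library mentioned above; the file name, batch size, database and collection names are all made up:

import com.mongodb.MongoClient;
import com.opencsv.CSVReader;
import org.jongo.Jongo;
import org.jongo.MongoCollection;

import java.io.FileReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchedCsvToMongo {

    public static void main(String[] args) throws Exception {
        int batchSize = 300;                                    // assumed batch size
        MongoClient client = new MongoClient("localhost");      // assumed local Mongo instance
        MongoCollection records = new Jongo(client.getDB("testdb")).getCollection("wideRecords");

        try (CSVReader reader = new CSVReader(new FileReader("wide-file.csv"))) {
            String[] header = reader.readNext();                // first line = column names
            List<Map<String, String>> batch = new ArrayList<>();
            String[] line;
            while ((line = reader.readNext()) != null) {
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < line.length; i++) {
                    row.put(header[i], line[i]);                // sanitize/process values here if needed
                }
                batch.add(row);
                if (batch.size() == batchSize) {                // store one batch in MongoDB
                    for (Map<String, String> r : batch) {
                        records.insert(r);
                    }
                    batch.clear();
                }
            }
            for (Map<String, String> r : batch) {               // flush the final partial batch at EOF
                records.insert(r);
            }
        }
        client.close();
    }
}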
I ended up just reading in each line as a String. Then, in the processor, I loop over the String with a delimiter, creating my Mongo repository objects and storing them. So I am basically doing all of the writing inside the processor method, which I would say is definitely not best practice, but it gives me the desired end result.
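For reference, a minimal sketch of that approach. The comma delimiter, the column naming and the WideRecordRepository interface are made-up placeholders, and writing from inside the processor is, as said, not best practice:

import org.springframework.batch.item.ItemProcessor;

import java.util.LinkedHashMap;
import java.util.Map;

public class LineSplittingProcessor implements ItemProcessor<String, String> {

    // Hypothetical stand-in for whatever Spring Data Mongo repository is actually used.
    public interface WideRecordRepository {
        void save(Map<String, Object> document);
    }

    private final WideRecordRepository repository;

    public LineSplittingProcessor(WideRecordRepository repository) {
        this.repository = repository;
    }

    @Override
    public String process(String line) {
        String[] tokens = line.split(",");           // assumed delimiter
        Map<String, Object> document = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            document.put("col" + i, tokens[i]);      // or map indexes to real field names here
        }
        repository.save(document);                   // the write happens here, not in an ItemWriter
        return line;                                 // pass the item through unchanged
    }
}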
I have the task of comparing two big CSV files and writing the comparison result to a new file. File 1 has 200K rows and file 2 could also have 200K rows or fewer. Both have 200 columns. The files are not sorted and can be in any order. I am using Java 8 and Spring 4.
Question
I am using Spring Batch in my project. Is there any way I can achieve this using Spring Batch with a customized ItemReader and ItemWriter, or should I use a tasklet and plain Java code to compare the files? I also want to do it in the fastest way possible. The volume of data will be really huge, maybe 2-4 GB, so I don't want to load it all into memory. The file structures are something like the below.
File1:
regn_nbr,name,address1,countrycode,regn_date
2345,John,4332 JFK Boulevard,US,02-12-2011
2347,mark,4332 Maryland Avenue,US,04-27-2015
2348,Smith,4332 JFK road,US,07-30-2011
2302,Andy,4332 JFK lane,US,06-01-2010
File2:
regn_nbr,name,address1,countrycode,regn_date
2345,John,4332 JFK Boulevard,US,02-12-2011
2302,Andy,4332 JFK lane,US,06-01-2010
2911,Peter,12 candle drive,MX,01-01-2010
2348,Smith,4332 JFK road,US,07-30-2011
2347,mark,4332 Maryland Avenue,US,04-27-2015
Your suggestions, different approaches, strategies and expertise are most welcome.
Are you sure you need a special program for that? I would try it with:
Database bulk load (e.g. for MySQL, LOAD DATA INFILE)
Run a compare script afterwards and write the result to a file (e.g. for MySQL, SELECT ... INTO OUTFILE)
If memory really is your primary concern, all it needs is a small Java main class, some java.nio and simple Java SQL (a rough sketch of the bulk-load-and-compare part follows below).
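A rough sketch of that approach with MySQL over JDBC; the table and file names, the join key (regn_nbr) and the connection details are all assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvDbCompare {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/compare?allowLoadLocalInfile=true", "user", "password");
             Statement st = con.createStatement()) {

            // Bulk load both CSV files into staging tables (tables assumed to already exist).
            st.execute("LOAD DATA LOCAL INFILE 'file1.csv' INTO TABLE file1 "
                    + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' IGNORE 1 LINES");
            st.execute("LOAD DATA LOCAL INFILE 'file2.csv' INTO TABLE file2 "
                    + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' IGNORE 1 LINES");

            // Rows present in file1 but missing from file2, matched on regn_nbr.
            // A SELECT ... INTO OUTFILE could write the differences straight to a file instead.
            try (ResultSet rs = st.executeQuery(
                    "SELECT f1.regn_nbr, f1.name FROM file1 f1 "
                            + "LEFT JOIN file2 f2 ON f1.regn_nbr = f2.regn_nbr "
                            + "WHERE f2.regn_nbr IS NULL")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "," + rs.getString(2));
                }
            }
        }
    }
}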
I think that the best way is to read the files and create two lists of a specific Java bean that represents the structure of your file. These beans can implement Comparable, and you can write a method that orders and compares the lists with specific rules written by you.
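A minimal sketch of such a bean, using the columns from the sample files above; comparing by regn_nbr is just an assumed rule:

import java.util.Objects;

// Represents one row of the sample files; ordering and equality are based on regn_nbr.
public class Registration implements Comparable<Registration> {

    private final String regnNbr;
    private final String name;
    private final String address1;
    private final String countryCode;
    private final String regnDate;

    public Registration(String regnNbr, String name, String address1,
                        String countryCode, String regnDate) {
        this.regnNbr = regnNbr;
        this.name = name;
        this.address1 = address1;
        this.countryCode = countryCode;
        this.regnDate = regnDate;
    }

    @Override
    public int compareTo(Registration other) {
        return this.regnNbr.compareTo(other.regnNbr);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Registration)) return false;
        return Objects.equals(regnNbr, ((Registration) o).regnNbr);
    }

    @Override
    public int hashCode() {
        return Objects.hash(regnNbr);
    }
}

Both lists can then be sorted with Collections.sort and walked in a single pass to find missing or differing rows.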
I am using the MultiResourceItemReader class of Spring Batch, which uses a FlatFileItemReader bean as its delegate. My files contain XML requests; my batch job reads the requests from the files, posts them to a URL and writes the responses to corresponding output files. I want to dedicate one thread to each file to decrease the execution time. In my current requirement I have four input files, so I want four threads to read, process and write the files. I tried it with a simple task executor:
task-executor="simpleTaskExecutor" throttle-limit="20"
But after using this, the FlatFileItemReader throws an exception.
I am a beginner; please suggest how to implement this. Thanks in advance.
There are a couple of ways to go here. However, the easiest way would be to partition by file using the MultiResourcePartitioner. That in combination with the TaskExecutorPartitionHandler will give you reliable parallel processing of your input files. You can read more about partitioning in section 7.4 of our documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
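A hedged Java-config sketch of that partitioning setup; the step names, the input file pattern and the task executor choice are assumptions, not your actual configuration:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class PartitionConfigSketch {

    // One partition per input file; each partition runs workerStep on its own thread.
    public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
        Resource[] files = new PathMatchingResourcePatternResolver()
                .getResources("file:input/*.xml");               // assumed file location

        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(files);                          // puts each file into a step execution context

        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setStep(workerStep);
        handler.setGridSize(files.length);                        // four files -> four partitions

        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(handler)
                .build();
    }
}

The worker step's FlatFileItemReader can then pick up its own file via a step-scoped resource bound to #{stepExecutionContext['fileName']}, which is the key MultiResourcePartitioner uses by default.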
How do I write RowLoader Java code to load data from a sample.csv file into a GemFireXD database?
The GemFireXD distribution includes a JDBCRowLoader source example. Look in the examples directory. In your case you will have to determine which fields of your CSV you want to consider as the primary key, parse the CSV and return rows as needed.
You can check IMPORT_DATA_EX and IMPORT_TABLE_EX procedures to load data into GemFireXD.
Since you mentioned CSV format, IMPORT_DATA_EX might be the recommended way to do it, since you can also tweak the number of threads and constraints while loading the data. It's definitely one of the fastest ways to do it, but please note that the CSV file must be available from the node where you're issuing the command.
You might also want to consider starting a peer member with host-data=false.
Reference: http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#reference/system_procedures/derby/rrefimportdataproc_ex.html
So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload Parse-able file - Favorable option as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
By best I mean the fastest and least resource-hungry method.
Valid options:
Tab-delimited TXT file
Static XML file
Java class with tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this is going to affect memory beyond reasonable limits? On the other hand, with a flat file, every time I need to access a record the loop will go through a few thousand lines until it finds the required record ... reading line by line, so no memory issues, but incredibly slow ... I have had some experience parsing an Excel file in a Java servlet and it took something like 20 seconds just to parse 250 records; at a larger scale the response time WILL time out (no doubt about it), so is XML anything like Excel?
Thank you very much guys!! Please provide opinions; anything and everything is appreciated!
The easiest and fastest way would be to have the file as a static web resource, under the WEB-INF folder, and on application startup have a context listener load the file into memory.
In memory it should be a Map, mapping from the key you want to search by. This gives you more or less constant access time.
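A minimal sketch of such a listener, assuming a tab-delimited country file under WEB-INF; the file name and column layout are made up:

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CountryDataListener implements ServletContextListener {

    // Loaded once at startup; lookups afterwards are constant-time map gets.
    private static final Map<String, String> COUNTRIES = new HashMap<>();

    public static String countryName(String code) {
        return COUNTRIES.get(code);
    }

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                sce.getServletContext().getResourceAsStream("/WEB-INF/countries.txt"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");        // tab-delimited, as in the GeoNames files
                if (cols.length >= 2) {
                    COUNTRIES.put(cols[0], cols[1]);     // assumed: code in column 0, name in column 1
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load country data", e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        COUNTRIES.clear();
    }
}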
Memory consumption would only matter if the data set is really big. A hundred thousand records, for example, is not worth optimizing over if you need to access them this many times.
The static file should be plain text or CSV; these are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slower.
If the list is really big, you can break it up into multiple smaller files, and only parse those when they are required. A reasonable, easy partitioning would be to break it up by country, but any other partitioning would work (for example, based on the first few characters of the name).
You could also consider building this Map in memory once, then serializing it to a binary file and including that binary file as a static resource; that way you would only have to deserialize the Map, with no need to parse/process it as a text file and build the objects yourself.
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map data file would be to have it as a binary data file with your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file could be much smaller (compared to plain text / CSV / a serialized Map), and loading it would be much faster (because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly).
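A small sketch of that custom binary format idea; the record layout (country code plus name) is an assumption:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class BinaryCountryFile {

    // Writes the map as: record count, then (code, name) pairs as UTF strings.
    public static void write(Map<String, String> countries, String path) throws Exception {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            out.writeInt(countries.size());
            for (Map.Entry<String, String> e : countries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
    }

    // Reads the same layout back; no text parsing involved.
    public static Map<String, String> read(String path) throws Exception {
        Map<String, String> countries = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                countries.put(in.readUTF(), in.readUTF());
            }
        }
        return countries;
    }
}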
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
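A small sketch of option (a); the XML structure (country elements with a code attribute and a name child) is an assumption for illustration:

import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.File;

public class XmlLookup {

    private final Document doc;                                   // parsed once, kept in memory
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public XmlLookup(File xmlFile) throws Exception {
        this.doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);                                   // the only time the parsing cost is paid
    }

    // Example query against an assumed structure like <country code="US"><name>...</name></country>
    public String countryName(String code) throws Exception {
        return xpath.evaluate("/countries/country[@code='" + code + "']/name/text()", doc);
    }
}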
I need to parse complex (non-fixed-length) CSV files into Java objects in order to compare their values.
I first tried the Flatform Parsing Framework; I liked the approach of describing the values in an extra (XML) document. Maybe it's the right tool for simple CSV (and other flat) files. Nevertheless, my CSV files contain lines that vary in the number of fields, and sometimes a record spans multiple lines. There are also dependencies among those fields.
Here's a little sample (each type has a certain number of extra parameters):
; <COMMENTS (to be ignored)>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_C>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_D>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
So I need something to describe and parse the CSV file in a more complex manner. I'm new to this; I've heard about parser generators - is that what I need?
Try OpenCSV (see http://opencsv.sourceforge.net/#what-features). It handles embedded carriage returns just fine.
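A tiny sketch showing that behaviour; a quoted field containing a line break still comes back as part of a single record:

import com.opencsv.CSVReader;

import java.io.StringReader;

public class EmbeddedNewlineDemo {

    public static void main(String[] args) throws Exception {
        // The third field is quoted and contains a line break, but it is still one record.
        String csv = "NAME,TYPE_A,\"a description\nspanning two lines\",PARAM1\n";
        try (CSVReader reader = new CSVReader(new StringReader(csv))) {
            String[] record = reader.readNext();
            System.out.println(record.length + " fields");   // prints: 4 fields
            System.out.println(record[2]);                    // description with the embedded newline
        }
    }
}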
One option is to use the Scanner class, or you might want to check out Spring Batch. I've never actually used Spring Batch, but given that batch jobs often read from simple text files, I believe I read that it caters for this, including all sorts of object mapping.
You may also try japaki