Multiple readers in a single step of a single job using Spring Batch - Java

I am a newbie in Spring Batch. I have a use case in which I have to read files from a specific folder and write their contents to a DB.
For example, I have files in a folder like this:
-company_group
|
-my_company_group.json
-my_company_group_alternate_id.json
-sg_company_group.json
-sg_company_group_alternate_id.json
Note: sg = Singapore, my=Malaysia
Now, I want to read these files in the following order:
SG files should be read before MY files.
For each country, the alternate file should come first.
For example,
sg_company_group_alternate_id.json
sg_company_group.json
And the same for the MY files.
Currently, I'm reading all the files by writing a custom MultiResourcePartitioner and sorting the files in the order I mentioned above.
There will be 1 writer and reader for 1 file.
There will be 1 job.
Now, the problem is that I have a step with the custom partitioner mentioned above: it collects and sorts all the files, but they all go through only one reader. I want a separate reader per file.
In other words, in 1 job I have a step which loads all the files. Within this step, one file should be read and written to the DB, and then the process repeated for the next file in the same step.
As per my understanding, Spring Batch does not allow multiple readers in 1 step.
Is there any workaround?
Thanks.

I would recommend creating a job instance per file, i.e. passing the file as an identifying job parameter (a minimal launch sketch follows the list below). This has at least two major benefits:
Scaling: you can run multiple jobs in parallel, each job processing a different file
Fault-tolerance: if a job fails, you only restart the failed job, without affecting other files
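A minimal launch sketch, assuming a configured JobLauncher and a hypothetical companyGroupJob bean; the file path is passed as the identifying parameter "inputFile":

import java.nio.file.Path;
import java.util.List;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public void launchJobPerFile(JobLauncher jobLauncher, Job companyGroupJob, List<Path> files) throws Exception {
    for (Path file : files) {
        // The file path is an identifying parameter, so each file gets its own job instance
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", file.toAbsolutePath().toString())
                .toJobParameters();
        jobLauncher.run(companyGroupJob, params);
    }
}

With an asynchronous JobLauncher (for example one backed by a TaskExecutor), these launches run in parallel rather than sequentially.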

Related

Spring Batch: Create new steps at job runtime

Context: I am working on a Spring Batch pipeline which will upload data to a database. I have already figured out how to read flat .csv files and write items from them with JdbcBatchItemWriter. The pipeline I am working on must read data from a Zip archive which contains multiple .csv files of different types. I'd like to have archive downloading and inspecting as the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip file content to determine the .csv file paths inside the Zip file system and their types. Inspecting the zip file makes it easy to obtain an InputStream for a corresponding csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
Question: Is there any way to dynamically create new Steps for each of the discovered csv entries of the zip at job runtime, as a result of the inspect step?
Tried: I know that Spring Batch has conditional flow, but it seems that this technique only allows configuring a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are discovered in the second step of the job.
Also, there is a MultiResourceItemReader which allows reading multiple resources sequentially. But I'm going to read different types of csv files with appropriate readers. Moreover, I'd like to have "file-wise" step encapsulation such that if one file-loading step fails, the others will be executed anyway.
The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as the answer assumes step creation before the job runs, while I need to add steps as a result of the second step of the job.
You could use a partitioned step:
During your inspect step, put the list of csv entries (as resources) into the JobExecutionContext.
In the partition method, retrieve that list and create a partition for each csv entry.
The worker step will then be executed once for each partition created.
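A minimal sketch of such a partitioner, assuming the inspect step stored the discovered csv paths in the job execution context and they are handed to the partitioner as a plain list (the key name "csvEntry" is an assumption):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvEntryPartitioner implements Partitioner {

    private final List<String> csvEntries; // discovered during the inspect step

    public CsvEntryPartitioner(List<String> csvEntries) {
        this.csvEntries = csvEntries;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String entry : csvEntries) {
            ExecutionContext context = new ExecutionContext();
            // A step-scoped reader can read this value to open the right zip entry
            context.putString("csvEntry", entry);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}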

Apache Beam Global Counting

I am trying to understand the best way of solving the following:
As a simple example scenario, I have a file which describes a test name and whether its execution passed (true/false).
test-scenario,passed
--------------------
testA,true
testB,false
Using Apache Beam I can read and parse the file into a PCollection<TestDetails> and then, with subsequent transforms, write all test details which have passed to one set of files and likewise for those tests which failed.
After writing the above files I would finally like to generate some counts: the total number of file records processed, the number of tests that passed, the number of tests that failed, and write these details to a single file.
Should I use a global combine for this?
For this purpose, you can use Beam Metrics (please see the documentation). It provides counters that can be used for the needs you described above, and the metrics can be fetched once your pipeline has finished. Please take a look at this example. Beam also allows exporting metrics to an external sink, if that is more convenient.
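A minimal sketch of such counters inside a DoFn (TestDetails and its isPassed() accessor are placeholders from the question):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class CountResults extends DoFn<TestDetails, TestDetails> {
    private final Counter total  = Metrics.counter(CountResults.class, "totalRecords");
    private final Counter passed = Metrics.counter(CountResults.class, "passedTests");
    private final Counter failed = Metrics.counter(CountResults.class, "failedTests");

    @ProcessElement
    public void processElement(@Element TestDetails test, OutputReceiver<TestDetails> out) {
        total.inc();                     // total number of records processed
        if (test.isPassed()) {
            passed.inc();                // tests that passed
        } else {
            failed.inc();                // tests that failed
        }
        out.output(test);
    }
}

After pipeline.run().waitUntilFinish(), the counter values can be queried from the PipelineResult's metrics and written to a single summary file.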

Create output file for each of input file from directory of files in Spring Batch

I have a directory of CSV files which contains transaction information.
I need to read each file, apply some business logic (validate against the DB to check whether the account and transaction are valid or not) and write the valid transactions into a new output file.
Input:
Tranx_100.csv, Tranx_101.csv, Tranx_102.csv....
Output:
Tranx_100_output.csv, Tranx_101_output.csv, Tranx_102_output.csv....
I want to use Spring Batch for this. Any suggestions on how to implement it?
For each file the input, processing, and output are the same - can I run them as part of a 'Step' and repeat the step for each input file in a job?
Instead of looping on the same step or using multiple steps in a single job, I would use a job instance for each file (i.e. the file is an identifying job parameter) and launch jobs in parallel; a sketch of a reader bound to that parameter follows the list below. This approach has multiple advantages:
Fault-tolerance: In case of failure, only the failed file is re-processed
Scalability: You can run multiple jobs in parallel easily
Logging: logs will be separated by design
And all the good reasons for making one thing do one thing and do it well.
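A minimal sketch of a step-scoped reader bound to such a parameter (the parameter name "inputFile", the Transaction class, and the column names are assumptions):

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

@Bean
@StepScope
public FlatFileItemReader<Transaction> transactionReader(
        @Value("#{jobParameters['inputFile']}") String inputFile) {
    // Each job instance reads only the file it was launched with
    return new FlatFileItemReaderBuilder<Transaction>()
            .name("transactionReader")
            .resource(new FileSystemResource(inputFile))
            .delimited()
            .names("accountId", "amount", "date")
            .targetType(Transaction.class)
            .build();
}

A step-scoped FlatFileItemWriter can be bound to a derived output name (e.g. "outputFile") in the same way.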

When multiple MapReduce jobs are chained, is the output of each written to HDFS?

Let us say multiple MapReduce jobs are chained, such as shown below.
Map1-Reduce1 -> Map2-Reduce2 -> ... -> MapN-ReduceN
Would the output of each MapReduce job be written to HDFS? For example, would the output of Map1-Reduce1 be written to HDFS? And in case of failure of tasks in Map2-Reduce2, can Map2-Reduce2 restart by reading the output of Map1-Reduce1, which is already in HDFS?
You can achieve this by extending the Configured class and writing multiple job configurations, i.e. one for each M-R job. The output path of one M-R instance serves as the input to the next.
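A minimal sketch of that chaining with the mapreduce API (paths are placeholders, and the setMapperClass/setReducerClass calls are omitted for brevity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Path intermediate = new Path("/tmp/job1-output");

Job job1 = Job.getInstance(conf, "map1-reduce1");
FileInputFormat.addInputPath(job1, new Path("/data/input"));
FileOutputFormat.setOutputPath(job1, intermediate);   // persisted to HDFS
if (!job1.waitForCompletion(true)) {
    System.exit(1);
}

Job job2 = Job.getInstance(conf, "map2-reduce2");
FileInputFormat.addInputPath(job2, intermediate);     // read job1's output back from HDFS
FileOutputFormat.setOutputPath(job2, new Path("/data/final-output"));
System.exit(job2.waitForCompletion(true) ? 0 : 1);

Because the intermediate output lives on HDFS, a restarted job2 can simply read it again without re-running job1.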
Yes, you can use Oozie to pass your output from one MapReduce job to another via HDFS. You should also check out the ChainMapper class in Hadoop.
You can use either Oozie or Spring Batch; both are suited to your use case. You can write the output of each step to HDFS and read it back in the next MapReduce job.

Java questions CSV/batch/js in jar

I have multiple questions here that may sound annoying...
What is batch processing in Java? Is it related to .bat files, and how do I write batch files?
How do I read CSV files in Java? What are CSV files, and how do we know which value represents which thing?
Can we include js files in a jar? If yes, then how?
How do I compile a Java file from the command prompt and specify the jar it uses?
1) What is batch processing in Java? Is it related to .bat files, and how do I write batch files?
Batch Processing is not Java specific. It is explained pretty well in this Wikipedia article
Batch processing is execution of a series of programs ("jobs") on a computer without manual intervention.
Batch jobs are set up so they can be run to completion without manual intervention, so all input data is preselected through scripts or command-line parameters. This is in contrast to "online" or interactive programs which prompt the user for such input. A program takes a set of data files as input, processes the data, and produces a set of output data files. This operating environment is termed as "batch processing" because the input data are collected into batches of files and are processed in batches by the program.
There are different ways to implement batch processing in Java, but I guess the most powerful library available is Spring Batch (though it has a steep learning curve). Batch processing is only marginally related to Windows .bat batch files.
2) How do I read CSV files in Java? What are CSV files, and how do we know which value represents which thing?
When dealing with CSV (or other structured data, like XML, JSON or database contents), you usually want to map the data to Java objects, so you need a library that does Object mapping. For CSV, OpenCSV is such a library (see this section on Java bean mapping).
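A minimal sketch with OpenCSV's bean mapping (the Person class with @CsvBindByName-annotated fields and the file name are assumptions):

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.opencsv.bean.CsvToBeanBuilder;

try (Reader reader = Files.newBufferedReader(Paths.get("people.csv"))) {
    // Column headers in the csv are matched to the annotated fields of Person
    List<Person> people = new CsvToBeanBuilder<Person>(reader)
            .withType(Person.class)
            .build()
            .parse();
}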
3) Can we include js files in a jar? If yes, then how?
See gdj's answer. You can put anything in a jar, but resources in a jar will not be available as File objects, only as an InputStream obtained via the Class.getResourceAsStream(name) or ClassLoader.getResourceAsStream(name) methods.
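For example, a minimal sketch (the resource path and the anchor class MyApp are assumptions):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Reads a js file bundled in the jar from the classpath, not from the file system
try (InputStream in = MyApp.class.getResourceAsStream("/scripts/app.js")) {
    if (in == null) {
        throw new IOException("scripts/app.js not found on the classpath");
    }
    String script = new String(in.readAllBytes(), StandardCharsets.UTF_8);
}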
Batch processing is not a Java-specific term. Whenever you perform an action on a group of objects/files, we can call it batch processing. ".bat" files are the Windows equivalent of shell scripts. They do not have any particular connection to Java or to batch processing in Java.
CSV stands for "Comma Separated Values", i.e. each column in a line of the file is delimited by a comma. You can read CSV files using a normal FileReader and then use a StringTokenizer to parse each line.
I guess we could include anything in a jar file. I don't see what would prevent that.
There is no direct relationship between Java and .bat. Batch files are files written in the Windows shell language. Sometimes we use .bat files to run our Java programs on Windows. In this case, the batch file is typically used to generate the java command line, like
java -cp THE-CLASSPATH com.mycompany.Main arg1 arg2
You can read a CSV file as a regular text file and then split each line using the String.split() method. Alternatively, you can use one of the available open-source CSV parsers, e.g. from Apache Commons: http://commons.apache.org/sandbox/csv/apidocs/org/apache/commons/csv/CSVParser.html
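A minimal sketch of the plain-text approach (note that a bare split ignores quoted fields, which a real CSV parser handles):

import java.io.BufferedReader;
import java.io.FileReader;

try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] columns = line.split(",");
        // columns[0], columns[1], ... follow the order of the CSV header
    }
}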
A JAR file is just a ZIP file. You can include anything in a ZIP, including js files. How to do this depends on how you create the jar file in the first place. If, for example, you are using an Ant script, just add *.js to the include pattern.
If you need a more specific answer, ask a more specific question.
1) Processing a lot of data at once.
2) CSV is comma-separated values, a file format. Try the OpenCSV library.
3) Yes, but you can only read them from Java code (you can't tell Apache to serve them directly over HTTP).
