I have a scenario where my Spring Batch job runs every 3 minutes.
The steps should be:
Each user's records should be processed in parallel. Each user can have a maximum of 150k records.
Every user can have update and delete records. The update records should run before the delete records.
The update and delete sets should each run in parallel on their own, but strictly all updates must complete before the deletes start.
Can anyone suggest the best approach to achieve parallelism at multiple levels while keeping the update-before-delete ordering?
I am looking at something around Spring's async executor service, parallel streams and other Spring libraries; Rx only if it gives some glaring performance benefit that the options above can't provide.
The "glaring performance" comes down to the design of your Spring Batch implementation; we are confident you will get it with Spring Batch, as we process millions of records with selects, updates and deletes.
Each user's records should be processed in parallel. Each user can have a maximum of 150k records.
"Partition the selection by user, and run each user's partition as a parallel step."
Every user can have update and delete records. The update records should run before the delete records.
"Create a CompositeItemWriter and add the delegates: the update writer first and the delete writer second."
The update and delete sets should each run in parallel on their own, but strictly all updates must complete before the deletes.
"Each writer (update and delete) manages its transaction, and the delegate order makes sure the updates execute first."
Please refer to the following:
Spring Batch multiple process for heavy load with multiple thread under every process
Composite Writer Example
Spring Batch - Read a byte stream, process, write to 2 different csv files convert them to Input stream and store it to ECS and then write to Database
So I have a Spark application that reads DB records (let's say 1,000 records), processes them, and writes a CSV file (with 1,000 lines) out to cloud object storage. Three questions here:
Is the DB read request sent to the executors? If so, in the case of 1,000 DB records, would each executor read part of the DB data (for example, 500 records each) and send the records back to the driver? Or does it write to a central cache that the driver reads from?
The next step, processing the DB records (a fold job), is sent to 2 executors. Let's say each executor gets around 500 records. Once an executor finishes processing its partition, does it send all 500 processed (formatted) rows back to the driver? Or does it write to some central cache that the driver reads from? How does the data exchange between driver and executors happen?
The last step is the .save CSV call in my main() function. In this code I am doing a repartition(1) with the idea that I will only save this file from one executor. If so, how is the data collected onto that one executor? Remember, earlier we had two executors process 500 records each. How is the total of 1,000 records sent to one executor and saved into the object storage by that one executor? How is the data from all the executors shared with the one executor executing the .save?
dataset.repartition(1)
.write()
.format("csv")
.option("header", "true")
.save(filepath);
If I don't do repartition(1), will the save happen from multiple executors, and would they overwrite each other? I don't think there is a way to specify a unique filename using Spark. Do I have to save the file to a temp location and rename it later, and all that?
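For context, this is roughly the temp-and-rename variant I have in mind (the paths and names are made up, and I have not verified this against our object storage):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SingleCsvWriter {
    public static void writeSingleCsv(SparkSession spark, Dataset<Row> dataset) throws Exception {
        String tempDir = "s3a://my-bucket/tmp-report";   // placeholder temp output directory
        String finalFile = "s3a://my-bucket/report.csv"; // placeholder final file name

        dataset.repartition(1)
               .write()
               .format("csv")
               .option("header", "true")
               .save(tempDir);

        // Spark always writes a directory of part files; with a single partition there is
        // exactly one part-* file, which gets renamed to the desired name.
        FileSystem fs = new Path(tempDir).getFileSystem(spark.sparkContext().hadoopConfiguration());
        FileStatus[] parts = fs.globStatus(new Path(tempDir + "/part-*"));
        fs.rename(parts[0].getPath(), new Path(finalFile));
        fs.delete(new Path(tempDir), true); // clean up the temp directory (_SUCCESS marker etc.)
    }
}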
Are there any articles or YouTube videos that explain how data is distributed and collected or shared across executors? I can understand how .count() works, but how does .save work, and how are large results like millions of DB records or rows shared across executors? I have been looking for resources to read and can't find one that answers my questions. I am very new to Spark, like 3 weeks new.
I have two DBs, and I use a DB link to connect to DB2 from DB1.
I am using JDBC, the C3P0 jar, and Oracle 11g. This is a batch job that runs every day.
The first run is successful, with around 400k records inserted by the MERGE command; during the second run (it's a daily job) we are facing issues. I guess the issue is because of the MERGE query? The MERGE has a condition: if the record exists, update it, otherwise insert it. It will most likely re-process the same 400k records, which are (for now) mostly identical, so is this lookup-and-update/insert causing the problem?
This is how my logic looks:
void method() {
    // pstmt is the SELECT PreparedStatement, stmt is the MERGE PreparedStatement.
    // 1. Select query where the DB link is involved; it joins DB1 and DB2 tables.
    ResultSet rs = pstmt.executeQuery();

    // 2. Iterate the result and build batch updates:
    //    save (using MERGE) the result from step 1 into a table that is not used
    //    in the step 1 query. We are dealing with over 100k records here.
    while (rs.next()) {
        // ... bind the MERGE parameters from the current row ...
        stmt.addBatch();
    }

    // 3. Execute the batch - this is where the SQLException occurs.
    stmt.executeBatch();

    closeResultSet(rs);
    clearBatch(stmt);
    closePreparedStatement(stmt);
    closePreparedStatement(pstmt); // the select statement
}
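For reference, the MERGE in step 2 follows the usual Oracle "update if it exists, otherwise insert" shape; the table and column names below are placeholders, not our real ones:

String mergeSql =
    "MERGE INTO target_table t " +
    "USING (SELECT ? AS id, ? AS val FROM dual) s " +
    "ON (t.id = s.id) " +
    "WHEN MATCHED THEN UPDATE SET t.val = s.val " +
    "WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)";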
There is no closing of the DB link in my code. The logic above runs on roughly 8 threads in parallel (give or take), with different inputs, each having its own connection to the DB.
Here are my thoughts and doubts:
I think each thread will create its own DB link.
Do I need to increase the shared pool size, as suggested by the link here?
If it is because of the DB link, why am I facing the exception in executeBatch(), where the DB link is not used?
I do not think closing the DB link will work in my case, since the threads run in parallel and will mostly reach that code at the same time. Do I need to change the configuration to allow more DB links?
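For reference, this is what an explicit close would look like if I did add it (the link name is a placeholder, and the session must not have an open transaction still using the link; Oracle's OPEN_LINKS init parameter, default 4, caps how many links a single session can hold open):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class DbLinkCleanup {
    // Closes this session's DB link once the thread's batch work has been committed.
    static void closeDbLink(Connection connection) throws SQLException {
        connection.commit(); // the link cannot be closed while a transaction is still using it
        try (Statement linkStmt = connection.createStatement()) {
            linkStmt.execute("ALTER SESSION CLOSE DATABASE LINK my_db_link"); // placeholder link name
        }
    }
}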
We have a requirement to read a CSV file, and for each line read there shall be a REST call. The REST call returns an array of elements, and for each element returned there shall be an update in the DB.
There is an NFR to achieve parallel processing in the above requirement.
After reading the CSV, each individual line has to be processed in parallel, i.e., there shall be a group of workers making concurrent REST calls, one per line read.
Subsequently, for each array element found in the response of the REST call, there shall be parallel updates to the DB as well.
Any valuable suggestions or ideas on how to achieve this requirement in Spring Batch?
We have thought of two approaches. The first one is to go with a master/worker design for doing the REST calls on each CSV line that is read.
Each worker here would correspond to one line read from the CSV file; it would perform the REST call, and when the response is returned, each of these workers would become a master itself and launch another set of workers based on the number of array elements returned from the REST call.
Each worker launched in that second set would handle one element of the response and perform the DB updates in parallel.
Is this even possible, and is it a good solution?
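For illustration, here is a rough sketch of how we imagine flattening the first approach into a single chunk-oriented step using Spring Batch's AsyncItemProcessor and AsyncItemWriter (all class and bean names are placeholders, as are the REST-calling processor and the DB writer):

import java.util.List;
import java.util.concurrent.Future;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class CsvRestUpdateConfig {

    // Wraps the REST-calling processor so each CSV line is processed on the task executor,
    // i.e. the REST calls for different lines run concurrently.
    @Bean
    public AsyncItemProcessor<CsvLine, List<RestElement>> asyncRestProcessor(
            ItemProcessor<CsvLine, List<RestElement>> restCallProcessor, // placeholder: one REST call per line
            TaskExecutor taskExecutor) {
        AsyncItemProcessor<CsvLine, List<RestElement>> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(restCallProcessor);
        asyncProcessor.setTaskExecutor(taskExecutor);
        return asyncProcessor;
    }

    // Unwraps the Futures and hands each list of response elements to the DB writer.
    @Bean
    public AsyncItemWriter<List<RestElement>> asyncDbWriter(
            ItemWriter<List<RestElement>> dbUpdateWriter) {              // placeholder: updates the DB per element
        AsyncItemWriter<List<RestElement>> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(dbUpdateWriter);
        return asyncWriter;
    }

    @Bean
    public Step csvStep(StepBuilderFactory steps,
                        FlatFileItemReader<CsvLine> csvReader,           // placeholder CSV reader
                        AsyncItemProcessor<CsvLine, List<RestElement>> asyncRestProcessor,
                        AsyncItemWriter<List<RestElement>> asyncDbWriter) {
        return steps.get("csvStep")
                .<CsvLine, Future<List<RestElement>>>chunk(100)
                .reader(csvReader)
                .processor(asyncRestProcessor)
                .writer(asyncDbWriter)
                .build();
    }
}

/** Placeholder domain types for this sketch. */
class CsvLine { }
class RestElement { }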
The second one is a JobStep-based approach, i.e., launching one Job from another. If we follow this approach, how do we communicate the data between the two Jobs? Suppose our first job (Job1) reads the CSV and makes the REST call, and the second job (Job2) is responsible for processing the individual response elements from the REST call. In that case, how do we pass the response element data from Job1 to Job2? And in this scenario, can Job1 launch multiple Job2 instances for parallel DB updates?
The solutions we have thought of may not be very clear and may be confusing; we are not sure whether they are even correct and feasible.
Apologies for any confusion caused, but we are clueless about how to come up with a design for this requirement.
In either case, we are also not clear on how failures will be tracked and how the job can be re-run from the failed state.
Any help is much appreciated!!
Suppose I have 10 records and some of them are corrupted records; how will Spring handle the restart?
For example, suppose records no. 3 and 7 are corrupt and they go to different reducers; how will Spring handle the restart?
1. How will it maintain the queue to track where it last failed?
2. What are the different ways we can solve this?
Spring Batch will do exactly what you tell Spring Batch to do.
A restart for Spring Batch means running the same job that failed, with the same set of input parameters; however, a new instance (execution) of that job will be created.
The job will run on the same data set that the failed instance of the job ran on.
In general, it is not a good idea to modify the input data set for your job - the input data of a MapReduce job must be immutable (I assume you will not modify the same data set you use as input).
In your case the job is likely to complete with BatchStatus.COMPLETED unless you put some very specific logic in the last step of your Spring Batch job.
That last step would validate all records and, if any broken records are detected, artificially set the status of the job to BatchStatus.FAILED, like below:
jobExecution.setStatus(BatchStatus.FAILED);
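A minimal sketch of such a final validation step as a Tasklet (my own illustration; the hasBrokenRecords() check is a placeholder for your validation logic):

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ValidationTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        if (hasBrokenRecords()) {
            JobExecution jobExecution =
                    chunkContext.getStepContext().getStepExecution().getJobExecution();
            // Artificially mark the whole job as failed so it can be restarted.
            jobExecution.setStatus(BatchStatus.FAILED);
            jobExecution.setExitStatus(ExitStatus.FAILED);
        }
        return RepeatStatus.FINISHED;
    }

    private boolean hasBrokenRecords() {
        return false; // placeholder: your validation of the processed records goes here
    }
}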
Now, how to restart the job is a good question that I will answer in a moment.
However, before restarting, the question you need to ask is: if neither the input data set of your MapReduce job nor the code of your MapReduce job has changed, how is a restart going to help you?
I think you need some kind of data set where you dump all the bad records that the original MapReduce job failed to process. How to process those broken records afterwards is for you to decide.
Anyway, restarting a Spring Batch job is easy once you know the ID of the failed jobExecution. Below is the code:
final Long restartId = jobOperator.restart(failedJobId);
final JobExecution restartExecution = jobExplorer.getJobExecution(restartId);
EDIT
Read about the ItemReader, ItemWriter and ItemProcessor interfaces.
I think you can achieve the tracking by using a CompositeItemProcessor.
In Hadoop every record in a file must have a unique ID. So I think you can store the list of IDs of the bad records in the job's ExecutionContext under a key you define when the job starts for the first time - call it badRecordsList. Now, when you restart/resume the job, you read the value of badRecordsList and have your reference.
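A minimal sketch of collecting those IDs (my own illustration, not tested; the badRecordsList key, the MyRecord type and its getId() are assumptions for this example):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.SkipListener;
import org.springframework.batch.item.ExecutionContext;

public class BadRecordTracker implements SkipListener<MyRecord, MyRecord> {

    private final JobExecution jobExecution;

    public BadRecordTracker(JobExecution jobExecution) {
        this.jobExecution = jobExecution;
    }

    @Override
    public void onSkipInProcess(MyRecord item, Throwable t) {
        // Append the bad record's ID to the job-level ExecutionContext so a
        // restarted execution can read the list back under "badRecordsList".
        ExecutionContext ctx = jobExecution.getExecutionContext();
        @SuppressWarnings("unchecked")
        List<Long> badIds = (List<Long>) ctx.get("badRecordsList");
        if (badIds == null) {
            badIds = new ArrayList<>();
        }
        badIds.add(item.getId());
        ctx.put("badRecordsList", badIds);
    }

    @Override
    public void onSkipInRead(Throwable t) { }

    @Override
    public void onSkipInWrite(MyRecord item, Throwable t) { }
}

/** Placeholder record type for this sketch. */
class MyRecord {
    private final long id;
    MyRecord(long id) { this.id = id; }
    long getId() { return id; }
}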
In my project, we have the following processes:
1. A Spring Batch job that reads X records from a DB table and dumps them into RabbitMQ as a topic.
2. A Spring XD stream that takes the messages from the queue and writes them to a file.
3. Another stream that takes the same records from the queue and puts them in a table.
4. An independent Spring Batch job that runs about 6 hours later and sends the file generated in (2) to a third-party vendor.
I want to make sure that the stream in (2) has finished processing. I was thinking of two options:
Have a dummy record at the end of the records in the queue that indicates completion (hacky; I would prefer not to do this).
Have some sort of batch identifier and verify that the queue does not contain any message with that batch identifier (how would this work?).
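For what it's worth, here is a rough sketch of a simpler drain check I could run before job (4) sends the file, using the plain RabbitMQ Java client (the host and queue name are placeholders, and this only counts ready messages, not unacked in-flight ones):

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class QueueDrainCheck {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host"); // placeholder host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Passive declare does not change the queue; it only returns its current stats.
            AMQP.Queue.DeclareOk ok = channel.queueDeclarePassive("records-queue"); // placeholder queue name
            if (ok.getMessageCount() == 0) {
                System.out.println("Queue drained - safe to send the file");
            } else {
                System.out.println(ok.getMessageCount() + " messages still pending");
            }
        }
    }
}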
Any alternative suggestions on this problem? Thanks in advance!