Suppose I have 10 records and some of them are corrupted. How will Spring handle a restart?
For example, suppose records no. 3 and 7 are corrupt and they go to different reducers; how will Spring handle the restart?
1. How will it maintain the queue to track where it last failed?
2. What are the different ways we can solve this?
Spring Batch will do exactly what you tell Spring Batch to do.
Restart for Spring Batch means running the same job that failed with the same set of input parameters; however, a new execution of that job instance will be created.
The job will run on the same data set that the failed execution ran on.
In general, it is not a good idea to modify the input data set for your job - the input data of a MapReduce job must be immutable (I assume you will not modify the same data set you use as input).
In your case the job is likely to complete with BatchStatus.COMPLETED unless you put very specific logic into the last step of your Spring Batch job.
This last step would validate all records and, if any broken records are detected, artificially set the status of the job to BatchStatus.FAILED, like below:
jobExecution.setStatus(BatchStatus.FAILED);
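For illustration, here is a minimal sketch of what such a final validation step could look like. ValidateRecordsTasklet is a made-up name and the badRecordCount supplier is a hypothetical hook for whatever collected the broken records earlier in the job:

import java.util.function.LongSupplier;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Sketch of a last validation step: if any bad records were seen, the job
// is marked FAILED so that it can be restarted later.
public class ValidateRecordsTasklet implements Tasklet {

    private final LongSupplier badRecordCount; // hypothetical hook

    public ValidateRecordsTasklet(LongSupplier badRecordCount) {
        this.badRecordCount = badRecordCount;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        if (badRecordCount.getAsLong() > 0) {
            // Failing the step is the idiomatic way to fail the whole job ...
            contribution.setExitStatus(ExitStatus.FAILED);
            // ... but the job status can also be set directly, as in the snippet above.
            chunkContext.getStepContext().getStepExecution()
                        .getJobExecution().setStatus(BatchStatus.FAILED);
        }
        return RepeatStatus.FINISHED;
    }
}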
Now how to restart the job is a good question that I will answer in a few moments.
However, before restarting, the question you need to ask is: if the input data set of your MapReduce job and the code of your MapReduce job have not changed, how is a restart going to help you?
I think you need some kind of data set where you dump all the bad records that the original MapReduce job failed to process. Then, how to process these broken records is for you to decide.
Anyway, restarting a Spring Batch job is easy once you know the ID of the failed JobExecution. Below is the code:
final Long restartId = jobOperator.restart(failedJobId);
final JobExecution restartExecution = jobExplorer.getJobExecution(restartId);
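If you do not already have the ID of the failed execution at hand, a sketch like the one below (JobRestarter and restartLastFailed are made-up names) can look it up with JobExplorer and restart it:

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;

// Finds the most recent FAILED execution of the given job and restarts it.
public class JobRestarter {

    private final JobExplorer jobExplorer;
    private final JobOperator jobOperator;

    public JobRestarter(JobExplorer jobExplorer, JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobOperator = jobOperator;
    }

    public Long restartLastFailed(String jobName) throws Exception {
        // Look through the most recent instances of the job.
        for (JobInstance instance : jobExplorer.getJobInstances(jobName, 0, 10)) {
            for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                if (execution.getStatus() == BatchStatus.FAILED) {
                    return jobOperator.restart(execution.getId());
                }
            }
        }
        return null; // nothing to restart
    }
}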
EDIT
Read about the ItemReader, ItemWriter and ItemProcessor interfaces.
I think that you can achieve tracking by using a CompositeItemProcessor.
In Hadoop every record in a file must have a unique ID. So I think you can store the list of IDs of the bad records in the job's ExecutionContext, under a key you create when the job starts for the first time; call it badRecordsList. Now when you restart/resume your job, you will read the value of badRecordsList and will have a reference.
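As a rough sketch of that idea (MyRecord, isValid() and getId() are placeholders for your own record type), an ItemProcessor could collect the IDs of bad records into the job's ExecutionContext under the badRecordsList key:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemProcessor;

// Placeholder for whatever record type your reader produces.
interface MyRecord {
    boolean isValid();
    String getId();
}

public class BadRecordTrackingProcessor implements ItemProcessor<MyRecord, MyRecord> {

    private ExecutionContext jobContext;

    @BeforeStep
    public void saveJobContext(StepExecution stepExecution) {
        // The job-level ExecutionContext is persisted and available on restart.
        this.jobContext = stepExecution.getJobExecution().getExecutionContext();
    }

    @Override
    public MyRecord process(MyRecord item) {
        if (!item.isValid()) {
            @SuppressWarnings("unchecked")
            List<String> badIds = (List<String>) jobContext.get("badRecordsList");
            if (badIds == null) {
                badIds = new ArrayList<>();
            }
            badIds.add(item.getId());
            jobContext.put("badRecordsList", badIds);
            return null; // filter the broken record out of the chunk
        }
        return item;
    }
}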
Related
I have a scenario where my Spring Batch job runs every 3 minutes.
The steps should be:
Each user's records should get processed in parallel. Each user can have a maximum of 150k records.
Every user can have update and delete records. Update records should run before delete records.
Update/delete sets should run in parallel on their own, but strictly all updates should complete before the deletes.
Can anyone suggest the best approach to achieve parallelism at multiple levels while keeping the order at the update and delete level?
I am looking at something around Spring's async executor service, parallel streams and other Spring libraries; Rx only if it gives some glaring performance gain that the above can't provide.
Glaring performance comes from the design of your Spring Batch implementation, and we are sure you will get it with Spring Batch, as we are processing millions of records with select, delete and update.
Each user's records should get processed in parallel. Each user can have a maximum of 150k records.
"Partition the selection based on user, and each user will run as a parallel step."
Every user can have update and delete records. Update records should run before delete records.
"Create a composite writer with delegates added in order: the update writer first and the delete writer second."
Update/delete sets should run in parallel on their own, but strictly all updates should complete before the deletes.
"Each delegate writer (update and delete) takes part in the chunk transaction, and the delegate order makes sure the update executes first."
Please refer to the links below:
Spring Batch multiple process for heavy load with multiple thread under every process
Composite Writer Example
Spring Batch - Read a byte stream, process, write to 2 different csv files convert them to Input stream and store it to ECS and then write to Database
How can I access the job creation time using TaskContext?
I'm planning to get this time in the different executors and store it with the persisted data, which is helpful in later processes. Since the job creation time is unique even when retrieved from different executors, it helps to keep track of the data persisted by a single job.
Is it possible to get it from TaskMetrics?
How can I access the JobData class?
I could do this using listeners. I had to extend the SparkListener class and track when a job (all the tasks for a job) starts and ends, and perform actions depending on that.
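A minimal sketch of that listener (JobTimeListener is a made-up name) might look like the following. Note that listener callbacks run on the driver, so the captured time still has to be handed to the executors, e.g. via a broadcast variable, before it can be stored alongside the persisted data:

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;
import org.apache.spark.scheduler.SparkListenerJobStart;

// Tracks job start/end on the driver and captures the submission time,
// which can serve as the unique "creation time" to tag persisted data with.
public class JobTimeListener extends SparkListener {

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        int jobId = jobStart.jobId();
        long submissionTime = jobStart.time();
        System.out.println("Job " + jobId + " started at " + submissionTime);
        // e.g. broadcast submissionTime so the executors can use it
    }

    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        System.out.println("Job " + jobEnd.jobId() + " finished: " + jobEnd.jobResult());
    }
}

The listener is registered on the SparkContext before the jobs run, for example with sparkSession.sparkContext().addSparkListener(new JobTimeListener()).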
Is there a Quartz feature to store and read instance-specific results with a job instance? The simplest example is a job's status (running, finished OK, finished with errors + message). This would also mean storing a history of the jobs that have already run.
I have only seen examples of storing job data that is not instance-specific.
Thanks,
Karsten
The Quartz API provides a way in which I can create a job and add it to the scheduler for future use by doing something like:
StdSchedulerFactory.getDefaultScheduler().addJob(jobDetail, false);
This gives me the flexibility to create jobs, store them with the scheduler, and use them at a later stage.
I am wondering, is there any way I can create triggers and add them to the scheduler to be used in the future?
Not sure if this is a valid requirement, but if it's not possible then all I have to do is associate the trigger with a given/existing job.
In Quartz there is a one-to-many relationship between jobs and triggers, which is understandable: one job can be run by several different triggers but one trigger can only run a single job. If you need to run several jobs, create one composite job that runs these jobs manually.
Back to your question. Creating a job without associated triggers is a valid use-case: you have a piece of logic and later you will attach one or more triggers to execute it at different points in time.
The opposite situation is weird - you want to create a trigger that will run something at a given time, but you don't know yet what. I can't imagine a use case for that.
Note that you can create a trigger for future use (with next fire time far in the future), but it must have a job attached.
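For example, a durable job can be stored now with no trigger, and a trigger attached later by referring to the stored job's key (MyJob, reportJob and reportTrigger are made-up names):

import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.impl.StdSchedulerFactory;

public class StoreJobForLaterUse {

    // Placeholder job implementation.
    public static class MyJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("running " + context.getJobDetail().getKey());
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // Store the job now with no trigger; storeDurably() is required for
        // a job that is added without one.
        JobDetail job = newJob(MyJob.class)
                .withIdentity("reportJob", "reports")
                .storeDurably()
                .build();
        scheduler.addJob(job, false);

        // ... later: attach a trigger to the already-stored job by its key.
        Trigger trigger = newTrigger()
                .withIdentity("reportTrigger", "reports")
                .forJob("reportJob", "reports")
                .startNow()
                .build();
        scheduler.scheduleJob(trigger);
    }
}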
Finally, check out How-To: Storing a Job for Later Use in the official documentation.
In my webservice, all method calls submit jobs to a queue. Basically these operations take a long time to execute, so they all submit a job to a queue and return a status saying "Submitted". Then the client keeps polling another service method to check the status of the job.
Presently, what I do is create my own Queue and Job classes that are Serializable and persist these jobs (i.e. their serialized byte-stream form) into the database. So an UpdateLogistics operation just queues up an "UpdateLogisticsJob" and returns. I have written my own JobExecutor which wakes up every N seconds, scans the database table for any existing jobs, and executes them. Note the jobs have to be persisted because they must survive app-server crashes.
This was done a long time ago, and I used bespoke classes for my queues, jobs, executors, etc. But now I would like to know whether someone has done something similar before. In particular:
Are there frameworks available for this? Something in Spring/Apache, etc.?
Any framework that is easy to adapt/debug and plays well with libraries like Spring would be great.
EDIT - Quartz
Sorry if I had not explained enough: Quartz is good for stateless jobs (and also for some stateful jobs), but the key for me is very stateful, persisted "job instances" (not just jobs or tasks). So, for example, an operation executeWorkflow("SUBMIT_LEAVE") might actually create 5 job instances, each with at least 5-10 parameters like userId, accountId, etc. to be saved into the database.
I was looking for support around that area, where job instances can be saved into the DB and recreated, etc.
Take a look at JBoss jBPM. It's a workflow definition package that lets you mix automated and manual processes. Tasks are persisted to a DB back end, and it looks like it has some asynchronous execution properties.
I haven't used Quartz for a long time, but I suspect it would be capable of everything you want to do.
spring-batch plus quartz
Depending upon the nature of your job, you might also look into spring-integration to assist with queue processing. But spring-batch will probably handle most of your requirements.
Please try ted-driver (https://github.com/labai/ted)
Its purpose is similar to what you need - you create a task (or many of them), which is saved in the DB, and then ted-driver is responsible for executing it. On error you can postpone a retry for later or finish with an error status.
Unlike other Java frameworks, here the tasks are stored in a simple and clear structure in the database, where you can manually search or update them using standard SQL.