I am using Spring Batch to read some data from CSV files and put it in a database.
My batch job must be composed of 2 steps:
1. Check the files (names, extensions, content, ...)
2. Read lines from the CSV files and save them in the DB (ItemReader, ItemProcessor, ItemWriter, ...)
Step 2 must not be executed if Step 1 generated an error (files are not conformant, files don't exist, ...).
FYI, I am using Spring Batch without XML configuration, only annotations.
Here's what my job config class looks like:
@Configuration
@EnableBatchProcessing
public class ProductionOutConfig {

    @Autowired
    private StepBuilderFactory steps;

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private ProductionOutTasklet productionOutTasklet;

    @Autowired
    private CheckFilesForProdTasklet checkFilesForProdTasklet;

    @Bean
    public Job productionOutJob(@Qualifier("productionOut") Step productionOutStep,
                                @Qualifier("checkFilesForProd") Step checkFilesForProd) {
        return jobBuilderFactory.get("productionOutJob")
                .start(checkFilesForProd)
                .next(productionOutStep)
                .build();
    }

    @Bean(name = "productionOut")
    public Step productionOutStep() {
        return steps.get("productionOut")
                .tasklet(productionOutTasklet)
                .build();
    }

    @Bean(name = "checkFilesForProd")
    public Step checkFilesForProd() {
        return steps.get("checkFilesForProd")
                .tasklet(checkFilesForProdTasklet)
                .build();
    }
}
What you are looking for is already the default behavior of Spring Batch, i.e. the next step won't be executed if the previous step has failed. To mark the current step as failed, you need to throw a runtime exception that is not caught.
If the exception is not handled, Spring Batch will mark that step as failed and the next step won't be executed. So all you need to do is throw an exception in your failure scenarios.
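For example, the CheckFilesForProdTasklet from your configuration could simply throw when a check fails. Here is a minimal sketch, assuming a hypothetical file location and check; adapt it to your actual validation logic:
import java.io.File;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.stereotype.Component;

@Component
public class CheckFilesForProdTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        File input = new File("/path/to/input.csv"); // illustrative location
        if (!input.exists() || !input.getName().endsWith(".csv")) {
            // An uncaught runtime exception marks this step (and the job) as FAILED,
            // so the next step is never executed.
            throw new IllegalStateException("Input file is missing or is not a CSV: " + input);
        }
        return RepeatStatus.FINISHED;
    }
}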
For more complicated job flows, you might want to use a JobExecutionDecider (see Programmatic Flow Decisions in the reference documentation).
As the documentation specifies, you can use the on() method, which starts a transition to a new state if the exit status from the previous state matches the given pattern.
Your code could look something like this:
return jobBuilderFactory.get("productionOutJob")
        .start(checkFilesForProd)
        .on(ExitStatus.FAILED.getExitCode()).end()
        .from(checkFilesForProd)
        .on("*")
        .to(productionOutStep)
        .build()   // builds the flow
        .build();  // builds the job
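If you prefer the JobExecutionDecider mentioned above, a minimal sketch could look like the following (FileCheckDecider is a made-up name, not part of your code):
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class FileCheckDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // stepExecution belongs to the step that ran right before the decider (checkFilesForProd)
        if (stepExecution != null
                && ExitStatus.FAILED.getExitCode().equals(stepExecution.getExitStatus().getExitCode())) {
            return FlowExecutionStatus.FAILED;
        }
        return FlowExecutionStatus.COMPLETED;
    }
}
The job definition would then route on the decider's status instead of the step's exit status:
JobExecutionDecider fileCheckDecider = new FileCheckDecider();
return jobBuilderFactory.get("productionOutJob")
        .start(checkFilesForProd)
        .next(fileCheckDecider)
        .on("COMPLETED").to(productionOutStep)
        .from(fileCheckDecider).on("*").fail()
        .build()   // builds the flow
        .build();  // builds the job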
I have the following documentation, which says:
1.1. Multi-threaded Step
The simplest way to start parallel processing is to add a TaskExecutor to your Step configuration. When using Java configuration, a TaskExecutor can be added to the step as shown in the following example:
@Bean
public TaskExecutor taskExecutor() {
    return new SimpleAsyncTaskExecutor("spring_batch");
}

@Bean
public Step sampleStep(TaskExecutor taskExecutor) {
    return this.stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10)
            .reader(itemReader())
            .writer(itemWriter())
            .taskExecutor(taskExecutor)
            .build();
}
The result of the above configuration is that the Step executes by
reading, processing, and writing each chunk of items (each commit
interval) in a separate thread of execution. Note that this means
there is no fixed order for the items to be processed, and a chunk
might contain items that are non-consecutive compared to the
single-threaded case. In addition to any limits placed by the task
executor (such as whether it is backed by a thread pool), there is a
throttle limit in the tasklet configuration which defaults to 4. You
may need to increase this to ensure that a thread pool is fully
utilized.
Before reading this, I thought this had to be achieved with local partitioning, where I provide a partitioner that says how to divide the data into pieces. A multi-threaded step apparently does this automatically.
Question
Could you explain how this works? How can I manage it besides the thread count? Will it work for a flat file?
P.S.
I created the example:
@Configuration
public class MultithreadedStepConfig {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    private ToLowerCasePersonProcessor toLowerCasePersonProcessor;

    @Autowired
    private DbPersonWriter dbPersonWriter;

    @Value("${app.single-file}")
    Resource resources;

    @Bean
    public Job job(Step databaseToDataBaseLowercaseSlaveStep) {
        return jobBuilderFactory.get("myMultiThreadedJob")
                .incrementer(new RunIdIncrementer())
                .flow(csvToDataBaseSlaveStep())
                .end()
                .build();
    }

    private Step csvToDataBaseSlaveStep() {
        return stepBuilderFactory.get("csvToDatabaseStep")
                .<Person, Person>chunk(50)
                .reader(csvPersonReaderMulti())
                .processor(toLowerCasePersonProcessor)
                .writer(dbPersonWriter)
                .taskExecutor(jobTaskExecutorMultiThreaded())
                .build();
    }

    @Bean
    @StepScope
    public FlatFileItemReader<Person> csvPersonReaderMulti() {
        return new FlatFileItemReaderBuilder<Person>()
                .name("csvPersonReaderSplitted")
                .resource(resources)
                .delimited()
                .names(new String[]{"firstName", "lastName"})
                .fieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
                    setTargetType(Person.class);
                }})
                .saveState(false)
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutorMultiThreaded() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        // there are 21 sites currently hence we have 21 threads
        taskExecutor.setMaxPoolSize(30);
        taskExecutor.setCorePoolSize(25);
        taskExecutor.setThreadGroupName("multi-");
        taskExecutor.setThreadNamePrefix("multi-");
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }
}
And it really works according to the logs, but I want to know the details. Is it better than a self-written partitioner?
There are fundamental differences between multi-threaded steps and partitioning.
A multi-threaded step runs in a single process, so it is not a good idea if your processor/writer keeps persisted state. However, if you are just generating a report without saving anything, it is a good choice.
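Regarding "how can I manage it besides the thread number": apart from the TaskExecutor pool size, the throttle limit quoted from the documentation above caps how many chunks run concurrently (it defaults to 4). A minimal sketch, reusing the bean names from your configuration (the value 8 is illustrative):
@Bean
public Step csvToDatabaseStep(TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("csvToDatabaseStep")
            .<Person, Person>chunk(50)
            .reader(csvPersonReaderMulti())
            .processor(toLowerCasePersonProcessor)
            .writer(dbPersonWriter)
            .taskExecutor(taskExecutor)
            // defaults to 4: without raising it, at most 4 chunks are processed
            // concurrently no matter how large the thread pool is
            .throttleLimit(8)
            .build();
}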
As you mentioned, if you want to process a flat file and store the records in a DB, you can use the remote chunking concept, assuming your reader is not heavy.
A partitioner creates a separate worker execution for each set of data that you can logically divide.
Hope this helps.
Based on my understanding, partitioning is typically used for remote processing. The partition master (or manager) step creates multiple identical workers; the number of workers is the given grid size. For local processing, those workers are identical, meaning the same reader and writer objects, but they execute on different threads with different chunks of input provided by the customized partitioner. However, if the worker step has listeners or before/after-step methods, those methods are called by each worker; in a multi-threaded step they are called only once. Other than that, I don't see any differences for local processing.
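To make the local partitioning variant concrete, here is a minimal sketch; the partition key, grid size, and bean names are illustrative and not taken from the question:
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // one ExecutionContext per worker; a @StepScope reader can read "partitionIndex"
        // from the step execution context to know which slice of the input to handle
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("partitionIndex", i);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}
The manager step then runs one copy of the worker step per partition, on the given task executor:
@Bean
public Step masterStep() {
    return stepBuilderFactory.get("masterStep")
            .partitioner("workerStep", new RangePartitioner())
            .step(workerStep())          // the usual reader/processor/writer step, one execution per partition
            .gridSize(4)                 // number of partitions = number of workers
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}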
I personally suggest not using partitioning for local processing; use a multi-threaded step instead. There are plenty of open-source examples around, but don't use them if you don't feel comfortable with them.
I have the following batch job implemented in my Spring Batch config:
@Bean
public Job myJob(Step step1, Step step2, Step step3) {
    return jobs.get("myJob").start(step1).next(step2).next(step3).build();
}

@Bean
public Step step1(ItemReader<String> myReader,
                  ItemProcessor<String, String> myProcessor,
                  ItemWriter<String> myWriter) {
    return steps.get("step1").<String, String>chunk(1)
            .reader(myReader)
            .processor(myProcessor)
            .writer(myWriter)
            .build();
}
I would like to retry step1 (and likewise step2, step3, and so forth) on certain exceptions and roll back the job on any failure (including between retries). I understand that the rollback is not going to be automatic, and I am clear on what to roll back for each step by writing custom code.
What is the best way to implement this?
The Spring Framework provides the @Retryable and @Recover annotations (via Spring Retry) to retry and to recover when something fails. You can check this article: https://www.baeldung.com/spring-retry
Fault tolerance features in Spring Batch are applied to items in chunk-oriented steps, not to the entire step.
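For the retry part at the item level, a hedged sketch based on the step from your question; the exception type (DeadlockLoserDataAccessException, from org.springframework.dao) and the limit are illustrative:
@Bean
public Step step1(ItemReader<String> myReader,
                  ItemProcessor<String, String> myProcessor,
                  ItemWriter<String> myWriter) {
    return steps.get("step1").<String, String>chunk(1)
            .reader(myReader)
            .processor(myProcessor)
            .writer(myWriter)
            .faultTolerant()
            .retry(DeadlockLoserDataAccessException.class) // retry items that fail with this exception
            .retryLimit(3)                                  // at most 3 attempts per item
            .build();
}
When a retried item still fails after the limit (and the exception is not declared skippable), the chunk's transaction is rolled back and the step fails, which in turn fails the job.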
What you can try to do is use a flow with a decider where you restart from step1 if an exception occurs in one of the subsequent steps.
I have a Spring Boot + Spring Batch application which reads a source CSV file, processes it, and writes a target CSV file. I'm struggling with writing tests that use an input ("simpleFlowInput.csv") and compare the output ("simpleFlowActual.csv") with an expected file ("simpleFlowExpected.csv"). I would like to write many of these tests but struggle with the way to do it.
My application contains only one step and one job:
#Bean("csvFileToFileStep")
public Step csvFileToFileStep() {
return stepBuilderFactory.get("csvFileToFileStep").<RowInput, RowOutput>chunk(10000).reader(csvRowsReader()).processor(csvRowsProcessor())
.writer(compositeItemWriter()).build();
}
#Bean("csvFileToCsvJob")
Job csvFileToCsvJob(JobCompletionNotificationListener listener) {
return jobBuilderFactory.get("csvFileToCsvJob").incrementer(new RunIdIncrementer()).listener(listener).flow(csvFileToFileStep()).end()
.build();
}
My current test:
@RunWith(SpringJUnit4ClassRunner.class)
@Configuration
@EnableBatchProcessing
@SpringBootTest
public class Tester {

    @Autowired
    Job csvFileToCsvJob;

    @Autowired
    Step csvFileToFileStep;

    @Autowired
    CsvFileReadProcessAndWriteConfig csvFileReadProcessAndWriteConfig;

    private JobLauncherTestUtils jobLauncherTestUtils = new JobLauncherTestUtils();

    @Test
    public void testSimpleFlow() throws Exception {
        ClassLoader classLoader = getClass().getClassLoader();
        File fileInput = new File(classLoader.getResource("simpleFlowInput.csv").getFile());
        File fileActual = new File(classLoader.getResource("simpleFlowActual.csv").getFile());
        File fileExpected = new File(classLoader.getResource("simpleFlowExpected.csv").getFile());

        FileManager.getInstance().setInputFileLocation(fileInput.toString());
        FileManager.getInstance().setOutputFileLocation(fileActual.toString());

        System.out.println(fileExpected.length());
        System.out.println(fileActual.length());

        Assert.assertTrue(fileExpected.length() == fileActual.length());
        AssertFile.assertFileEquals(fileExpected, fileActual); // compare
    }
}
Any advice on how to test it?
(I found this question from 2010 with a partial answer mentioning JobLauncherTestUtils: What is the best way to test job flow in Spring-Batch?)
The End-To-End Testing of Batch Jobs section of the documentation explains in detail how to test a Spring Batch job (including how to use the JobLauncherTestUtils).
Spring Batch provides a nice utility class called AssertFile in the spring-batch-test module, which can be helpful in your case: you write the expected file and then assert the actual one (generated by your job) against it. The section Validating Output Files shows how to use this class.
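Putting both together, a minimal test sketch; it assumes spring-batch-test 4.1+ (for @SpringBatchTest, which registers a JobLauncherTestUtils bean) and illustrative file locations that you would wire through your FileManager or properties:
import java.io.File;

import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.test.AssertFile;
import org.springframework.batch.test.JobLauncherTestUtils;
import org.springframework.batch.test.context.SpringBatchTest;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBatchTest
@SpringBootTest
public class SimpleFlowJobTest {

    @Autowired
    private JobLauncherTestUtils jobLauncherTestUtils;

    @Test
    public void testSimpleFlow() throws Exception {
        // launches the single job defined in the context with unique parameters
        JobExecution jobExecution = jobLauncherTestUtils.launchJob();

        Assert.assertEquals(ExitStatus.COMPLETED, jobExecution.getExitStatus());
        AssertFile.assertFileEquals(
                new File("src/test/resources/simpleFlowExpected.csv"),   // expected output
                new File("target/output/simpleFlowActual.csv"));         // actual output written by the job
    }
}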
Hope this helps.
I'm evaluating Spring Batch for a particular project and, after a lot of searching around the web, I haven't been able to find a Spring Batch solution that meets my requirements.
I'm wondering if Spring Batch is capable of reading multiple CSV files with different formats in a single job? For example, let's say Person.csv and Address.csv, both with different formats, but dependent on each other.
I need to read, process with data corrections (i.e. toUpperCase, etc.), and validate each record.
In the event of validation errors, I need to record the error(s) to some sort of object array where they will be made available later, after validation has completed, to be emailed to the end user for corrections.
Once all the data from both files has been validated and no validation errors have occurred in either file, continue on to the batch writer. If any errors have occurred in either of the two files, I need to stop the entire job. If the writer has already begun writing to the database when an error occurs, the entire job needs to be rolled back, regardless of which file the error is in.
I cannot insert either of the two CSV files if there is any kind of validation error in either one. The end user must be notified of the errors. The errors will be used to make any necessary corrections prior to reprocessing the files.
Is Spring Batch in Spring Boot 2 capable of this behavior?
Example
Person.csv
BatchId, personId, firstName, lastName
Address.csv
BatchId, personId, address1
In the above example the relationship between the two files is the batchId and personId. If there is any kind of validation error in either of the two files, I must fail the entire batch. I'd like to complete validation on both files so I can respond with all the errors, but just not write to the database.
I'm wondering if spring batch is capable of reading multiple CSV files made up of different formats in a single job?
Yes, you can have a single job with multiple steps, each step processing a file of a given type. The point is how to design the job. One technique you can apply is using staging tables. A batch job can create temporary staging tables where it loads all data needed and then remove them when done.
In your case, you can have two steps loading each file in a specific staging table. Each step can apply validation logic specific to each file. If one of these steps fail, you fail the job. Staging tables can have a marker column for invalid records (this is useful for reporting).
Once these two preparatory steps are done, you can read data from the two staging tables in another step and apply cross-validation rules against joined data (for example select from both tables and join by BatchId and PersonId). If this step fails, you fail the job. Otherwise, you write data where appropriate.
The advantage of this technique is that data is available in staging tables during the entire job. So whenever a validation step fails, you can use a flow to redirect the failed step to a "reporting step" (that reads invalid data and sends the report) and then fail the job. Here is a self-contained example you can play with:
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
@EnableBatchProcessing
public class FlowJobSample {

    @Autowired
    private JobBuilderFactory jobs;

    @Autowired
    private StepBuilderFactory steps;

    @Bean
    public Step personLoadingStep() {
        return steps.get("personLoadingStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("personLoadingStep");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Step addressLoadingStep() {
        return steps.get("addressLoadingStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("addressLoadingStep");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Step crossValidationStep() {
        return steps.get("crossValidationStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("crossValidationStep");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Step reportingStep() {
        return steps.get("reportingStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("reportingStep");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    @Bean
    public Job job() {
        return jobs.get("job")
                .start(personLoadingStep()).on("INVALID").to(reportingStep())
                .from(personLoadingStep()).on("*").to(addressLoadingStep())
                .from(addressLoadingStep()).on("INVALID").to(reportingStep())
                .from(addressLoadingStep()).on("*").to(crossValidationStep())
                .from(crossValidationStep()).on("INVALID").to(reportingStep())
                .from(crossValidationStep()).on("*").end()
                .from(reportingStep()).on("*").fail()
                .build()
                .build();
    }

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new AnnotationConfigApplicationContext(FlowJobSample.class);
        JobLauncher jobLauncher = context.getBean(JobLauncher.class);
        Job job = context.getBean(Job.class);
        jobLauncher.run(job, new JobParameters());
    }
}
To make one of the steps fail, set the exit status to INVALID, for example:
@Bean
public Step personLoadingStep() {
    return steps.get("personLoadingStep")
            .tasklet((contribution, chunkContext) -> {
                System.out.println("personLoadingStep");
                chunkContext.getStepContext().getStepExecution().setExitStatus(new ExitStatus("INVALID"));
                return RepeatStatus.FINISHED;
            })
            .build();
}
I hope this helps.
We have a requirement to carry out data movement from one database to another and are exploring Spring Batch for this. The user of our application selects the source and target datasources along with the list of tables for which the data needs to be moved.
We need help with the following:
The information necessary to build a job comes at runtime from our web application; it includes the datasource details and the list of table names. We would like to create a new job by sending these details to a job builder module and launch it using the JobLauncher. How do we write this job builder module?
We may have multiple users raising data movement requests in parallel, so we need a way to create multiple jobs and run them in a suitable order.
We have used Java-based configuration to create a job and launch it from a web container. The configuration is as follows:
@Bean
public Job loadDataJob(JobCompletionNotificationListener listener) {
    RunIdIncrementer inc = new RunIdIncrementer();
    inc.setKey(new Date().toString());
    JobBuilder builder = jobBuilderFactory.get("loadDataJob")
            .incrementer(inc)
            .listener(listener);
    SimpleJobBuilder simpleBuilder = builder.start(preExecute());
    for (String s : getTables()) {
        simpleBuilder.next(etlTable(s));
    }
    simpleBuilder.next(postExecute());
    return simpleBuilder.build();
}

@Bean
@Scope("prototype")
public Step etlTable(String tableName) {
    return stepBuilderFactory.get(tableName)
            .<Map<String, Object>, Map<String, Object>>chunk(1000)
            .reader(dbDataReader(tableName))
            .processor(processor())
            .writer(dbDataWriter(tableName))
            .build();
}
Currently we have hardcoded the source and target datasource details in the respective beans. getTables() returns a (hardcoded) list of tables for which the data needs to be moved.
The RestController that launches the job:
@RestController
public class MyController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job job;

    @RequestMapping("/launchjob")
    public String handle() throws Exception {
        try {
            JobParameters jobParameters = new JobParametersBuilder()
                    .addLong("time", new Date().getTime())
                    .toJobParameters();
            jobLauncher.run(job, jobParameters);
        } catch (Exception e) {
        }
        return "Done";
    }
}
Concerning your first question, you definitely have to use Java configuration. Moreover, you shouldn't define your steps as Spring beans if you want to create a job with a dynamic number of steps (for instance, a step per table you have to copy).
I've written a couple of answers to questions about how to create jobs dynamically. Have a look at them, they might be helpful:
Spring batch execute dynamically generated steps in a tasklet
Spring batch repeat step ending up in never ending loop
Spring Batch - How to generate parallel steps based on params created in a previous step
Spring Batch - Looping a reader/processor/writer step
Edited
Some remarks concerning your second question:
Firstly, you are using a normal JobLauncher, and I assume you instantiate the SimpleJobLauncher. This means you can provide a job together with JobParameters, as you have shown in your code above. However, the provided "job" does not have to be a Spring bean instance, so you don't have to autowire it, and therefore you can use create methods as I suggested in the answers to the questions mentioned above.
Secondly, if you create your Job instance dynamically for every request, there is no need to pass the whole configuration as JobParameters, since you can pass the "configuration properties" like the datasource and the tables to be copied directly as parameters to your "createJob" method. You could even create your DataSource instances "on the fly", if you don't know all possible datasources in advance.
Thirdly, I would consider every request a "single run" which cannot be "restarted". Hence, I'd just put some "meta information" into the JobParameters, like user, date/time, datasource names (URLs), and the list of tables to be copied. I would use this kind of information only for logging/auditing which requests were issued, but I wouldn't use the JobParameters instances as control parameters inside the job itself. Since you can pass the values of these parameters at construction time of the job and its steps (by passing them to your create methods), the structure of your job is created according to your parameters, and hence at runtime, when you could access the JobParameters, there is nothing left to do based on them.
Finally, if a request fails (meaning the job exits with an error), simply execute a new request in order to retry. This request would be a completely new request and not a restart of an already executed job launch (since I would add the request time to my JobParameters, every launch would be a unique launch).
Edited 2:
Not creating the Job as a bean doesn't mean not using autowiring. Here is an example of how I would structure my beans.
@Component
@EnableBatchProcessing
@Import() // list with imports as needed
public class JobCreatorComponent {

    @Autowired
    private StepBuilderFactory stepBuilder;

    @Autowired
    private JobBuilderFactory jobBuilder;

    public Job createJob(all the parameters you need) {
        return jobBuilder.get(). ....
    }
}
@RestController
@Import(JobCreatorComponent.class)
public class MyController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    JobCreatorComponent jobCreator;

    @RequestMapping("/launchjob")
    public String handle() throws Exception {
        try {
            Job job = jobCreator.createJob(... params ...);
            JobParameters jobParameters = new JobParametersBuilder()
                    .addLong("time", new Date().getTime())
                    .toJobParameters();
            jobLauncher.run(job, jobParameters);
        } catch (Exception e) {
        }
        return "Done";
    }
}
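To connect this with the loop over tables from the original configuration, here is a hedged sketch of what createJob could look like; DriverManagerDataSource, ColumnMapRowMapper, and the dbDataWriter(table) factory method (borrowed from the question) are assumptions, not a prescribed implementation:
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.SimpleJobBuilder;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.jdbc.core.ColumnMapRowMapper;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

// additional methods inside JobCreatorComponent:
public Job createJob(String jobName, String url, String user, String password, List<String> tables) {
    DataSource source = new DriverManagerDataSource(url, user, password); // datasource created "on the fly"

    SimpleJobBuilder builder = jobBuilder.get(jobName + "-" + System.currentTimeMillis())
            .start(etlTable(source, tables.get(0)));
    for (String table : tables.subList(1, tables.size())) {
        builder.next(etlTable(source, table));
    }
    return builder.build();
}

private Step etlTable(DataSource source, String table) {
    JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReaderBuilder<Map<String, Object>>()
            .name("read-" + table)
            .dataSource(source)
            .sql("SELECT * FROM " + table)
            .rowMapper(new ColumnMapRowMapper())
            .build();

    return stepBuilder.get("etl-" + table)
            .<Map<String, Object>, Map<String, Object>>chunk(1000)
            .reader(reader)
            .writer(dbDataWriter(table)) // writer built the same way against the target datasource
            .build();
}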
By using @JobScope on the item reader, you don't need to do things manually at runtime; just annotate your reader with @JobScope, and on each interaction with the controller you will get fresh record processing.
This is a kind of on-demand job, where you can execute the job for goals like performing a DB migration or producing a specific report.
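A minimal sketch of that idea, assuming a table name passed as a job parameter; the SQL, parameter name, and reader type are illustrative:
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.JobScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Bean
@JobScope // a new reader instance is created for every job execution
public JdbcCursorItemReader<Map<String, Object>> tableReader(
        DataSource dataSource,
        @Value("#{jobParameters['tableName']}") String tableName) {
    return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
            .name("tableReader")
            .dataSource(dataSource)
            .sql("SELECT * FROM " + tableName)
            .rowMapper(new ColumnMapRowMapper())
            .build();
}
Each controller call then launches the job with a fresh "tableName" (and, for example, a timestamp) in its JobParameters, and the job-scoped reader picks it up at runtime.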