We have a requirement to move data from one database to another and are exploring Spring Batch for this. A user of our application selects the source and target datasources along with the list of tables for which the data needs to be moved.
We need help with the following:
The information necessary to build a job comes at runtime from our web application - that includes datasource details and list of table names. We would like to create a new job by sending these details to the job builder module and launch it using JobLauncher. How do we write this job builder module?
We may have multiple users raising data movement requests in parallel, so we need a way to create multiple jobs and run them in a suitable order.
We have used Java-based configuration to create a job and launch it from a web container. The configuration is as follows:
@Bean
public Job loadDataJob(JobCompletionNotificationListener listener) {
    RunIdIncrementer inc = new RunIdIncrementer();
    inc.setKey(new Date().toString());

    JobBuilder builder = jobBuilderFactory.get("loadDataJob")
            .incrementer(inc)
            .listener(listener);

    SimpleJobBuilder simpleBuilder = builder.start(preExecute());
    for (String s : getTables()) {
        simpleBuilder.next(etlTable(s));
    }
    simpleBuilder.next(postExecute());
    return simpleBuilder.build();
}

@Bean
@Scope("prototype")
public Step etlTable(String tableName) {
    return stepBuilderFactory.get(tableName)
            .<Map<String, Object>, Map<String, Object>>chunk(1000)
            .reader(dbDataReader(tableName))
            .processor(processor())
            .writer(dbDataWriter(tableName))
            .build();
}
Currently we have hardcoded the source and target datasource details into the respective beans. The getTables() method returns a hardcoded list of tables for which the data needs to be moved.
The RestController that launches the job:
@RestController
public class MyController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job job;

    @RequestMapping("/launchjob")
    public String handle() throws Exception {
        try {
            JobParameters jobParameters = new JobParametersBuilder()
                    .addLong("time", new Date().getTime())
                    .toJobParameters();
            jobLauncher.run(job, jobParameters);
        } catch (Exception e) {
        }
        return "Done";
    }
}
Concerning your first question, you definitely have to use Java configuration. Moreover, you should not define your steps as Spring beans if you want to create a job with a dynamic number of steps (for instance, a step per table to be copied).
I've written a couple of answers to questions about how to create jobs dynamically. Have a look at them; they might be helpful:
Spring batch execute dynamically generated steps in a tasklet
Spring batch repeat step ending up in never ending loop
Spring Batch - How to generate parallel steps based on params created in a previous step
Spring Batch - Looping a reader/processor/writer step
Edited
Some remarks concerning your second question:
Firstly, you are using a normal JobLauncher and I assume you instantiate the SimpleJobLauncher. This means you can provide a job with job parameters, as you have shown in your code above. However, the provided job does not have to be a Spring bean instance, so you don't have to autowire it and can therefore use creation methods, as I suggested in the answers to the questions mentioned above.
Secondly, if you create your Job instance dynamically for every request, there is no need to pass the whole configuration as job parameters, since you can pass the configuration properties, such as the datasource and the tables to be copied, directly as parameters to your createJob method. You could even create your DataSource instances on the fly if you don't know all possible datasources in advance.
Thirdly, I would consider every request a single run, which cannot be restarted. Hence, I'd just put some meta information into the job parameters, like user, date/time, datasource names (URLs) and the list of tables to be copied. I would use this information only as a kind of logging/auditing of which requests were issued, but I wouldn't use the JobParameters instances as control parameters inside the job itself. Again, you can pass the values of these parameters at construction time of the job and steps by handing them to your create methods, so the structure of your job is created according to your parameters and there is nothing left to do based on the job parameters at runtime.
Finally, if a request fails (meaning the job exits with an error), simply execute a new request in order to retry. This would be a completely new request and not a restart of an already executed job launch (since I would add the request time to my job parameters, every launch is a unique launch).
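Such "meta information only" job parameters might look roughly like this (a sketch; the parameter names and the userName/sourceUrl/targetUrl/tables variables are illustrative):
// Illustrative sketch: parameters used purely for logging/auditing,
// not as control parameters inside the job itself.
JobParameters jobParameters = new JobParametersBuilder()
        .addString("user", userName)
        .addString("sourceDataSource", sourceUrl)
        .addString("targetDataSource", targetUrl)
        .addString("tables", String.join(",", tables))
        .addDate("requestTime", new Date()) // makes every launch unique
        .toJobParameters();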
Edited 2:
Not creating the Job as a bean doesn't mean not using autowiring. Here is an example of how I would structure my beans:
@Component
@EnableBatchProcessing
@Import() // list of imports as needed
public class JobCreatorComponent {

    @Autowired
    private StepBuilderFactory stepBuilder;

    @Autowired
    private JobBuilderFactory jobBuilder;

    public Job createJob(all the parameters you need) {
        return jobBuilder.get(). ....
    }
}
@RestController
@Import(JobCreatorComponent.class)
public class MyController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    JobCreatorComponent jobCreator;

    @RequestMapping("/launchjob")
    public String handle() throws Exception {
        try {
            Job job = jobCreator.createJob(... params ...);
            JobParameters jobParameters = new JobParametersBuilder()
                    .addLong("time", new Date().getTime())
                    .toJobParameters();
            jobLauncher.run(job, jobParameters);
        } catch (Exception e) {
        }
        return "Done";
    }
}
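Building on that skeleton, a createJob method that receives the datasources and the table list at request time might look roughly like this. This is only a sketch: dbDataReader and dbDataWriter are assumed to be plain factory methods (not @Bean definitions) that build a reader/writer for a given table and DataSource, in the spirit of the question's code.
// Sketch only: builds one step per table, using the datasources and table
// list passed in at request time rather than Spring beans.
public Job createJob(String jobName, DataSource source, DataSource target, List<String> tables) {
    SimpleJobBuilder builder = jobBuilder.get(jobName)
            .incrementer(new RunIdIncrementer())
            .start(etlTableStep(tables.get(0), source, target));
    for (String table : tables.subList(1, tables.size())) {
        builder.next(etlTableStep(table, source, target));
    }
    return builder.build();
}

private Step etlTableStep(String table, DataSource source, DataSource target) {
    return stepBuilder.get("etl-" + table)
            .<Map<String, Object>, Map<String, Object>>chunk(1000)
            .reader(dbDataReader(table, source))  // plain factory methods,
            .writer(dbDataWriter(table, target))  // not @Bean definitions
            .build();
}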
By using @JobScope on the ItemReader there is no need to do things manually at runtime; just annotate your reader with @JobScope and on each interaction with the controller you will get a fresh reader instance and fresh record processing.
This is a kind of on-demand job that you can execute for goals like performing a DB migration or producing a specific report.
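For example, a job-scoped reader could late-bind the table name from the job parameters (a sketch; the JdbcCursorItemReader, the sourceDataSource bean and the "tableName" parameter are assumptions, not part of the original code):
// Sketch: a fresh reader is created for every job execution, with the
// table name late-bound from the job parameters.
@Bean
@JobScope
public JdbcCursorItemReader<Map<String, Object>> dbDataReader(
        @Value("#{jobParameters['tableName']}") String tableName,
        DataSource sourceDataSource) {
    JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(sourceDataSource);
    reader.setSql("SELECT * FROM " + tableName);
    reader.setRowMapper(new ColumnMapRowMapper());
    return reader;
}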
Related
I need help. I want to change the ItemWriter at runtime based on the processor result.
@Configuration
public class StepConfig {

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Step step(
            ItemReader<Sample> itemReader,
            ItemProcessor<Sample, Sample> beanProcessor,
            ItemWriter<Sample> itemWriter
    ) {
        return stepBuilderFactory.get("our step")
                .<Sample, Sample>chunk(1)
                .reader(itemReader)
                .processor(beanProcessor).listener(new ProcessorListener())
                .writer(itemWriter)
                .build();
    }
}
My processor is a bean validation processor. If the validation succeeds, I need a success writer. If it throws an exception, I need to execute an exception writer.
How can I do that?
Thanks!
I did the following:
I created 2 writers and tied them together with a CompositeItemWriter.
I pass a DTO with a reference to the error data (for failures) and/or to the data that was processed successfully.
The DTO exists only to pass data from the processor to the writers.
Each writer checks whether the DTO carries its respective data: if the success data is there, it handles it, otherwise it returns; if the error data is there, it handles it, otherwise it returns. The CompositeItemWriter will always call all writers inside of it.
To be clear, in my case I had a reader that read information from a table and a pool of processors that processed the information and, at the end, populated a DTO. The DTO was only there to carry information from the processor to the CompositeItemWriter. It had references to two objects, one for success and another for error, and each writer checked its own field on the DTO.
Maybe it is also possible to define a Spring Batch flow for your case. That would be the most elegant way, but you have to study the documentation to find the best way to implement it.
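A minimal sketch of that CompositeItemWriter/DTO approach (ResultDto and its getSuccessData()/getErrorData() accessors are illustrative names, not from the original code):
// Sketch: the DTO carries either success data or error data, and each
// delegate writer only handles the part it is responsible for.
@Bean
public CompositeItemWriter<ResultDto> compositeWriter() {
    CompositeItemWriter<ResultDto> writer = new CompositeItemWriter<>();
    List<ItemWriter<? super ResultDto>> delegates = new ArrayList<>();
    delegates.add(successWriter());
    delegates.add(errorWriter());
    writer.setDelegates(delegates);
    return writer;
}

@Bean
public ItemWriter<ResultDto> successWriter() {
    return items -> {
        for (ResultDto item : items) {
            if (item.getSuccessData() != null) {
                // persist the successfully processed data
            }
        }
    };
}

@Bean
public ItemWriter<ResultDto> errorWriter() {
    return items -> {
        for (ResultDto item : items) {
            if (item.getErrorData() != null) {
                // log / persist the validation error
            }
        }
    };
}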
I solved the problem after days!
First, I used a classifier writer: it decides which writer to choose based on a property of my object.
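A minimal sketch of that classifier writer, assuming ClassifierCompositeItemWriter as the implementation and an illustrative Sample.isValid() flag set by the processor:
// Sketch: the classifier routes each item to the success or error writer
// based on an (illustrative) flag set by the processor.
@Bean
public ClassifierCompositeItemWriter<Sample> classifierWriter(
        ItemWriter<Sample> successWriter,
        ItemWriter<Sample> errorWriter) {
    ClassifierCompositeItemWriter<Sample> writer = new ClassifierCompositeItemWriter<>();
    writer.setClassifier(sample -> sample.isValid() ? successWriter : errorWriter);
    return writer;
}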
I should highlight that, before the classifier, I had to deal with the main problem, which was the BeanValidation processor.
As a bean validation processor, it works with just:
@Bean
public ItemProcessor<Sample, Sample> validate() {
    return new BeanValidatingItemProcessor<>();
}
But I need to know whether the validation will fail. So I tried to use a try/catch inside the method above and call .process(sample) to see that. Inexplicably, the method threw an NPE.
I solved this by declaring a bean:
@Configuration
public class Beans {

    @Bean
    public ValidatingItemProcessor<Sample> validatorBean() {
        return new BeanValidatingItemProcessor<Sample>();
    }
}
and injecting it into my processor. Finally it works.
Thank you all.
I have the following doc.
It mentions the following:
1.1. Multi-threaded Step
The simplest way to start parallel processing is to add a TaskExecutor to your Step configuration. When using Java configuration, a TaskExecutor can be added to the step as shown in the following example:
@Bean
public TaskExecutor taskExecutor() {
    return new SimpleAsyncTaskExecutor("spring_batch");
}

@Bean
public Step sampleStep(TaskExecutor taskExecutor) {
    return this.stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10)
            .reader(itemReader())
            .writer(itemWriter())
            .taskExecutor(taskExecutor)
            .build();
}
The result of the above configuration is that the Step executes by reading, processing, and writing each chunk of items (each commit interval) in a separate thread of execution. Note that this means there is no fixed order for the items to be processed, and a chunk might contain items that are non-consecutive compared to the single-threaded case. In addition to any limits placed by the task executor (such as whether it is backed by a thread pool), there is a throttle limit in the tasklet configuration which defaults to 4. You may need to increase this to ensure that a thread pool is fully utilized.
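In code, that throttle limit can be raised directly on the step builder; a minimal sketch based on the quoted example (throttleLimit is the builder method the documentation refers to):
// Sketch: raising the default throttle limit of 4 so the thread pool is used fully.
@Bean
public Step sampleStep(TaskExecutor taskExecutor) {
    return this.stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10)
            .reader(itemReader())
            .writer(itemWriter())
            .taskExecutor(taskExecutor)
            .throttleLimit(20) // defaults to 4
            .build();
}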
But until now I thought this had to be achieved by local partitioning, where I would provide a partitioner that says how to divide the data into pieces, whereas a multi-threaded step apparently does it automatically.
Question
Could you explain how it works? How can I control it beyond the number of threads? Will it work for a flat file?
P.S.
I created the example:
@Configuration
public class MultithreadedStepConfig {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    private ToLowerCasePersonProcessor toLowerCasePersonProcessor;

    @Autowired
    private DbPersonWriter dbPersonWriter;

    @Value("${app.single-file}")
    Resource resources;

    @Bean
    public Job job(Step databaseToDataBaseLowercaseSlaveStep) {
        return jobBuilderFactory.get("myMultiThreadedJob")
                .incrementer(new RunIdIncrementer())
                .flow(csvToDataBaseSlaveStep())
                .end()
                .build();
    }

    private Step csvToDataBaseSlaveStep() {
        return stepBuilderFactory.get("csvToDatabaseStep")
                .<Person, Person>chunk(50)
                .reader(csvPersonReaderMulti())
                .processor(toLowerCasePersonProcessor)
                .writer(dbPersonWriter)
                .taskExecutor(jobTaskExecutorMultiThreaded())
                .build();
    }

    @Bean
    @StepScope
    public FlatFileItemReader csvPersonReaderMulti() {
        return new FlatFileItemReaderBuilder()
                .name("csvPersonReaderSplitted")
                .resource(resources)
                .delimited()
                .names(new String[]{"firstName", "lastName"})
                .fieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
                    setTargetType(Person.class);
                }})
                .saveState(false)
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutorMultiThreaded() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        // there are 21 sites currently hence we have 21 threads
        taskExecutor.setMaxPoolSize(30);
        taskExecutor.setCorePoolSize(25);
        taskExecutor.setThreadGroupName("multi-");
        taskExecutor.setThreadNamePrefix("multi-");
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }
}
And it really works according to the log, but I want to know the details. Is it better than a self-written partitioner?
There are basically fundamental differences between multi-threaded steps and partitioning.
A multi-threaded step runs in a single process, so it is not a good idea to use it if you have persisted state in the processor/writer. However, if you are just generating a report without saving anything, it is a good choice.
As you mentioned, if you want to process a flat file and store the records in a DB, you can also use the remote chunking concept, assuming your reader is not heavy.
A partitioner will create a separate execution for each set of data that you can logically divide.
Hope this helps.
Based on my understanding, partitioning is typically used for remote processing. The partition master (or manager) step will create multiple identical workers; the number of workers is the given grid size. For local processing those workers are identical, meaning the same reader and writer objects, but they execute on different threads with different chunks of input provided by the customized partitioner. However, if the worker step has listeners or before/after step methods, those methods will be called by each worker; in the multi-threaded step scheme, by contrast, they get called only once. Other than that, I don't see any differences for local processing.
I personally suggest not using partitioning for local processing; use a multi-threaded step instead. There are so many open source packages; don't use them if you don't feel comfortable with them.
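For comparison, a locally partitioned step would look roughly like this (a sketch; the rangePartitioner bean, the worker step and the grid size are illustrative assumptions):
// Sketch: a manager step that splits the input via a custom Partitioner and
// runs the same worker step on several threads, one partition each.
@Bean
public Step partitionedManagerStep(Step workerStep,
                                   Partitioner rangePartitioner,
                                   TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("partitionedManagerStep")
            .partitioner("workerStep", rangePartitioner)
            .step(workerStep)
            .gridSize(4)                // number of partitions
            .taskExecutor(taskExecutor) // local, thread-based execution
            .build();
}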
I am using Spring Batch to read some data from CSV files and put it in a database.
My batch job must be composed of 2 steps:
Check files (names, extension, content...)
Read lines from CSV and save them in DB (ItemReader, ItemProcessor, ItemWriter...)
Step 2 must not be executed if Step 1 generated an error (files are not conformant, files don't exist, ...).
FYI, I am using Spring Batch without XML configuration, only annotations.
Here's what my job config class looks like:
@Configuration
@EnableBatchProcessing
public class ProductionOutConfig {

    @Autowired
    private StepBuilderFactory steps;

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private ProductionOutTasklet productionOutTasklet;

    @Autowired
    private CheckFilesForProdTasklet checkFilesForProdTasklet;

    @Bean
    public Job productionOutJob(@Qualifier("productionOut") Step productionOutStep,
                                @Qualifier("checkFilesForProd") Step checkFilesForProd) {
        return jobBuilderFactory.get("productionOutJob")
                .start(checkFilesForProd)
                .next(productionOutStep)
                .build();
    }

    @Bean(name = "productionOut")
    public Step productionOutStep() {
        return steps.get("productionOut")
                .tasklet(productionOutTasklet)
                .build();
    }

    @Bean(name = "checkFilesForProd")
    public Step checkFilesForProd() {
        return steps.get("checkFilesForProd")
                .tasklet(checkFilesForProdTasklet)
                .build();
    }
}
What you are looking for is already the default behavior of Spring Batch, i.e. the next step won't be executed if the previous step has failed. To mark the current step as failed, you need to throw a runtime exception that is not caught.
If the exception is not handled, Spring Batch will mark that step as failed and the next step won't be executed. So all you need to do is throw an exception in your failure scenarios.
For more complicated job flows, you might like to use a JobExecutionDecider (see Programmatic Flow Decisions).
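A minimal JobExecutionDecider might look like this (the decision logic shown is only an illustration):
// Sketch: decide where the flow goes based on the previous step's outcome.
public class FileCheckDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        if (stepExecution != null
                && ExitStatus.FAILED.getExitCode().equals(stepExecution.getExitStatus().getExitCode())) {
            return FlowExecutionStatus.FAILED;
        }
        return FlowExecutionStatus.COMPLETED;
    }
}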
As the documentation specifies, you can use the on() method, which starts a transition to a new state if the exit status from the previous state matches the given pattern.
Your code could look similar to this:
return jobBuilderFactory.get("productionOutJob")
        .start(checkFilesForProd)
        .on(ExitStatus.FAILED.getExitCode()).end()
        .from(checkFilesForProd)
        .on("*")
        .to(productionOutStep)
        .end()
        .build();
I have set up a Java batch project with Spring Batch that persists the rows of a CSV file into a database table.
I would like to know whether it is possible, with a Spring REST API, to trigger the batch with a POST method that attaches the necessary CSV.
Thank you in advance.
You can do that using a controller with a JobLauncher and a Job. The bare bones of the controller would look like this:
@RestController
public class MyController {

    // Usually provided by Spring Batch
    private JobLauncher jobLauncher;

    // Your Job
    private Job job;

    // Ctor
    public MyController(JobLauncher jobLauncher, Job job, ...) {}

    @PostMapping("/")
    public String launchJob(...) {
        ...
        // Create JobParameters and launch
        JobParameters jobParameters = new JobParameters();
        jobLauncher.run(job, jobParameters);
        ...
    }
}
SimpleJobLauncher, the default implementation of JobLauncher, uses a synchronous task executor by default; you'll probably want to change it to an asynchronous one depending on your requirements.
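For example, an asynchronous launcher could be configured roughly like this (a sketch; the JobRepository is assumed to be provided by Spring Batch's configuration):
// Sketch: a JobLauncher that returns immediately instead of blocking the
// HTTP request until the job finishes.
@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor("batch-"));
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}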
I've read about AbstractRoutingDataSource and the standard ways to bind a datasource dynamically in this article:
public class CustomerRoutingDataSource extends AbstractRoutingDataSource {

    @Override
    protected Object determineCurrentLookupKey() {
        return CustomerContextHolder.getCustomerType();
    }
}
It uses a ThreadLocal context holder to "set" the DataSource:
public class CustomerContextHolder {

    private static final ThreadLocal<CustomerType> contextHolder =
            new ThreadLocal<CustomerType>();

    public static void setCustomerType(CustomerType customerType) {
        Assert.notNull(customerType, "customerType cannot be null");
        contextHolder.set(customerType);
    }

    public static CustomerType getCustomerType() {
        return (CustomerType) contextHolder.get();
    }

    // ...
}
I have a quite complex system where threads are not necessarily under my control, say:
Scheduled EJB reads a job list from the database
For each Job it fires a Spring (or Java EE) batch job.
Each job has its origin and destination databases (read from a central database).
Multiple jobs will run in parallel
Jobs may be multithreaded.
ItemReader will use the origin data source that was set for that specific job (origin data source must be bound to some repositories)
ItemWriter will use the destination data source that was set for that specific job (destination data source must also be bound to some repositories).
So I'm feeling somewhat anxious about ThreadLocal; in particular, I'm not sure whether the same thread will be used to handle multiple jobs. If that happens, origin and destination databases may get mixed up.
How can I "store" and bind a data source dynamically in a safe way when dealing with multiple threads?
I could not find a way to set up Spring to play nicely with my setup and inject the desired DataSource, so I decided to handle it manually.
Detailed solution:
I changed my repositories to be prototypes so that a new instance is constructed every time I wire one:
@Repository
@Scope(BeanDefinition.SCOPE_PROTOTYPE)
I introduced new setDataSource and setSchema methods in the top-level interfaces/implementations that are supposed to work with multiple instances/schemas.
Since I'm using spring-data-jdbc-repository, my setDataSource method simply wraps the DataSource in a new JdbcTemplate and propagates the change:
setJdbcOperations(new JdbcTemplate(dataSource));
My implementation obtains the DataSources directly from the application server:
final Context context = new InitialContext();
final DataSource dataSource = (DataSource) context.lookup("jdbc/" + dsName);
Finally, for multiple schemas under the same database instance, I log in with a special user (with the correct permissions) and use an Oracle command to switch to the desired schema:
getJdbcOperations().execute("ALTER SESSION SET CURRENT_SCHEMA = " + schema);
While this goes against the dependency inversion principle, it works and handles my concurrency requirements very well.
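Putting it together, a minimal sketch of such a prototype-scoped repository (the class name and query methods are illustrative; only the wiring described above is shown):
// Sketch: a prototype repository that can be re-pointed at a different
// DataSource (and Oracle schema) for each job instance.
@Repository
@Scope(BeanDefinition.SCOPE_PROTOTYPE)
public class TableCopyRepository {

    private JdbcOperations jdbcOperations;

    // Called once per job, before the step that uses this repository runs.
    public void setDataSource(DataSource dataSource) {
        this.jdbcOperations = new JdbcTemplate(dataSource);
    }

    // Switch the Oracle session to the desired schema, as described above.
    public void setSchema(String schema) {
        jdbcOperations.execute("ALTER SESSION SET CURRENT_SCHEMA = " + schema);
    }

    // ... query methods using jdbcOperations ...
}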