I am planning to use Spring Batch in a distributed environment to do some batch processing tasks.
By distributed environment I mean I have a set of boxes fronted by a web service, and a load balancer distributes the jobs to the boxes.
Now I have few questions:
1) What happens if a job is terminated halfway through (say the box got restarted)? Will Spring Batch automatically restart the job, or do I need to write my own custom watcher and then call the Spring Batch API to restart the job?
2) If Spring Batch has this kind of auto restart, can two boxes pick up and execute the same job at once?
Is this the case?
Spring Batch has four strategies to handle scalability; see here for further details:
Multi-threaded Step (single process)
Parallel Steps (single process)
Remote Chunking of Step (multi process)
Partitioning a Step (single or multi process)
Yours is a multi-process scenario, so you can choose between remote chunking of a step and step partitioning, depending on the cost of the read part compared to the process/write part.
But in both cases there cannot be two instances that do duplicate work; it is all designed to avoid that. Duplicate work could only happen if you accidentally deployed one of the two single-process mechanisms on different machines, which would cause the problem you mention.
Restart logic is also provided; see the Restartability section here for further details.
Upon restart the job will go on reading, processing and writing the next chunk of data. If the reader/processor/writer are configured/written taking into account that the task is chunked, it will all work out of the box.
Usually this involves the write part marking the items read in that chunk as 'processed'.
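For illustration, here is a minimal sketch of that pattern with a JDBC-backed reader and writer; the person table, its processed flag, and the Person bean are assumptions for the example, not something from the question.

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;

public class ProcessedFlagConfig {

    // Hypothetical item bean for the example.
    public static class Person {
        private final long id;
        private final String name;
        public Person(long id, String name) { this.id = id; this.name = name; }
        public long getId() { return id; }
        public String getName() { return name; }
    }

    // The reader only picks up rows not yet marked as processed, so a restarted
    // job naturally resumes with the data the failed run never committed.
    public JdbcCursorItemReader<Person> reader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Person>()
                .name("personReader")
                .dataSource(dataSource)
                .sql("SELECT id, name FROM person WHERE processed = false")
                .rowMapper((rs, rowNum) -> new Person(rs.getLong("id"), rs.getString("name")))
                .build();
    }

    // The writer marks every item of the committed chunk as processed.
    public JdbcBatchItemWriter<Person> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .dataSource(dataSource)
                .sql("UPDATE person SET processed = true WHERE id = :id")
                .beanMapped()
                .build();
    }
}
```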
Related
I am working as a developer on a batch processing solution. How it works is that we split a big file and process it across JVMs. We have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request, which is consumed by the processor JVMs; the REST request has all the details, such as the file location to pick the file up from.
Now if I want to add another processor JVM without any downtime, is there any way we can do it? Currently we are maintaining the URLs for the 4 JVMs in a property file. Is there a better way to do it that would give me the ability to add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVMs behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale your JVMs up or down depending on the workload. Also, if one of the JVMs is not working, the rest of your system need not care about it anymore.
Not sure what your use case is or what tech stack you are following, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?

If you cannot use any of them, then the solution is to maintain the list of JVMs in some datastore (let's say a table); it is dynamic data, meaning one can add/remove/update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic; this resource manager needs to monitor the entire system. Also, whenever you create a task, chunk or slice of data, distribute it using a message queue such as Apache ActiveMQ; you can also consider Kafka for complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server you can think of making use of that capability. I hope this helps.
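As a rough sketch of the "list of JVMs in a table" idea, assuming a hypothetical worker_registry table and plain Spring JdbcTemplate (table and column names are illustrative only):

```java
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

// Minimal sketch of keeping the processor JVMs in a datastore instead of a property file.
public class WorkerRegistry {

    private final JdbcTemplate jdbcTemplate;

    public WorkerRegistry(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // New processor JVMs register themselves on startup; no gateway restart needed.
    public void register(String workerUrl) {
        jdbcTemplate.update(
                "INSERT INTO worker_registry (worker_url, active) VALUES (?, true)", workerUrl);
    }

    // The gateway reads the current set of active workers before splitting the file.
    public List<String> activeWorkers() {
        return jdbcTemplate.queryForList(
                "SELECT worker_url FROM worker_registry WHERE active = true", String.class);
    }
}
```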
I want to ensure that a Spring job is not started a second time while it still runs. This would be trivial in a single JVM environment.
However, how can I achieve this in a cluster environment (more specifically, in JBoss 5.1 - I know, a bit antiquated; if solutions exist for later versions, I'd be interested in those as well)?
So, it should be kind of a Singleton pattern across all cluster nodes.
I am considering using database locks or a message queue. Is there a simpler / better performing solution?
You need to synchronize threads that don't know anything about each other, so the easiest way is to share some information in a common place. Valid alternatives are:
A shared database
A shared file
An external web service holding the status of the batch process
If you prefer to use a shared database, try using a database like Redis to improve your performance. It is an in-memory database with persistence on disk, so accessing the status of the batch process should be fast enough.
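For example, here is a minimal sketch of a cluster-wide "is the job already running?" lock using Redis and the Jedis client. The key name, TTL handling and non-atomic release are assumptions for illustration, not a hardened implementation; any shared store with an atomic set-if-absent operation would work the same way.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class BatchRunLock {

    private static final String LOCK_KEY = "batch:myJob:lock"; // hypothetical key name

    // SET key value NX EX ttl: succeeds only if no other node currently holds the lock.
    public boolean tryAcquire(Jedis jedis, String nodeId, int ttlSeconds) {
        String result = jedis.set(LOCK_KEY, nodeId, SetParams.setParams().nx().ex(ttlSeconds));
        return "OK".equals(result);
    }

    // Only the owner releases the lock (check-then-delete, fine for a sketch).
    public void release(Jedis jedis, String nodeId) {
        if (nodeId.equals(jedis.get(LOCK_KEY))) {
            jedis.del(LOCK_KEY);
        }
    }
}
```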
This is too late, but for future lookups: Spring Batch uses its job repository to synchronize jobs, so you can avoid concurrent executions.
You can add a job listener and, in its before-job callback, use JobExecutionDao to find all JobExecutions. If more than one is running, throw an exception and exit the job.
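A minimal sketch of that listener, using JobExplorer (a convenient facade over JobExecutionDao) to look up running executions of the same job:

```java
import java.util.Set;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.explore.JobExplorer;

public class SingleInstanceListener implements JobExecutionListener {

    private final JobExplorer jobExplorer;

    public SingleInstanceListener(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        String jobName = jobExecution.getJobInstance().getJobName();
        Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
        // The current execution is already registered as running at this point,
        // so more than one means another node is still executing this job.
        if (running.size() > 1) {
            throw new IllegalStateException(jobName + " is already running on another node");
        }
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to do
    }
}
```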
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process will then output files which then get fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has been proving difficult to implement in a sensible manner for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off and process it and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect using the MultiResourcePartitioner would work. This Partitioner implementation provided by Spring Batch creates one partition per file as defined by its configuration.
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
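As a rough sketch of the master step described above, assuming local partitioning with a TaskExecutorPartitionHandler and an illustrative input file location:

```java
import java.io.IOException;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class MasterStepConfig {

    // One partition per input file; the worker step is executed once per partition.
    @Bean
    public Step masterStep(StepBuilderFactory steps, Step workerStep,
                           ResourcePatternResolver resolver) throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(resolver.getResources("file:/data/input/*.csv")); // hypothetical location
        partitioner.setKeyName("fileName"); // key under which each file lands in the step ExecutionContext

        // Local execution; swap in a MessageChannelPartitionHandler for remote partitioning.
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setGridSize(4);

        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(handler)
                .build();
    }
}
```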
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step in line. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
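And a corresponding sketch of the slave step, using a step-scoped SystemCommandTasklet bound to the partition's file; the script name and timeout are assumptions:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WorkerStepConfig {

    // Step-scoped so each partition gets its own tasklet bound to its own file.
    @Bean
    @StepScope
    public SystemCommandTasklet rTasklet(
            @Value("#{stepExecutionContext['fileName']}") String fileName) {
        SystemCommandTasklet tasklet = new SystemCommandTasklet();
        tasklet.setCommand("Rscript process.R " + fileName); // hypothetical R script
        tasklet.setTimeout(600_000); // 10 minutes per file, adjust as needed
        return tasklet;
    }

    @Bean
    public Step workerStep(StepBuilderFactory steps, SystemCommandTasklet rTasklet) {
        return steps.get("workerStep").tasklet(rTasklet).build();
    }
}
```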
How the step is launched depends on remote vs local partitioning but is also straightforward.
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
I am working on a scheduled job that will run at a certain interval (e.g. once a day at 1pm), scheduled through cron. I am working with Java and Spring.
Writing the scheduled job is easy enough. It does the following: grab a list of people with certain criteria from the db, then for each person do some calculation and trigger a message.
I am working in a single-node environment locally and in testing; however, when we go to production it will be a multi-node environment (with load balancer, etc.). My concern is how the multi-node environment would affect the scheduled job.
My guess is I could (or very likely would) end up triggering duplicate messages:
Machine 1: Grab list of people, do calculation
Machine 2: Grab list of people, do calculation
Machine 1: Trigger message
Machine 2: Trigger message
Is my guess correct?
What would be the recommended solution to avoid the above issue? Do I need to create a master/slave distributed system solution to manage multi node environment?
If you have something like three Tomcat instances, each load balanced behind Apache, for example, and your application runs on each of them, then you will have three different triggers and your job will run three times. I don't think you will have a multi-node environment with distributed job execution unless some kind of mechanism for distributing the parts of the job is in place.
If you haven't looked at this project yet, take a peek at Spring XD. It handles Spring Batch Jobs and can be run in distributed mode.
I'm working on an application that uses Quartz for scheduling jobs. The jobs to be scheduled are created programmatically by reading a properties file. My question is: if I have a cluster of several nodes, which of these should create the schedules programmatically? Only one of them? Or maybe all?
I have used Quartz in a web app where users, among other things, could create Quartz jobs that performed certain tasks.
We have had no problems on that app provided that at least the job names are different for each job. You can also have different group names, and if I remember correctly the job group + job name combination forms a job key.
Anyway, we had no problem with creating and running the jobs from different nodes, but Quartz at the time (some 6 months ago; I do not believe this has changed, but I am not sure) did not offer the possibility to stop jobs across the cluster; it could only stop jobs on the node the stop command was executed on.
If instead you just want to create a fixed number of jobs when the application starts, you had better delegate that task to one of the nodes, as the job names/groups will be read from the same properties file on each node, and conflicts will arise.
Have you tried creating them on all of them? I think you would get some conflicts because of duplicate names.
So I think one of the members should create the schedules during startup.
You should only have one system scheduling jobs for the cluster if they are predefined in properties as you say. If all of the systems did it, you would needlessly recreate the jobs and might put them in a weird state if every server created or deleted the same jobs and triggers.
You could simply deploy the properties for the jobs to only one server, and then only that server would try to create them.
You could make a separate app that has the purpose of scheduling the jobs and only run it once.
If these are web servers, you could make a simple secured REST API that triggers the scheduling process. Then you could write an automated script to access the API and kick off the scheduling of jobs as part of a deployment, or whenever else you desired. If you have multiple servers behind a load balancer, the request should go to only one server, which would schedule the jobs, and Quartz would save them to the database-backed job store. The other nodes in the cluster would receive them the next time they update from the database.
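As a rough sketch of such an endpoint, assuming Spring MVC, a JDBC-backed Quartz job store, and a hypothetical MyJob class; the checkExists call keeps it idempotent if the endpoint is ever hit more than once:

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SchedulingController {

    private final Scheduler scheduler;

    public SchedulingController(Scheduler scheduler) {
        this.scheduler = scheduler;
    }

    // Hypothetical admin endpoint; secure it (e.g. with Spring Security) before exposing it.
    @PostMapping("/admin/schedule-jobs")
    public ResponseEntity<String> scheduleJobs() throws SchedulerException {
        JobDetail job = JobBuilder.newJob(MyJob.class)
                .withIdentity("nightlyJob", "batch")
                .storeDurably()
                .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightlyTrigger", "batch")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 1 * * ?")) // 1am daily
                .build();
        // The JDBC jobstore is shared by the cluster, so a job scheduled once
        // is visible to every node; skip scheduling if it already exists.
        if (!scheduler.checkExists(job.getKey())) {
            scheduler.scheduleJob(job, trigger);
        }
        return ResponseEntity.ok("jobs scheduled");
    }

    // Placeholder Quartz job; the real class would do the actual work.
    public static class MyJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // real work goes here
        }
    }
}
```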