I am working on a scheduled job that will run at a certain interval (e.g. once a day at 1pm), scheduled through cron. I am working with Java and Spring.
Writing the scheduled job is easy enough. It grabs a list of people matching certain criteria from the db, and for each person does some calculation and triggers a message.
I am working in a single-node environment locally and in testing; however, when we go to production it will be a multi-node environment (with a load balancer, etc.). My concern is: how would a multi-node environment affect the scheduled job?
My guess is I could (or very likely would) end up triggering duplicate messages.
Machine 1: Grab list of people, do calculation
Machine 2: Grab list of people, do calculation
Machine 1: Trigger message
Machine 2: Trigger message
Is my guess correct?
What would be the recommended solution to avoid the above issue? Do I need to create a master/slave distributed system solution to manage multi node environment?
If you have something like three Tomcat instances, each load balanced behind Apache, for example, and your application runs on each of them, then you will have three different triggers and your job will run three times. I don't think you will have a multi-node environment with distributed job execution unless some kind of mechanism for distributing the parts of the job is in place.
If you haven't looked at this project yet, take a peek at Spring XD. It handles Spring Batch Jobs and can be run in distributed mode.
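To make the duplicate-run scenario from the question concrete, here is a minimal sketch of one common way to avoid it: every node's trigger still fires, but each node first tries to claim the scheduled run in a shared store, and only the winner sends messages. The names `RunClaims`, `NotifyJob` and the job name `notify-people` are hypothetical, and the in-memory set only stands in for something shared between nodes (for example a database table with a unique key on job name plus scheduled time).

```java
import java.time.LocalDateTime;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a shared store, e.g. a table with a unique key on
// (job name, scheduled time). In production this state must live outside the JVM.
class RunClaims {
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller that claims this scheduled run. */
    boolean tryClaim(String jobName, LocalDateTime scheduledFor) {
        return claimed.add(jobName + "@" + scheduledFor);
    }
}

class NotifyJob {
    private final RunClaims claims;

    NotifyJob(RunClaims claims) {
        this.claims = claims;
    }

    /** Called by every node's trigger; only the node that wins the claim does the work. */
    boolean runIfFirst(LocalDateTime scheduledFor) {
        if (!claims.tryClaim("notify-people", scheduledFor)) {
            return false; // another node already took this run
        }
        // grab list of people, do calculation, trigger messages (omitted)
        return true;
    }
}
```

With two `NotifyJob` instances sharing one `RunClaims`, the first call for the 1pm run returns true and the second returns false, so the message is triggered once.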
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process will then output files which then get fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has been proving difficult to implement in a sensible manner for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off and process it and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect using the MultiResourcePartitioner would work. This Partitioner implementation provided by Spring Batch creates one partition per file as defined by its configuration.
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step in line. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
How the step is launched depends on whether you use remote or local partitioning, but it is also straightforward.
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
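To illustrate how the two pieces divide the responsibility, here is a plain-JDK sketch of the idea. It is not the Spring Batch API: `FilePartitionSketch`, `partition` and `handle` are hypothetical names, the worker is passed in as a function returning an exit code, and execution is local on a thread pool. In the real job the worker would be the step that launches the R process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

class FilePartitionSketch {
    /** Plays the Partitioner role: one named partition per input file. */
    static Map<String, String> partition(List<String> files) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (int i = 0; i < files.size(); i++) {
            partitions.put("partition" + i, files.get(i));
        }
        return partitions;
    }

    /** Plays the PartitionHandler role: runs every partition, then aggregates one status. */
    static boolean handle(Map<String, String> partitions, ToIntFunction<String> worker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String file : partitions.values()) {
                Callable<Integer> task = () -> worker.applyAsInt(file);
                results.add(pool.submit(task));
            }
            boolean allOk = true;
            for (Future<Integer> result : results) {
                try {
                    allOk &= result.get() == 0; // exit code 0 means the worker succeeded
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    allOk = false;
                } catch (ExecutionException e) {
                    allOk = false; // the worker threw instead of returning a code
                }
            }
            return allOk;
        } finally {
            pool.shutdown();
        }
    }
}
```

Swapping the thread pool for a message channel is what turns this from local into remote partitioning; the partitioning and the aggregation stay the same.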
I am planning to use Spring Batch in a distributed environment to do some batch processing tasks.
By distributed environment I mean I have a set of boxes fronted by a web service. A load balancer then distributes the jobs to the boxes.
Now I have a few questions:
1) What happens if a job is terminated halfway (say the box got restarted)? Will Spring Batch automatically restart the job, or do I need to write my own custom watcher and then call the Spring Batch API to restart the job?
2) If Spring Batch has this kind of auto restart, can two boxes pick up and execute the same job at once?
Is this the case?
Spring Batch has four strategies to handle scalability, see here for further details:
Multi-threaded Step (single process)
Parallel Steps (single process)
Remote Chunking of Step (multi process)
Partitioning a Step (single or multi process)
Yours is a multi-process scenario, so you can choose between remote chunking of a step and partitioning a step, depending on the cost of the read part compared to the process/write part.
But in both cases there cannot be two instances doing duplicate work; it is all designed to avoid that. It could only happen if you accidentally deployed one of the two single-process mechanisms on different machines, which would cause the problem you mention.
Restart logic is also foreseen, see here the Restartability section for further details.
Upon restart the job will go on reading, processing and writing the next chunk of data. If the reader/processor/writer are configured and written taking into account that the task is chunked, it will all work out of the box.
Usually it involves including in the write part marking the read items in that chunk as 'processed'.
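A minimal sketch of that idea in plain Java follows (not the Spring Batch API; `ChunkedRestartSketch` and its in-memory `processedIds` set are hypothetical stand-ins for a persistent 'processed' flag). The reader only returns unprocessed items, and a chunk is marked only after its write succeeds, so a rerun after a crash continues with the first unfinished chunk.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

class ChunkedRestartSketch {
    // Hypothetical stand-in for a persistent 'processed' flag, e.g. a column in the source table.
    final Set<Integer> processedIds = new LinkedHashSet<>();

    /** Reads only unprocessed ids, writes them chunk by chunk, and marks each chunk after its write. */
    void run(List<Integer> allIds, int chunkSize, Consumer<List<Integer>> writer) {
        List<Integer> pending = new ArrayList<>();
        for (Integer id : allIds) {
            if (!processedIds.contains(id)) {
                pending.add(id);
            }
        }
        for (int start = 0; start < pending.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, pending.size());
            List<Integer> chunk = pending.subList(start, end);
            writer.accept(chunk);       // if this throws, nothing in the chunk is marked
            processedIds.addAll(chunk); // in a real system: same transaction as the write
        }
    }
}
```

Calling `run` again with the same input after a failure skips everything already marked, which is the behaviour the restart relies on.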
I'm new to web servers. I have a Java class that does a set of computations. I want to have this Java class run every hour and update my domain on AWS with the data.
My question is how/where do I set this job to run?
Is there a standard for this? Or does AWS have something I can use? I know how to read/write my data to AWS.
Should a cron job be used? Should the cron job run on AWS?
You have 2 options for this.
Set a cron job and let the operating system execute the script that starts your java program every hour or so.
Use something like Quartz Scheduler. In this case your Java program would be running continuously and the scheduler would be within your Java program.
There are various advantages and disadvantages to both approaches. In the first case the advantage is that if something goes wrong with the program, you know that in the next hour a new process with a fresh instance of your program will launch, while in the second case if your Java program hangs for some reason you won't know unless you have some kind of monitoring. However, in case 2 you can maintain state information you might want to keep between runs. Quartz also has lots of advanced features, like maintaining info about executions in a database.
You can also have the Quartz Scheduler run within your webserver itself (so no need for another process). It's just a few extra .jar files to include. So it depends on what you actually want to do. You can refer to what features it supports here.
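For a feel of what the second option looks like, here is a minimal sketch of an in-process hourly schedule. Note that it uses the JDK's `ScheduledExecutorService` rather than Quartz, so it has none of Quartz's persistence features; `HourlyRunner` and its method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HourlyRunner {
    /** Time from 'now' until the start of the next full hour. */
    static Duration untilNextHour(LocalDateTime now) {
        LocalDateTime next = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        return Duration.between(now, next);
    }

    /** Runs the task at the top of every hour for as long as the JVM stays up. */
    static ScheduledExecutorService start(Runnable task) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable safeTask = () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would suppress all later runs.
                e.printStackTrace();
            }
        };
        scheduler.scheduleAtFixedRate(
                safeTask,
                untilNextHour(LocalDateTime.now()).toMillis(),
                TimeUnit.HOURS.toMillis(1),
                TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

The try/catch around the task matters here for the reason given above: with an in-process scheduler nothing restarts the program for you, so a failing run should be logged rather than allowed to end the schedule silently.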
I've got a Spring Web application that's running on two different instances.
The two instances aren't aware of each other, they run on distinct servers.
That application has a scheduled Quartz job, but my problem is that the job shouldn't execute simultaneously on both instances: it's a mail sending job, so it could cause duplicate emails to be sent.
I'm using RAMJobStore, and JDBCJobStore is not an option for me due to the large number of tables it requires. (I can't create that many tables due to an internal restriction.)
The solutions I thought about:
- creating a single control table that has to be checked every time a job starts (with repeatable read isolation level to avoid concurrency issues). The problem is that if the server is killed, the table might be left in an invalid state.
- using properties to define a single server to be the job running server. The problem is that if that server goes down, jobs will stop running.
Has anyone ever experienced this problem and do you have any thoughts to share?
Start with the second solution (deactivate Quartz on all nodes except one). It is very simple to do and it is safe. Count how frequently your server goes down. If that is unacceptable, then try the first solution. The problem with the first solution is that you need good skills in multithreaded programming to implement it without bugs. It is not so simple if multithreading is not your everyday task, and the cost of a bug in your implementation may be bigger than the actual benefit.
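If you do end up trying the first solution, one way to deal with the "server is killed" concern from the question is to store an expiry time next to the lock, so a stale lock frees itself. This is my own assumption about how the control table could look, not something either poster described. In the sketch below `JobLease` is a hypothetical in-memory stand-in for the single row, and `synchronized` plays the role of the row lock or isolation level.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical in-memory stand-in for a single-row control table shared by both servers.
class JobLease {
    private String owner;      // column: which server holds the lock
    private Instant expiresAt; // column: when the lock stops being valid

    /** Takes the lease only if nobody holds it or the previous holder's lease has expired. */
    synchronized boolean tryAcquire(String server, Instant now, Duration ttl) {
        boolean free = owner == null || !now.isBefore(expiresAt);
        if (!free) {
            return false; // another server holds a lease that is still valid
        }
        owner = server;
        expiresAt = now.plus(ttl);
        return true;
    }
}
```

The lease is deliberately never released early: if it were, a second server whose trigger fires a moment later would run the job again. The ttl should therefore be shorter than the job interval but longer than the clock difference between the servers.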
I'm working on an application that uses Quartz for scheduling Jobs. The Jobs to be scheduled are created programmatically by reading a properties file. My question is: if I have a cluster of several nodes which of these should create schedules programmatically? Only one of these? Or maybe all?
I have used Quartz in a web app where users, among other things, could create Quartz jobs that performed certain tasks.
We had no problems in that app provided that at least the job names were different for each job. You can also have different group names, and if I remember correctly the jobgroup+jobname combination forms a job key.
Anyway, we had no problem with creating and running the jobs from different nodes, but Quartz at the time (some 6 months ago; I do not believe this has changed, but I am not sure) did not offer the possibility to stop jobs in the cluster. It could only stop jobs on the node the stop command was executed on.
If instead you just want to create a fixed number of jobs when the application starts, you had better delegate that to one of the nodes, as the job name/group will be read from the same properties file on each node, and conflicts will arise.
Have you tried creating them on all of them? I think you would get some conflict because of duplicate names.
So I think one of the members should create the schedules during startup.
You should only have one system scheduling jobs for the cluster if they are predefined in properties as you say. If all of the systems did it, you would needlessly recreate the jobs and might put them in a weird state if every server created or deleted the same jobs and triggers.
You could simply only deploy the properties for the jobs to one server and then only one server would try to create them.
You could make a separate app that has the purpose of scheduling the jobs and only run it once.
If these are web servers, you could make a simple secured REST API that triggers the scheduling process. Then you could write an automated script to access the API and kick off the scheduling of jobs as part of a deployment or whenever else you desire. If you have multiple servers behind a load balancer, the request should go to only one server, which schedules the jobs, and Quartz saves them to the database-backed jobstore. The other nodes in the cluster would receive them the next time they update from the database.