Is there a simple way to start jobs only when certain data is available in a DB or a JMS queue, i.e. some kind of conditional job launching / hook, ideally without any program that constantly checks the state of a data source?
Imagine: a preceding job has written some data to a DB, and only once this data is written should the next job start.
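To make it concrete, here is the kind of hook I'm imagining, as a rough sketch assuming Spring JMS and Spring Batch; the queue name "data.ready" and the followUpJob bean are placeholders:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

// Rough sketch: the preceding job sends a message once its data is
// committed, and this listener launches the follow-up job. The queue
// name "data.ready" and the followUpJob bean are hypothetical.
@Component
public class DataReadyListener {

    private final JobLauncher jobLauncher;
    private final Job followUpJob;

    public DataReadyListener(JobLauncher jobLauncher, Job followUpJob) {
        this.jobLauncher = jobLauncher;
        this.followUpJob = followUpJob;
    }

    @JmsListener(destination = "data.ready")
    public void onDataReady(String payload) throws Exception {
        // unique parameters so each message launches a new job instance
        jobLauncher.run(followUpJob, new JobParametersBuilder()
                .addString("trigger", payload)
                .addLong("timestamp", System.currentTimeMillis())
                .toJobParameters());
    }
}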
I am currently working on a scheduled task that runs behind the scenes of my Spring web application. The task uses a cron scheduler to execute at midnight every night and clean up unused applications for my portal (my site allows users to create an application to fill out; if they don't access the form within 30 days, my background task deletes it from our DB and sends the user an email telling them to create a new form if needed). Everything works great in my test environment, and I am ready to move to QA.
However, my next environment uses two load-balanced servers to process requests. This is a problem, as the cron scheduler and my polling task run concurrently on both servers. While the reads/writes to the DB won't be an issue, the issue lies with sending the notification email to the application user. Without any polling locks, two emails could be generated and sent, and I would like to avoid this. Normally, we would use a SQL stored procedure and a lock field in our DB, set and released whenever the polling code is called, so that only one instance of the polling executes. However, with my new polling task, we don't have any fields available, so I am trying to work out a Spring solution. I found this resource online:
http://www.springframework.net/doc-latest/reference/html/threading.html
I was thinking of using it like this:
Semaphore _pollingLock = new Semaphore(1);

_pollingLock.acquire(); // blocks until a permit is available
try {
    // run my polling task
} finally {
    _pollingLock.release(); // always release the permit when done
}
However, I'm not sure whether this will just make the second instance execute afterwards, or whether it will skip the second instance so that it never executes. Or is this solution not even appropriate, and is there a better one? Again, I am using the Spring Java framework, so any solution that exists there would be my best bet.
Two ways that we've handled this sort of problem in the past both start with designating one of our clustered servers as the one responsible for a specific task (say, sending email, or running a job).
In one solution, we set a JVM parameter on all clustered servers identifying the server name of the one server on which your process should run. For example -DemailSendServer=clusterMember1
In another solution, we simply provided a JVM parameter in the startup of this designated server alone. For example -DsendEmailFromMe=true
In both cases, you can add a tiny bit of code in your process to gate it based on the value or presence of the startup parameter.
I've found the second option simpler to use since the presence of the parameter is enough to allow the process to run. In the first solution, you would have to compare the current server name against the value of the parameter instead.
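For example, a minimal sketch of that gate, assuming the -DsendEmailFromMe flag from the second option (runNightlyCleanup() stands in for your own task):

// Boolean.getBoolean returns true only if the JVM was started with
// -DsendEmailFromMe=true, so only the designated server runs the task.
public class EmailTaskGate {

    public void runNightlyCleanup() {
        if (!Boolean.getBoolean("sendEmailFromMe")) {
            return; // not the designated server; skip the email step
        }
        // ... delete stale applications and send the notification emails ...
    }
}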
We haven't done much with Spring Batch, but I would assume there is a way to configure Batch to run a job on a single server within a cluster as well.
I have a Java application that does Flink batch processing of data obtained by querying tables from the database and feeds it into a Kafka topic. How would I make this run periodically? Is there a Flink scheduler? For example, my Java application should keep running in the background, and the Flink scheduler should periodically query the tables from the database, batch-process the results, and feed them into Kafka (the Flink batch processing and feeding into Kafka are already done as part of my application). Please help if anyone has pointers on this.
Flink does not provide a job scheduler.
Have you considered implementing the use case with a continuously running Flink DataStream application? You could implement a SourceFunction that periodically queries the database.
Continuous streaming applications have the benefit of fewer moving parts (no scheduler, no failure handling if something goes wrong) and a consistent view across the boundaries of "batches". The downside is that the job is always consuming resources (Flink is not able to automatically scale down at low load).
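For example, a minimal sketch of such a source, using the SourceFunction interface; queryDatabase() is a placeholder for your own JDBC code, and the interval is arbitrary:

import java.util.Collections;
import java.util.List;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Sketch: a source that re-queries the database on a fixed interval and
// emits each row, turning the periodic batch into a continuous stream.
public class PeriodicJdbcSource implements SourceFunction<String> {

    private volatile boolean running = true;
    private final long intervalMillis;

    public PeriodicJdbcSource(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            for (String row : queryDatabase()) {
                // emit under the checkpoint lock for consistent checkpoints
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(row);
                }
            }
            Thread.sleep(intervalMillis); // wait before the next query
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private List<String> queryDatabase() {
        // placeholder: run the SELECT and map each row to a String
        return Collections.emptyList();
    }
}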
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process then outputs files which are fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has proven difficult to implement sensibly for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor, I would put each "external task" on a JMS queue, let something pull it off and process it, and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single-step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect the MultiResourcePartitioner to work. This Partitioner implementation provided by Spring Batch creates one partition per file, as defined by its configuration.
The PartitionHandler is where you choose if you're going to be executing the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
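As a rough sketch of that wiring, assuming Spring Batch 4-style Java config with local execution (the bean names, file pattern, and task executor are assumptions):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

// Sketch of a partitioned master step: one partition per input file,
// executed locally via a TaskExecutor. The file pattern is hypothetical.
@Bean
public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] files = new PathMatchingResourcePatternResolver()
            .getResources("file:/data/input/*.csv");
    partitioner.setResources(files);

    return steps.get("masterStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)
            .gridSize(files.length)                  // one partition per file
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}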
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step inline. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM); a sketch follows below.
How the step is launched depends on remote vs local partitioning, but is also straightforward.
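For example, a hedged sketch of that worker tasklet in Java config; the Rscript path and timeout are placeholders, and "fileName" is the key the MultiResourcePartitioner uses by default for each partition's resource:

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

// Sketch: a step-scoped tasklet that shells out to R for the one file
// assigned to this partition.
@Bean
@StepScope
public SystemCommandTasklet rScriptTasklet(
        @Value("#{stepExecutionContext['fileName']}") String fileName) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();
    tasklet.setCommand("Rscript /opt/scripts/process.R " + fileName);
    tasklet.setTimeout(600000L); // the external process can take minutes
    return tasklet;
}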
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
I am planning to use Spring Batch in a distributed environment to do some batch-processing tasks.
When I say distributed environment, I mean I have a set of boxes fronted by a web service; a load balancer distributes the jobs to the boxes.
Now I have a few questions:
1) What happens if a job is terminated halfway through (say the box got restarted)? Will Spring Batch automatically restart the job, or do I need to write my own custom watcher and then call the Spring Batch API to restart the job?
2) If Spring Batch has this kind of auto-restart, can two boxes pick up and execute the same job at once? Is this the case?
Spring Batch has four strategies to handle scalability; see the scalability section of the documentation for further details:
Multi-threaded Step (single process)
Parallel Steps (single process)
Remote Chunking of Step (multi process)
Partitioning a Step (single or multi process)
Yours is a multi-process scenario, so you can choose between remote chunking and partitioning of the step, depending on the cost of the read part compared to the process/write part.
But in both cases there cannot be two instances doing duplicate work; it is all designed to avoid that. That could only happen if you accidentally deployed one of the two single-process mechanisms on different machines, which would cause the problem you mention.
Restart logic is also foreseen; see the Restartability section of the documentation for further details.
Upon restart the job will go on reading, processing and writing the next chunk of data. If the reader/processor/writer are configured/written taking into account that the task is chunked, it will all work out of the box.
Usually this involves having the write part mark the read items in that chunk as 'processed'.
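For instance, a minimal sketch of such a writer, using a Spring Batch 4-style ItemWriter; the table and column names are assumptions:

import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch: mark each item in the chunk as processed, so that after a
// restart the reader only picks up unprocessed rows.
public class MarkProcessedWriter implements ItemWriter<Long> {

    private final JdbcTemplate jdbcTemplate;

    public MarkProcessedWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends Long> ids) {
        for (Long id : ids) {
            jdbcTemplate.update(
                "UPDATE work_item SET processed = 1 WHERE id = ?", id);
        }
    }
}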
I'm working on an application that uses Quartz for scheduling Jobs. The Jobs to be scheduled are created programmatically by reading a properties file. My question is: if I have a cluster of several nodes, which of these should create the schedules programmatically? Only one of them? Or maybe all?
I have used Quartz in a web app where users, among other things, could create Quartz jobs that performed certain tasks.
We have had no problems on that app provided that at least the job names are different for each job. You can also have different group names, and if I remember correctly the job group + job name combination forms a job key.
Anyway, we had no problem creating and running the jobs from different nodes, but Quartz at the time (some six months ago; I do not believe this has changed, but I am not sure) did not offer the possibility of stopping jobs across the cluster; it could only stop jobs on the node where the stop command was executed.
If instead you just want to create a fixed number of jobs when the application starts, you had better delegate that task to one of the nodes, as the job names/groups will be read from the same properties file on each node, and conflicts will arise.
Have you tried creating them on all of the nodes? I think you would get conflicts because of duplicate names.
So I think one of the members should create the schedules during startup.
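One way to keep startup creation safe even if more than one node runs it is to guard on the job key; a minimal sketch, assuming a shared job store and a hypothetical CleanupJob class implementing org.quartz.Job:

import org.quartz.CronScheduleBuilder;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

// Sketch: guard job creation by key so that a second node reading the
// same properties file does not create a duplicate.
public class JobBootstrap {

    void scheduleIfAbsent(Scheduler scheduler) throws Exception {
        JobKey key = JobKey.jobKey("cleanupJob", "maintenance");
        if (!scheduler.checkExists(key)) {
            JobDetail job = JobBuilder.newJob(CleanupJob.class) // placeholder Job class
                    .withIdentity(key)
                    .build();
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("cleanupTrigger", "maintenance")
                    .withSchedule(CronScheduleBuilder.cronSchedule("0 0 0 * * ?")) // midnight
                    .build();
            scheduler.scheduleJob(job, trigger);
        }
    }
}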
You should only have one system scheduling jobs for the cluster if they are predefined in properties as you say. If all of the systems did it, you would needlessly recreate the jobs and might put them in a weird state if every server created or deleted the same jobs and triggers.
You could simply only deploy the properties for the jobs to one server and then only one server would try to create them.
You could make a separate app that has the purpose of scheduling the jobs and only run it once.
If these are web servers, you could make a simple secured REST API that triggers the scheduling process. Then you could write an automated script to access the API and kick off the scheduling of jobs as part of a deployment, or whenever else you desired. If you have multiple servers behind a load balancer, the request should go to only one server, which would schedule the jobs; Quartz would save them to the database-backed job store, and the other nodes in the cluster would pick them up the next time they update from the database.