In my web service, every method call submits a job to a queue. These operations take a long time to execute, so each one submits a Job to a queue and returns a status of "Submitted". The client then keeps polling another service method to check the status of the job.
Presently, I create my own Queue and Job classes that are Serializable and persist these jobs (i.e., their serialized byte stream) into the database. So an UpdateLogistics operation just adds an "UpdateLogisticsJob" to the queue and returns. I have written my own JobExecutor, which wakes up every N seconds, scans the database table for any existing jobs, and executes them. Note that the jobs have to be persisted because they have to survive app-server crashes.
This was done a long time ago, and I used bespoke classes for my Queues, Jobs, Executors, etc. Now I would like to know whether someone has done something similar before. In particular:
Are there frameworks available for this? Something in Spring/Apache, etc.?
Any framework that is easy to adapt/debug and plays well along with libraries like Spring will be great.
EDIT - Quartz
Sorry if I had not explained this well enough. Quartz is good for stateless jobs (and also for some stateful jobs), but the key for me is very stateful, persisted "job instances" (not just jobs or tasks). For example, an operation such as executeWorkflow("SUBMIT_LEAVE") might actually create 5 job instances, each with at least 5-10 parameters like userId, accountId, etc., to be saved into the database.
I was looking for support in that area, where job instances can be saved into the DB, recreated later, and so on.
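For readers following along, the hand-rolled setup described in the question (serializable job instances stored as bytes, recreated and run by a polling executor) can be sketched roughly as below. This is only an illustration under assumptions: JobStoreSketch, its in-memory map standing in for the jobs table, and the run() method are hypothetical names, and the periodic wake-up is reduced to a single pollOnce() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the jobs table: job id -> serialized job bytes.
final class JobStoreSketch {

    /** A job instance that carries its own parameters (names are illustrative). */
    static final class UpdateLogisticsJob implements Serializable {
        private static final long serialVersionUID = 1L;
        final String userId;
        final String accountId;

        UpdateLogisticsJob(String userId, String accountId) {
            this.userId = userId;
            this.accountId = accountId;
        }

        String run() {
            return "updated logistics for " + userId + "/" + accountId;
        }
    }

    private final Map<Long, byte[]> table = new LinkedHashMap<>();
    private long nextId = 1;

    /** Serializes the job and stores the bytes, as a BLOB column would. */
    long submit(UpdateLogisticsJob job) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(job);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long id = nextId++;
        table.put(id, bytes.toByteArray());
        return id;
    }

    /** One executor wake-up: recreate every stored job, run it, delete its row. */
    List<String> pollOnce() {
        List<String> results = new ArrayList<>();
        Iterator<Map.Entry<Long, byte[]>> rows = table.entrySet().iterator();
        while (rows.hasNext()) {
            byte[] row = rows.next().getValue();
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(row))) {
                results.add(((UpdateLogisticsJob) in.readObject()).run());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
            rows.remove(); // the job is done, so its row is removed
        }
        return results;
    }
}
```

Because the parameters travel inside the serialized object, a restart loses nothing that was already written to the table; the frameworks suggested in the answers take over this bookkeeping.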
Take a look at JBoss jBPM. It's a workflow definition package that lets you mix automated and manual processes. Tasks are persisted to a DB back end, and it looks like it has some asynchronous execution properties.
I haven't used Quartz for a long time, but I suspect it would be capable of everything you want to do.
spring-batch plus quartz
Depending upon the nature of your job, you might also look into spring-integration to assist with queue processing. But spring-batch will probably handle most of your requirements.
Please try ted-driver (https://github.com/labai/ted)
Its purpose is similar to what you need: you create a task (or many of them), which is saved in the DB, and ted-driver is then responsible for executing it. On error, you can postpone a retry for later or finish with an error status.
Unlike other java frameworks, here tasks are in simple and clear structure in database, where you can manually search or update using standard sql.
Related
I have run into a case where I have to use a persistent scheduler, since I have a web application that can crash or close due to some problem and might lose its job details if this happens. I have tried the following:
Use Quartz scheduler:
I used RAMJobStore first, but since it isn't persistent, it wasn't of much help. I can't set up JDBCJobStore because that would require huge changes to my existing code base.
In light of such a scenario,
I have the following queries:
If I use Spring's built-in @Scheduled annotation, will my jobs be persistent? I don't mind if the jobs get scheduled after the application starts. All I want is for the jobs not to lose their details and triggers.
If not, are there any other alternatives, keeping in mind that I need to schedule multiple jobs with my scheduler?
If yes, how can I achieve this? My triggers are different for each job. For example, I might have one job scheduled at 9 AM and another at 8:30 AM, and so on.
If not a scheduler, is there another mechanism that can handle this?
One thing I found is that the documentation for Quartz isn't very descriptive. I mean, it's fine for a top-level config, but configuring it in your own application is a pain. This is just a side note and has nothing to do with the question.
Appreciate the help. :)
No, Spring's @Scheduled annotation will typically only instruct Spring at what times a certain task should be scheduled to run within the current VM. As far as I know, there is no context for the execution either. The schedule is static.
I had a similar requirement and created db-scheduler (https://github.com/kagkarlsson/db-scheduler), a simple, persistent and cluster-friendly scheduler. It stores the next execution-time in the database, and triggers execution once it is reached.
A very simple example for a RecurringTask without context could look like this:
final RecurringTask myDailyTask = ComposableTask.recurringTask("my-daily-task", Schedules.daily(LocalTime.of(8, 0)),
() -> System.out.println("Executed!"));
final Scheduler scheduler = Scheduler
.create(dataSource)
.startTasks(myDailyTask)
.threads(5)
.build();
scheduler.start();
It will execute the task named my-daily-task at 08:00 every day. It will be scheduled in the database when the scheduler is first started, unless it already exists in the database.
If you want to schedule an ad-hoc task some time in the future with context, you can use the OneTimeTask:
final OneTimeTask oneTimeTask = ComposableTask.onetimeTask("my-onetime-task",
(taskInstance, context) -> System.out.println("One-time task with identifier "+taskInstance.getId()+" executed!"));
scheduler.scheduleForExecution(LocalDateTime.now().plusDays(1), oneTimeTask.instance("1001"));
See the example above. Any number of tasks can be scheduled, as long as the combination of task-name and instanceIdentifier is unique.
@Scheduled has nothing to do with the actual executor. The default Java executors aren't persistent (maybe there are some app-server-specific ones that are); if you want persistence, you have to use Quartz for job execution.
I'm using Quartz scheduling with two jobs. The first job performs its tasks for around 2 minutes, and the second cleans up temporary files. I need to set up the schedule so that the cleanup job runs only after the first job has finished.
Example 9 (Job Listeners) in the Quartz 2.1.x documentation shows that we can define a method named jobWasExecuted( _, _ ) in a job listener, which is called once the first job has executed.
Are we able to setup the schedule which can listen for the first job finishing then executes second? or,
Are we able to define the join() method like in Java Multithreading which can execute on the completion of first job?
There currently is no "direct" or "free" way to chain triggers with Quartz. However there are several ways you can accomplish it without much effort. Below is an outline of a couple approaches:
One way is to use a listener (i.e. a TriggerListener, JobListener or SchedulerListener) that can notice the completion of a job/trigger and then immediately schedule a new trigger to fire. This approach can get a bit involved, since you'll have to inform the listener which job follows which - and you may need to worry about persistence of this information.
Another way is to build a Job that contains within its JobDataMap the name of the next job to fire, and as the job completes (the last step in its Execute() method) have the job schedule the next job. Several people are doing this and have had good luck. Most have made a base (abstract) class that is a Job that knows how to get the job name and group out of the JobDataMap using special keys (constants) and contains code to schedule the identified job. Then they simply make extensions of this class that included the additional work the job should do.
Ref: http://www.quartz-scheduler.net/documentation/faq.html#how-do-i-chain-job-execution?-or,-how-do-i-create-a-workflow?
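The second approach from the FAQ (a base job that reads the next job's name from its data map and fires it as its last step) might look roughly like this. The sketch deliberately avoids the Quartz API: MiniScheduler, ChainedJob and the "nextJob" key are hypothetical stand-ins, and the next job is run directly instead of being scheduled through a trigger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChainSketch {
    // Hypothetical well-known key under which a job stores its successor's name.
    static final String NEXT_JOB_KEY = "nextJob";

    /** Minimal stand-in for a scheduler: a registry of named jobs plus a run log. */
    static final class MiniScheduler {
        private final Map<String, ChainedJob> jobs = new HashMap<>();
        final List<String> log = new ArrayList<>();

        void register(String name, ChainedJob job) {
            jobs.put(name, job);
        }

        void trigger(String name) {
            ChainedJob job = jobs.get(name);
            if (job == null) {
                throw new IllegalArgumentException("Unknown job: " + name);
            }
            job.execute(name, this);
        }
    }

    /** Base class: subclasses do the work, the base class fires the successor. */
    abstract static class ChainedJob {
        final Map<String, String> dataMap = new HashMap<>();

        abstract void doWork(String name, MiniScheduler scheduler);

        final void execute(String name, MiniScheduler scheduler) {
            doWork(name, scheduler);
            // Last step: look up the successor and fire it, if one is configured.
            String next = dataMap.get(NEXT_JOB_KEY);
            if (next != null) {
                scheduler.trigger(next);
            }
        }
    }

    /** Example job that only records that it ran. */
    static final class LoggingJob extends ChainedJob {
        @Override
        void doWork(String name, MiniScheduler scheduler) {
            scheduler.log.add(name);
        }
    }

    static List<String> demo() {
        MiniScheduler scheduler = new MiniScheduler();
        LoggingJob main = new LoggingJob();
        main.dataMap.put(NEXT_JOB_KEY, "cleanup-job");
        scheduler.register("main-job", main);
        scheduler.register("cleanup-job", new LoggingJob());
        scheduler.trigger("main-job");
        return scheduler.log; // main-job first, then cleanup-job
    }
}
```

Keeping the successor's name in the job's own data means the chain survives wherever the job data is persisted, which is the point the FAQ makes about the listener approach needing extra care.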
I know this is an old question, but nevertheless there are 2 more options available to chain the execution of your jobs which people can find useful:
1) Use the JobChainingJobListener that is included in the standard Quartz distribution since very early releases. This listener allows you to programmatically define simple job chains using its addJobChainLink method.
2) Use a commercial solution such as QuartzDesk that I am the principal developer of. QuartzDesk contains a robust job chaining engine that allows you to externalize the definition of your job chains from the application code and enables you to update your job chains at runtime through a GUI without modifying, redeploying and restarting your application. A job chain can be associated with a particular job, trigger or it can be a global job chain that is executed whenever any of your jobs execute (useful for global job execution failure handlers etc.).
From http://www.quartz-scheduler.net/documentation/faq.html#how-do-i-chain-job-execution?-or,-how-do-i-create-a-workflow
How do I keep a Job from firing concurrently?
Quartz.NET 2.x
Implement IJob and also decorate your job class with
[DisallowConcurrentExecution] attribute. Read the API documentation
for DisallowConcurrentExecutionAttribute for more information.
The equivalent annotation, @DisallowConcurrentExecution, is available in the Java implementation.
Given the following facts, is there an existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment:
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with
its source ID, makes it uniquely identifiable.
3) The ids can be used to sort the events.
4) There are M application servers receiving those events.
M is normally 3.
5) The events can arrive at any one or more of the application
servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events
to process.
8) Each event has an earliest and a latest batch ID between which it
must be processed.
9) They must not be processed earlier, and are "failed" if they
cannot be processed before the deadline.
10) The batches are based on the real clock time. For example,
one batch per second.
11) The events of a batch are processed when 2 of the 3 servers
agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the
required events before it can process that batch too.
13) Once an event was processed or failed, the source has to be
informed.
14) [EDIT] Events from one source must be processed (or failed) in
the order of their ID/timestamp, but there is no causality
between different sources.
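To make requirement 11 concrete, here is a minimal sketch of the quorum check alone, not of a full agreement protocol. It assumes each server proposes the set of event IDs it holds for a batch, that IDs are strings of the hypothetical form "source:event", and that two proposals agree only when they contain exactly the same IDs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

final class BatchQuorumSketch {

    /**
     * Returns the event list proposed by at least `quorum` servers,
     * in sorted order, or an empty Optional if no proposal has enough votes.
     */
    static Optional<List<String>> agreedEvents(
            Collection<? extends Collection<String>> proposals, int quorum) {
        Map<List<String>, Integer> votes = new HashMap<>();
        for (Collection<String> proposal : proposals) {
            // Canonical form: sorted and de-duplicated, so arrival order is irrelevant.
            List<String> canonical = new ArrayList<>(new TreeSet<>(proposal));
            int count = votes.merge(canonical, 1, Integer::sum);
            if (count >= quorum) {
                return Optional.of(canonical);
            }
        }
        return Optional.empty();
    }
}
```

With M = 3 and a quorum of 2, the server whose proposal differs would then have to fetch the missing events before processing the batch (requirement 12); that exchange is not shown here.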
Less formally, I have servers that receive events. They start with the same initial state and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline. But I'm not sure whether that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably wasn't an API that satisfied your requirements at the time. Today you could take a look at Kafka (from LinkedIn):
Apache Kafka
And the general concept of "a log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
For your question, I'd actually begin with the blog post about "the log". In my terms (and Kafka isn't the only package that does log handling), it works as follows:
Instead of a queue based message-passing / publish-subscribe
Kafka uses a "log" of messages
Subscribers (or end-points) can consume the log
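The log is guaranteed to be "in-order" within a partition; it handles very large data volumes and is fast
Double check on the guarantee, there's usually a trade-off for reliability
You just read the log. Reads are not destructive: Kafka keeps records until the topic's retention limit is reached, whether or not they have been consumed.
Each consumer group tracks its own position in the log, so every group can read an entry before it expires.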
The basic handling (compute) process for the log, is a Map-Reduce-Filter model so you read-everything really fast; keep only stuff you want; process it (reduce) produce outcome(s).
The downside seems to be that you need clusters and such to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it a bit finicky to get up and running with the Apache downloads because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
Which would need you to do the plumbing for connecting different servers. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper organised message/log software(s). Good luck and Enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast, and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, some "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first server to run the job for a given batch fixes the set of events, so that the others just use the list that the first one defined. That way we have a guarantee that every batch contains the same set of events on all servers. And if we can impose a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.
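The "first one to run the job fixes the batch" rule can be sketched as follows. This is an assumption-laden illustration: a ConcurrentHashMap stands in for a DB table with a unique key on the batch ID, and BatchFixingSketch and fixBatch are made-up names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BatchFixingSketch {
    // Stand-in for a table keyed by batch ID (unique constraint on the key).
    private final ConcurrentMap<Long, List<String>> fixedBatches = new ConcurrentHashMap<>();

    /**
     * Called by the periodic job on every server. The first caller for a batch
     * fixes its event list (sorted, to give a total order); later callers get
     * that same list back, whatever events they happen to see themselves.
     */
    List<String> fixBatch(long batchId, Collection<String> visibleEvents) {
        List<String> candidate =
                Collections.unmodifiableList(new ArrayList<>(new TreeSet<>(visibleEvents)));
        List<String> existing = fixedBatches.putIfAbsent(batchId, candidate);
        return existing != null ? existing : candidate;
    }
}
```

In a real database the same effect would come from an insert that fails or is ignored when the batch row already exists, followed by a read of the stored list.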
Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency
possible for any given task via specially optimized notifications to the scheduler.
Thus, in the case that a queue has a large backlog of tasks, the
system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the
earliest time that a task can execute. App Engine always waits until
after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum
number of seconds to wait before executing a task. Countdown and eta
are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 second, 1 minute or 1 hour). The order of event processing has to be assured. I need to process events in timestamp order, and the timestamp is generated on an embedded device inside the vehicle.
What I have now?
A Rest servlet that is called by the consumer and creates a Task (Event data is on payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?
You need to approach this with three separate steps:
1) Implement a sharded counter to generate a monotonically increasing ID. As much as I would like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than your requirement allows.
2) Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from Step #1 as a tag. After adding the task, store the ID somewhere in your datastore.
3) Have your worker servlet lease Tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use the ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you have finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend running your task consumption on a Backend.
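The flow of the steps above can be modelled in plain Java to show the idea. This is not App Engine code: the map plays the role of the pull queue (tag to payload), the sorted set plays the role of the datastore index of pending IDs, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

final class TaggedPullQueueSketch {
    private final Map<String, String> pullQueue = new HashMap<>(); // tag -> task payload
    private final TreeSet<Long> pendingIds = new TreeSet<>();      // "datastore" index of IDs

    /** Adds a task tagged with its ID and records the ID. */
    void add(long id, String payload) {
        pullQueue.put(Long.toString(id), payload);
        pendingIds.add(id);
    }

    /**
     * Finds the earliest pending ID, leases the task carrying that tag,
     * then deletes both the ID and the task. Returns null when nothing is pending.
     */
    String processNext() {
        Long earliest = pendingIds.pollFirst(); // removes and returns the smallest ID
        if (earliest == null) {
            return null;
        }
        return pullQueue.remove(Long.toString(earliest));
    }
}
```

Tasks added in the order 3, 1, 2 come back as 1, 2, 3, which is the FIFO-by-ID behaviour the answer is after; the eventual-consistency caveat on the datastore query still applies in the real service.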
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be a viable way to generate monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use these in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.
I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.
Ok. This is how I've done it.
1) Rest servlet that is called from the consumer:
If Event sequence doesn't match Vehicle sequence (from datastore)
    Creates a task on a "wait" queue to call me again
else
    State validation
    Creates a task on the "regular" queue (Event data is on payload).
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue to do data maintenance without losing events.
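A rough in-memory model of that gate is shown below. It rests on assumptions that the post does not state: sequences start at 1 per vehicle, the expected sequence advances when an event is accepted, and "call me again" is modelled by an explicit retryWaiting() call. Class and field names are invented for the sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SequenceGateSketch {

    static final class Event {
        final String vehicleId;
        final long seq;
        final String payload;

        Event(String vehicleId, long seq, String payload) {
            this.vehicleId = vehicleId;
            this.seq = seq;
            this.payload = payload;
        }
    }

    private final Map<String, Long> nextSeq = new HashMap<>(); // vehicle -> expected sequence
    final List<String> regularQueue = new ArrayList<>();       // payloads in accepted order
    final Deque<Event> waitQueue = new ArrayDeque<>();         // events that arrived too early

    /** Accepts the event if it is the next one for its vehicle, otherwise parks it. */
    void receive(Event e) {
        long expected = nextSeq.getOrDefault(e.vehicleId, 1L);
        if (e.seq != expected) {
            waitQueue.add(e);
            return;
        }
        regularQueue.add(e.payload);
        nextSeq.put(e.vehicleId, expected + 1);
    }

    /** Re-submits every parked event once; events that are still early are parked again. */
    void retryWaiting() {
        List<Event> waiting = new ArrayList<>(waitQueue);
        waitQueue.clear();
        for (Event e : waiting) {
            receive(e);
        }
    }
}
```

A production version would also need a rule for duplicates and for events whose predecessor never arrives, since this sketch would park them indefinitely.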
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.
Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
When creating the task, don't put the data in the payload but in the datastore, in a Kind ordered by timestamp and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
Run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction.
Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. The latter leaves more options than the former.
In my Java app, sometimes my users do some work that requires a datastore write, but I don't want to keep the user waiting while the datastore is writing. I want to immediately return a response to the user while the data is stored in the background.
It seems fairly clear that I could do this by using GAE task queues, enqueueing a task to store the data. But I also see that there's an Async datastore API, which seems like it would be much easier than dealing with task queues.
Can I just call AsyncDatastoreService.put() and then return from my servlet? Will that API store my data without keeping my users waiting?
I think you are right that the Async calls seem easier. However, the docs for AsyncDatastore mention one caveat that you should consider:
Note: Exceptions are not thrown until you call the get() method. Calling this method allows you to verify that the asynchronous operation succeeded.
The "get" in that note is being called on the Future object returned by the async call. If you just return from your servlet without ever calling get on the Future object, you might not know for sure whether your put() worked.
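The behaviour described in that note is how java.util.concurrent.Future works in general, so it can be shown without the App Engine SDK. In this sketch, failingPut() is a hypothetical stand-in for an asynchronous put that fails; nothing is thrown when the future is handed back, only when get() is called.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class AsyncPutSketch {

    /** Hypothetical async write that fails, standing in for a failed put(). */
    static Future<String> failingPut() {
        CompletableFuture<String> pending = new CompletableFuture<>();
        pending.completeExceptionally(new IllegalStateException("datastore write failed"));
        return pending; // returning the Future does not throw
    }

    /** Waits for the write; returns null on success or the failure message otherwise. */
    static String waitAndReport(Future<String> pending) {
        try {
            pending.get(); // the failure only surfaces here
            return null;
        } catch (ExecutionException e) {
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```

A servlet that never calls get() would simply never see the IllegalStateException, which is the risk the answer points out.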
With a queued task, you can handle the error cases more explicitly, or just rely on the automatic retries. If all you want to queue is datastore puts, you should be able to create (or find) a utility class that does most of the work for you.
Unfortunately, there aren't any really good solutions here. You can enqueue a task, but there are several big problems with that:
Task payloads are limited in size, and that size is smaller than the entity size limit.
Writing a record to the datastore is actually pretty fast, in wall-clock time. A significant part of the cost, too, is serializing the data, which you have to do to add it to the task queue anyway.
By using the task queue, you're creating more eventual consistency - the user may come back and not see their changes applied, because the task has not yet executed. You may also be introducing transaction issues - how do you handle concurrent updates?
If something fails, it could take an arbitrarily long time to apply the user's updates. In such situations, it probably would have been better to simply return an error to the user.
My recommendation would be to use the async API where possible, but to always write to the datastore directly. Note that you need to wait on all your outstanding API calls, as Peter points out, or you won't know if they failed - and if you don't wait on them, the app server will, before returning a response to the user.
If all you need is for the user to have a responsive interface while stuff churns in the background on the DB, all you have to do is make an asynchronous call at the client level, i.e. do some Ajax that sends the DB write request, immediately changes the user's display, and then, in the Ajax request callback, updates the view with whatever you wish.
You can easily add GWT support to your GAE project (either via the Eclipse plugin or the Maven GAE plugin) and have the time of your life doing asynchronous stuff.