Thread-safe task batching

I'm using appengine servers. I expect to get many requests (dozens) in close proximity that will put some of my data in an inconsistent state. The cleanup of that data can be efficiently batched - for example, it would be best to run my cleanup code just once, after the dozens of requests have all completed. I don't know exactly how many requests there will be, or how close together they will be. It is OK if the cleanup code is run multiple times, but it must be run after the last request.
What's the best way to minimize the number of cleanup runs?
Here's my idea:
public void handleRequest() {
    manipulateData();
    if (memCacheHasCleanupToken()) {
        return; // yay, a cleanup is already scheduled
    } else {
        scheduleDeferredCleanup(5); // run the cleanup 5 seconds from now
        addCleanupTokenToMemCache();
    }
}
...
public void deferredCleanupMethod() {
    removeCleanupTokenFromMemcache();
    cleanupData();
}
I think this will break down because cleanupData might receive outdated data even after some request has found that there IS a cleanup token in the memcache (HRD latency, etc), so some data might be missed in the cleanup.
So, my questions:
Will this general strategy work? Maybe if I use a transactional lock on a datastore entity?
What strategy should I use?

The general strategy you suggest will work, provided the data that needs cleaning up isn't stored on each instance (e.g., it's in the datastore or memcache), and provided your scheduleDeferredCleanup method uses the task queue. An optimization would be to use task names that are based on the time interval in which they run, to avoid scheduling duplicate cleanups if the memcache key expires.
One issue to watch out for with the procedure you describe above, though, is race conditions. As stated, a request being processed at the same time as the cleanup task may check memcache, observe the token is there, and neglect to enqueue a cleanup task, whilst the cleanup task has already finished, but not yet removed the memcache key. The easiest way to avoid this is to make the memcache key expire on its own, but before the related task will execute. That way, you may schedule duplicate cleanup tasks, but you should never omit one that's required.
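To illustrate the time-interval task names mentioned above, here is a minimal sketch. It does not use the App Engine API: the in-memory set is only a stand-in for the task queue refusing a second task with the same name, and the class name, method names, and "cleanup-" prefix are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: derive the cleanup task name from the time interval in which it will run. */
final class CleanupTaskNames {
    // Stand-in for the task queue's duplicate-name check (an assumption of this sketch).
    private final Set<String> enqueuedNames = ConcurrentHashMap.newKeySet();
    private final long intervalMillis;

    CleanupTaskNames(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Every request arriving within the same interval maps to the same task name. */
    String taskNameFor(long nowMillis) {
        long bucket = nowMillis / intervalMillis + 1; // the cleanup runs in the next interval
        return "cleanup-" + bucket;
    }

    /** Returns true if a new cleanup was enqueued, false if that interval already has one. */
    boolean scheduleCleanup(long nowMillis) {
        return enqueuedNames.add(taskNameFor(nowMillis));
    }

    public static void main(String[] args) {
        CleanupTaskNames names = new CleanupTaskNames(5_000);
        System.out.println(names.scheduleCleanup(1_000)); // first request in the interval
        System.out.println(names.scheduleCleanup(4_999)); // same interval, so a duplicate
        System.out.println(names.scheduleCleanup(5_000)); // next interval
    }
}
```

With names like these, a request that finds the memcache token expired can simply try to enqueue again; at worst it produces one extra cleanup per interval rather than one per request.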

Related

Spring/Hibernate: Best option for dealing with Galera/MySql/Mariadb replication delay during asynchronous processing

In my service I have an endpoint that creates a resource. After creation the resource needs to be validated and otherwise "prepared" for future processing. To accomplish this my service creates the resource in the database, spawns an asynchronous thread to perform the validation, and then returns to the user.
Entry point:
@Override
public FooDto createFoo(FooDto fooDto) {
    FooDto retDto = fooService.createFoo(fooDto); // Annotated with @Transactional
    asyncFooService.initializeFooAsync(retDto.getFooId()); // Annotated with @Transactional and @Async
    return retDto;
}
Async call:
@Transactional
@Async
@Override
public void initializeFooAsync(String fooId) {
    Foo foo = fooRepository.findById(fooId);
    logger.info("Found foo with id={}", foo.getId());
    // More processing which can take a while to perform
}
I was careful to ensure that I have exited the transactional boundaries so that the commit would run before the async call would happen, and that the async method lives in a different bean than the entry method. So logically this should have no issues seeing the data from the first method in the second, and the second should be running asynchronously.
What I have noticed is that the log message in the async call is sometimes throwing a null pointer exception as foo is null. By the time I get notified of this I can see in the database that the wanted foo record exists.
My persistence layer consists of three MySQL or MariaDB replicas (depending on the environment) in a "master/master" configuration, so what I have deduced is that the insert done in fooService.createFoo is going to nodeA, and the select done by initializeFooAsync is going to nodeB, which has yet to apply the replicated write from nodeA. Further evidence for this: I have applied a patch which, in the initializeFooAsync method, checks for a null Foo and tries to find it again after 3 seconds. This patch has worked.
I'm looking for other, "cleaner" approaches that don't use Thread.sleep. The other approach that I thought of was using RMQ (which is available to me) and dead letter exchanges to create a delayed processing queue with a limited number of retries should Foo not be found (so if not found, try again in X ms, up to Y times). However, this approach is being frowned upon by the chief architect.
The other approach I see is to do more of the same, and just do more checks in initializeFooAsync at shorter intervals to minimize wait time. Regardless it would essentially be the same solution using Thread.sleep to deal with replication delay.
Doing the initialization inline with the creation is not possible as this is a specific requirement from product, and the initialization may end up taking what they consider a "significant" amount of time due to coordination.
Is there some other utility or tool in the Spring/Java ecosystem that can help me deliver a cleaner approach? Preferably something that doesn't rely on sleeping my thread.

How can I ensure that my Android app doesn't access a file simultaneously?

I am building a fitness app which continually logs activity on the device. I need to log quite often, but I also don't want to unnecessarily drain the battery of my users which is why I am thinking about batching network calls together and transmitting them all at once as soon as the radio is active, the device is connected to a WiFi or it is charging.
I am using a filesystem based approach to implement that. I persist the data first to a File - eventually I might use Tape from Square to do that - but here is where I encounter the first issues.
I am continually writing new log data to the File, but I also need to periodically send all the logged data to my backend. When that happens I delete the contents of the File. The problem now is how can I prevent both of those operations from happening at the same time? Of course it will cause problems if I try to write log data to the File at the same time as some other process is reading from the File and trying to delete its contents.
I am thinking about using an IntentService to essentially act as a queue for all those operations. And since, from what I have read, an IntentService handles Intents sequentially in a single worker Thread, it shouldn't be possible for two of those operations to happen at the same time, right?
Currently I want to schedule a PeriodicTask with the GcmNetworkManager which would take care of sending the data to the server. Is there any better way to do all this?

1) You are overthinking this whole thing!
Your approach is way more complicated than it has to be! And for some reason none of the other answers point this out, but GcmNetworkManager already does everything you are trying to implement! You don't need to implement anything yourself.
2) Optimal way to implement what you are trying to do.
You don't seem to be aware that GcmNetworkManager already batches calls in the most battery efficient way with automatic retries etc and it also persists the tasks across device boots and can ensure their execution as soon as is battery efficient and required by your app.
Whenever you have data to save, just schedule a OneoffTask like this:
final OneoffTask task = new OneoffTask.Builder()
        // The Service which executes the task.
        .setService(MyTaskService.class)
        // A tag which identifies the task
        .setTag(TASK_TAG)
        // Sets a time frame for the execution of this task in seconds.
        // This specifically means that the task can either be
        // executed right now, or must have executed at the latest in one hour.
        .setExecutionWindow(0L, 3600L)
        // Task is persisted on the disk, even across boots
        .setPersisted(true)
        // Unmetered connection required for task
        .setRequiredNetwork(Task.NETWORK_STATE_UNMETERED)
        // Attach data to the task in the form of a Bundle
        .setExtras(dataBundle)
        // If you set this to true and this task already exists
        // (just depends on the tag set above) then the old task
        // will be overwritten with this one.
        .setUpdateCurrent(true)
        // Sets if this task should only be executed when the device is charging
        .setRequiresCharging(false)
        .build();
mGcmNetworkManager.schedule(task);
This will do everything you want:
The Task will be persisted on the disk
The Task will be executed in a batched and battery efficient way, preferably over Wifi
You will have configurable automatic retries with a battery efficient backoff pattern
The Task will be executed within a time window you can specify.
I suggest for starters you read this to learn more about the GcmNetworkManager.
So to summarize:
All you really need to do is implement your network calls in a Service extending GcmTaskService, and later, whenever you need to perform such a network call, you schedule a OneoffTask and everything else will be taken care of for you!
Of course you don't need to call each and every setter of the OneoffTask.Builder like I do above - I just did that to show you all the options you have. In most cases scheduling a task would just look like this:
mGcmNetworkManager.schedule(new OneoffTask.Builder()
        .setService(MyTaskService.class)
        .setTag(TASK_TAG)
        .setExecutionWindow(0L, 300L)
        .setPersisted(true)
        .setExtras(bundle)
        .build());
And if you put that in a helper method, or even better create factory methods for all the different tasks you need to do, then everything you were trying to do should just boil down to a few lines of code!
And by the way: Yes, an IntentService handles every Intent one after another sequentially in a single worker Thread. You can look at the relevant implementation here. It's actually very simple and quite straightforward.

All UI and Service methods are by default invoked on the same main thread. Unless you explicitly create threads or use AsyncTask there is no concurrency in an Android application per se.
This means that all intents, alarms, and broadcasts are by default handled on the main thread.
Also note that doing I/O and/or network requests may be forbidden on the main thread (depending on Android version, see e.g. How to fix android.os.NetworkOnMainThreadException?).
Using AsyncTask or creating your own threads will bring you to concurrency problems but they are the same as with any multi-threaded programming, there is nothing special to Android there.
One more point to consider when doing concurrency is that background threads need to hold a WakeLock or the CPU may go to sleep.

Just some idea.
You could use a serial executor for your file operations, so that only one of them can execute at a time.
http://developer.android.com/reference/android/os/AsyncTask.html#SERIAL_EXECUTOR
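A rough sketch of the serial-executor idea in plain Java follows. It uses a single-thread executor from java.util.concurrent rather than AsyncTask.SERIAL_EXECUTOR, and an in-memory list stands in for the log file; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: funnel every log operation through one worker thread so they never overlap. */
final class SerialLogStore {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for the log file; it is only ever touched from the worker thread.
    private final List<String> pending = new ArrayList<>();

    /** Queue an append; it runs after all previously submitted operations. */
    CompletableFuture<Void> append(String entry) {
        return CompletableFuture.runAsync(() -> pending.add(entry), worker);
    }

    /** Queue a "read everything, then clear" step and hand back what was read. */
    CompletableFuture<List<String>> drain() {
        return CompletableFuture.supplyAsync(() -> {
            List<String> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }, worker);
    }

    /** Let the worker thread exit once the queued operations have finished. */
    void shutdown() {
        worker.shutdown();
    }
}
```

Because appends and drains share one queue, a drain sees exactly the entries appended before it was submitted, and an append submitted afterwards lands in the next batch.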

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of the event processing has to be assured. I need to process events in timestamp order, and the timestamp is generated on an embedded device inside the vehicle.
What do I have now?
A Rest servlet that is called by the consumer and creates a Task (Event data is on payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?

You need to approach this with three separate steps:
1) Implement a Sharding Counter to generate a monotonically increasing ID. As much as I like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than what your requirement is.
2) Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from Step #1 as a tag. After adding the task, store the ID somewhere on your datastore.
3) Have your worker servlet lease Tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use the ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumptions on the Backend.
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be a viable way to generate monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use these in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.

I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.

Ok. This is how I've done it.
1) Rest servlet that is called from the consumer:
    If Event sequence doesn't match Vehicle sequence (from datastore)
        Creates a task on a "wait" queue to call me again
    else
        State validation
        Creates a task on the "regular" queue (Event data is on payload).
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do a data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
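For readers who want to see the routing rule spelled out, here is a small in-memory sketch of it. It is not the original code: the Event shape, the assumption that sequences start at 0 per vehicle, and the choice to advance the expected sequence at enqueue time are all assumptions made for the example, and plain collections stand in for the two task queues and the datastore.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Sketch: route in-order events to a "regular" queue and park the rest on a "wait" queue. */
final class SequenceGate {
    /** Hypothetical event shape: a vehicle id plus a per-vehicle sequence number. */
    static final class Event {
        final String vehicleId;
        final long sequence;

        Event(String vehicleId, long sequence) {
            this.vehicleId = vehicleId;
            this.sequence = sequence;
        }
    }

    // Stand-in for the per-vehicle sequence kept in the datastore (assumed to start at 0).
    private final Map<String, Long> nextExpected = new HashMap<>();
    private final Queue<Event> waitQueue = new ArrayDeque<>();
    private final List<Long> regularQueue = new ArrayList<>();

    /** Enqueue on "regular" if the event is the next one expected, otherwise park it. */
    void accept(Event event) {
        long expected = nextExpected.getOrDefault(event.vehicleId, 0L);
        if (event.sequence != expected) {
            waitQueue.add(event);
            return;
        }
        regularQueue.add(event.sequence);
        nextExpected.put(event.vehicleId, expected + 1);
    }

    /** Give every parked event one more chance, as the "call me again" task would. */
    void retryWaiting() {
        int parked = waitQueue.size();
        for (int i = 0; i < parked; i++) {
            accept(waitQueue.remove());
        }
    }

    List<Long> regularQueue() {
        return regularQueue;
    }

    int waitingCount() {
        return waitQueue.size();
    }
}
```

Note that in this sketch an event whose sequence is lower than expected (a duplicate, say) would stay parked forever, so a real version would need a rule for dropping or reporting such events.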

You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.

Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.

The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
when creating the task, don't put the data on the payload but in the datastore, in a Kind with an ordering on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction.
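The two steps above can be sketched with plain collections. This is only an in-memory analogy: a TreeMap per vehicle stands in for child entities ordered by client timestamp, synchronized methods stand in for the datastore transaction, and all names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: keep each vehicle's events sorted by client timestamp and consume the oldest first. */
final class OrderedEventStore {
    // vehicle id -> (client timestamp -> payload), sorted by timestamp.
    private final Map<String, TreeMap<Long, String>> eventsByVehicle = new HashMap<>();

    /** Store the event data instead of putting it on the task payload. */
    synchronized void store(String vehicleId, long clientTimestamp, String payload) {
        eventsByVehicle
                .computeIfAbsent(vehicleId, id -> new TreeMap<>())
                .put(clientTimestamp, payload);
    }

    /** Fetch and delete the event with the earliest timestamp; null if none is left. */
    synchronized String takeOldest(String vehicleId) {
        TreeMap<Long, String> events = eventsByVehicle.get(vehicleId);
        if (events == null) {
            return null;
        }
        Map.Entry<Long, String> oldest = events.pollFirstEntry();
        return oldest == null ? null : oldest.getValue();
    }
}
```

The generic task then needs no payload at all: whichever order the tasks run in, each one just calls the equivalent of takeOldest for its vehicle.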

Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. Latter has more options vs. former.

locking DB records for concurrency between threads

This kind of thing has been done a million times I'm sure, but my search foo appears weak today, and I'd like to get opinions on what is generally considered the best way to accomplish this goal.
My application keeps track of sessions for online users in a system. Each session corresponds to a single record in a database. A session can be ended in one of two ways. Either a "stop" message is received, or the session can timeout. The former case is easy, it is handled in the message processing thread and everything is fine. The latter case is where the concern comes from.
In order to process timeouts, each record has an ending time column that is updated each time a message is received for that session. To make timeouts work, I have a thread that returns all records from the database whose endtime < NOW() (has an end time in the past), and goes through the processing to close those sessions. The problem here is that it's possible that I might receive a message for a session while the timeout thread is going through processing for the same session. I end up with a race between the timeout thread and message processing thread.
I could use a semaphore or the like and just prevent the message thread from processing while timeout is taking place as it only needs to run every 30 seconds or a minute. However, as the user table gets large this is going to run into some serious performance issues. What I think I would like is a way to know in the message thread that this record is currently being processed by the timeout thread. If I could achieve that I could either discard the message or wait for timeout thread to end but only in the case of conflicts now instead of always.
Currently my application uses JDBC directly. Would there be an easier/standard method for solving this issue if I used a framework such as Hibernate?

This is a great opportunity for all kinds of crazy bugs to occur, and some of the cures can cause performance issues.
The classic solution would be to use transactions (http://dev.mysql.com/doc/refman/5.0/en/commit.html). This allows you to guarantee the consistency of your data - but a long-running transaction on the database turns it into a huge bottleneck; if your "find timed-out sessions" code runs for a minute, the transaction may run for that entire period, effectively locking write access to the affected table(s). Most systems would not deal well with this.
My favoured solution for this kind of situation is to have a "state machine" for status; I like to implement this as a history table, but that does tend to lead to a rapidly growing database.
You define the states of a session as "initiated", "running", "timed-out - closing", "timed-out - closed", and "stopped by user" (for example).
You implement code which honours the state transition logic in whatever data access logic you've got. The pseudo code for your "clean-up" script might then be:
update all records whose endtime < now() and whose status is "running", set status = "timed-out - closing"
for each record whose status is "timed-out - closing"
    do whatever other stuff you need to do
    update that record to set status "timed-out - closed" where status = "timed-out - closing"
next record
All other attempts to modify the current state of the session record must check that the current status is valid for the attempted change.
For instance, the "manual" stop code should be something like this:
update sessions
set status = 'stopped by user'
where session_id = xxxxx
and status = 'running'
If the auto-close routine has kicked off in the time between showing the user interface and the database code, the where clause won't match any records, so the rest of the code simply doesn't run.
For this to work, all code that modifies the session status must check its pre-conditions; the most maintainable way is to encode status and allowed transitions into a separate database table.
You could also write triggers to enforce this logic, though I'm normally not a fan of triggers - only do this if you have to.
I don't think this adds significant performance worries - but test and optimize. The majority of the extra work on the database is by adding extra "where" clauses to your update statements; assuming you have an index on status, it's unlikely to have a measurable impact.
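As an in-memory analogy of those guarded updates, the sketch below uses ConcurrentMap.replace(key, expected, next), which only succeeds if the current value still matches. The class and enum names are hypothetical. With JDBC, the corresponding signal is the row count returned by executeUpdate: zero rows means the "where status = ..." clause no longer matched.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Sketch: a status change only succeeds if the session is still in the expected state. */
final class SessionStates {
    enum Status { RUNNING, TIMED_OUT_CLOSING, TIMED_OUT_CLOSED, STOPPED_BY_USER }

    private final ConcurrentMap<String, Status> sessions = new ConcurrentHashMap<>();

    void start(String sessionId) {
        sessions.put(sessionId, Status.RUNNING);
    }

    /** Same idea as: update ... set status = next where session_id = ? and status = expected. */
    boolean transition(String sessionId, Status expected, Status next) {
        return sessions.replace(sessionId, expected, next);
    }

    Status statusOf(String sessionId) {
        return sessions.get(sessionId);
    }
}
```

If the timeout thread has already moved a session to TIMED_OUT_CLOSING, the message thread's attempt to go from RUNNING to STOPPED_BY_USER returns false, and it can discard the message or retry later instead of racing.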

Write to GAE datastore asynchronously

In my Java app, sometimes my users do some work that requires a datastore write, but I don't want to keep the user waiting while the datastore is writing. I want to immediately return a response to the user while the data is stored in the background.
It seems fairly clear that I could do this by using GAE task queues, enqueueing a task to store the data. But I also see that there's an Async datastore API, which seems like it would be much easier than dealing with task queues.
Can I just call AsyncDatastoreService.put() and then return from my servlet? Will that API store my data without keeping my users waiting?

I think you are right that the Async calls seem easier. However, the docs for AsyncDatastore mention one caveat that you should consider:
Note: Exceptions are not thrown until you call the get() method. Calling this method allows you to verify that the asynchronous operation succeeded.
The "get" in that note is being called on the Future object returned by the async call. If you just return from your servlet without ever calling get on the Future object, you might not know for sure whether your put() worked.
With a queued task, you can handle the error cases more explicitly, or just rely on the automatic retries. If all you want to queue is datastore puts, you should be able to create (or find) a utility class that does most of the work for you.
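To make the Future caveat concrete, here is a small stand-alone sketch. It does not use the datastore API: a CompletableFuture that has already failed plays the role of the Future returned by an async put, and the helper names are invented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Sketch: a failed asynchronous write stays silent until someone calls get() on its Future. */
final class AsyncWriteCheck {
    /** Stand-in for an async put whose write failed in the background. */
    static Future<String> failingPut() {
        CompletableFuture<String> result = new CompletableFuture<>();
        result.completeExceptionally(new IllegalStateException("write failed"));
        return result; // no exception reaches the caller here
    }

    /** Waits for the write; returns null on success or the failure message otherwise. */
    static String waitFor(Future<String> pendingWrite) {
        try {
            pendingWrite.get();
            return null;
        } catch (ExecutionException e) {
            // The original failure is wrapped as the cause.
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```

Calling failingPut() on its own completes normally; only waitFor (that is, get) reveals that the write did not succeed.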

Unfortunately, there aren't any really good solutions here. You can enqueue a task, but there are several big problems with that:
Task payloads are limited in size, and that size is smaller than the entity size limit.
Writing a record to the datastore is actually pretty fast, in wall-clock time. A significant part of the cost, too, is serializing the data, which you have to do to add it to the task queue anyway.
By using the task queue, you're creating more eventual consistency - the user may come back and not see their changes applied, because the task has not yet executed. You may also be introducing transaction issues - how do you handle concurrent updates?
If something fails, it could take an arbitrarily long time to apply the user's updates. In such situations, it probably would have been better to simply return an error to the user.
My recommendation would be to use the async API where possible, but to always write to the datastore directly. Note that you need to wait on all your outstanding API calls, as Peter points out, or you won't know if they failed - and if you don't wait on them, the app server will, before returning a response to the user.

If all you need is for the user to have a responsive interface while stuff churns in the background on the db, all you have to do is make an asynchronous call at the client level, i.e. do some ajax that sends the db write request, immediately changes the user's display, and then, in the ajax callback, updates the view with whatever you wish.
You can easily add GWT support to your GAE project (either via the Eclipse plugin or the Maven GAE plugin) and have the time of your life doing asynchronous stuff.
