I have a task worker written in Java, using a MongoDB 3.4 replica set, that runs many threads, each doing essentially this:
1. Run a task.
2. Signal that the task is complete by updating a document for that task in MongoDB.
3. Run a query to see if all the tasks in this set of tasks are done.
4. If so, continue to the next stage of processing.
5. Otherwise, do nothing.
As you may be able to see, there is a race condition here; multiple tasks can all finish at about the same time and think that they are the last task to complete. I want to use MongoDB to make sure only one of those tasks is allowed to start the next stage of processing.
I have the following code that is meant to ensure that only one of those tasks can continue (I'm using Jongo to interface with MongoDB).
Chipset modified = chipsets
.findAndModify("{_id: #, status: {$ne: #}}", new Object[] { chipset.getId(), Chipset.Status.Queued })
.with("{$set: {status: #}}", new Object[] { Chipset.Status.Queued })
.returnNew().as(Chipset.class);
if (modified != null)
    runNextProcessingStep();
Pretty simple here; I'm just using findAndModify to change the status of the Chipset (set of tasks) to Queued. The one that successfully makes the change gets to execute runNextProcessingStep().
Or that's how I think it should work. In reality, several tasks, even ones that finish 2 seconds apart, somehow get back a non-null modified. As I understand it, MongoDB locks the document while running findAndModify, so a non-null document should be returned no more than once.
I've read "Linearizable Reads via findAndModify" and have implemented everything it describes. I've set the connection's write concern to Majority and the read concern to Linearizable. I've created a unique compound index on _id and status. Still nothing. Perhaps I have misunderstood how findAndModify actually behaves? What am I doing wrong?
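For comparison, the same gate written with the plain MongoDB Java driver looks like the sketch below (this mirrors the Jongo snippet above; the MongoCollection<Document> named chipsets, the chipsetId variable and the string status values are assumptions standing in for the asker's types):

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.ne;
import static com.mongodb.client.model.Updates.set;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import org.bson.Document;

// Atomically flip status to "Queued"; only one caller can match the filter,
// so only one caller gets a non-null result back.
Document claimed = chipsets.findOneAndUpdate(
        and(eq("_id", chipsetId), ne("status", "Queued")),
        set("status", "Queued"),
        new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));
if (claimed != null) {
    runNextProcessingStep();
}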
Well, this is embarrassing but in the interest of being a good internet citizen I'll update this with what happened. There was another thread that was changing statuses out from under me. I had convinced myself this couldn't be the case but, well, concurrency can be a real pain sometimes. findAndModify works exactly how I thought it should.
I have a scenario where two functionalities run in parallel.
Below is sample pseudocode:
MainActor {
    // retrieve company ids
    // for each company id I need to run two different actions simultaneously
    tell(A_Actor)
    tell(B_Actor)
    // If I call the above, they are called sequentially, i.e. first it runs
    // tell(A_Actor), then it comes to tell(B_Actor).
    // If tell(A_Actor) fails, it won't run tell(B_Actor).
}
A_Actor {
    // do ingest into a file
}
B_Actor {
    // do ingest into a DB
}
Question:
How do I make the two calls, i.e. tell(A_Actor) and tell(B_Actor), run in parallel?
The tell method is asynchronous. When you fire a tell to actorA, it doesn't wait until actorA finishes or crashes before executing the next action, which here is to tell actorB.
If you need to parallelize the two tell methods themselves, then you can do the following:
val tellActions = Vector(() => actorA.tell(messageA, senderActor), () => actorB.tell(messageB, senderActor))
tellActions.par.foreach(_.apply())
Note that this is Scala code.
This has been pointed out in several comments (including mine), but I felt it deserved an answer.
In short, you need to distinguish between calling the tell method in parallel and having the functionality that the actors execute within their receive methods run in parallel. The functionality will be executed in parallel automatically, and calling the tell method in parallel doesn't make any sense.
The code you show will execute the ingest into a file and the ingest into the DB in parallel. This is automatic and requires no action on your part; this is how actors and tell work. And, despite what you say, if something goes wrong with the file ingestion it will not affect the ingestion into the DB. (Assuming you built the actors and messages correctly, since you don't list their implementation.)
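To make that concrete, here is a minimal classic-Akka sketch in Java (the actor classes and the message type are assumptions, since the question doesn't show the real implementation). Both tell calls return immediately, and the two ingests then run concurrently on the dispatcher's thread pool:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

class AActor extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, companyId ->
                        System.out.println("ingesting " + companyId + " into a file"))
                .build();
    }
}

class BActor extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, companyId ->
                        System.out.println("ingesting " + companyId + " into the DB"))
                .build();
    }
}

public class Main {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef a = system.actorOf(Props.create(AActor.class));
        ActorRef b = system.actorOf(Props.create(BActor.class));
        // Neither call blocks; each actor processes its message on its own thread.
        a.tell("company-1", ActorRef.noSender());
        b.tell("company-1", ActorRef.noSender());
    }
}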
The tell method is asynchronous: it returns almost immediately and doesn't perform the actual logic (ingestion in this case); the only thing it does is place the message in the recipient's mailbox. Ismail's answer, in theory, shows how you could "invoke tell" in parallel, but in that example you are creating the array used for the parallel tells sequentially anyway, and the whole process will be very inefficient. His code, while technically doing what you ask, is nonsensical in practice: it accomplishes nothing except slowing the code down significantly.
In short, I think you either:
have something fundamentally wrong with your actors and how you are calling them, or
are actually executing the functionality in parallel and just aren't realizing it, because you are measuring or observing something incorrectly.
I have an issue that's very weird to me.
I'm writing a multithreaded client-server framework, and so far, it works pretty well, except one thing.
Consider the following image:
The client can request tasks, which are added to a queue. If any elements are in this queue, they are polled and added to an "executing" queue. The task in question is then executed on a separate thread using an ExecutorService.
The "executing" queue is checked for tasks that has completed whatever it is they're doing, and moved to a "completed" queue. This queue is checked, and replies are dispatched to the appropriate clients.
This all works, except when more than one task is running in the system.
Each task is held in a TaskRequest object and each task can refer back to its "host" TaskRequest. However, it appears as though the reference held by the task is different from the reference of the... well... actual TaskRequest.
On the image, I've highlighted the TaskRequest and the resultBag to show they have different addresses and IDs.
This, as I mentioned, is only the case when more than one task is in the system and it is beyond flabbergasting to me.
The complete field is not updating, despite my having checked this by outputting the value of the variable after it is set.
Why is the "host" object not updating?
Below is the code for the classes in question, Pastebinned to reduce space.
TaskRequest code
TaskBase code (extended by other tasks)
TaskQueue code (keeping of all the tasks, requested, executing and completed, respectively)
TaskExecutor code (running over the TaskQueue instance, executing tasks, etc)
I am sorry for posting a bit of a wall of text.
I'm using JDBC and need to constantly check the database against changing values.
What I have currently is an infinite loop running, with an inner loop iterating over changing values, each iteration checking against the database.
public void runInBG() { // this method is called from another thread
    while (true) { // busy loop: spins at full speed even when nothing has changed
        while (els.hasElements()) {
            Test el = (Test) els.next();
            // Note: concatenating the id into the SQL string is open to SQL
            // injection; a PreparedStatement with a bind parameter is safer.
            String sql = "SELECT * FROM Test WHERE id = '" + el.getId() + "'";
            // this helper makes the connection, calls executeQuery(), etc.,
            // and returns a Record object with the values
            Record r = db.getTestRecord(sql);
            if (r != null) {
                // do something
            }
        }
    }
}
I think this isn't the best way.
The other way I'm considering is the reverse: keep iterating over the database instead.
UPDATE
Thank you for the feedback regarding timers, but I don't think it will solve my problem.
Once a change occurs in the database, I need to process the results almost instantaneously against the changing values ("els" from the example code).
Even if the database does not change, it still has to check constantly against the changing values.
UPDATE 2
OK, to anyone interested in the answer: I believe I have the solution now. Basically, the solution is NOT to use the database for this. Load in, update, add, etc. only what's needed from the database into memory.
That way you don't have to open and close the database constantly; you only deal with the database when you make a change to it, reflect those changes back into memory, and only deal with whatever is in memory at the time.
Sure, this is more memory intensive, but performance is absolutely key here.
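A minimal write-through sketch of that idea (the Db and Record types are assumptions standing in for the asker's own classes):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Db { void save(Record r); } // stand-in for the existing DB helper

class Record {
    private final String id;
    Record(String id) { this.id = id; }
    String getId() { return id; }
}

// Keep the records of interest in a concurrent map; write through to the DB
// only when something actually changes, and poll the map instead of the DB.
class RecordCache {
    private final Map<String, Record> byId = new ConcurrentHashMap<>();
    private final Db db;

    RecordCache(Db db) { this.db = db; }

    Record get(String id) { return byId.get(id); }

    void update(Record r) {
        db.save(r);             // persist the change...
        byId.put(r.getId(), r); // ...and reflect it in memory immediately
    }
}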
As to the periodic "timer" answers, I'm sorry, but this is not right at all. Nobody has explained how the use of timers would solve this particular situation.
But thank you again for the feedback; it was helpful nevertheless.
Another possibility would be using ScheduledThreadPoolExecutor.
You could implement a Runnable containing your logic and register it with the ScheduledExecutorService as follows:
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(10);
executor.scheduleAtFixedRate(myRunnable, 0, 5, TimeUnit.SECONDS);
The code above creates a ScheduledThreadPoolExecutor with 10 threads in its pool and registers a Runnable with it that will run every 5 seconds, starting immediately.
To schedule your runnable you could use:
scheduleAtFixedRate
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given period; that is, executions will commence after initialDelay, then initialDelay + period, then initialDelay + 2 * period, and so on.
scheduleWithFixedDelay
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next.
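For the asker's DB check, scheduleWithFixedDelay may be the safer of the two, since a slow check can never overlap the next one. Reusing the executor and myRunnable from the snippet above, a sketch:

// Waits 5 seconds after one run finishes before starting the next,
// so long-running DB checks never pile up.
executor.scheduleWithFixedDelay(myRunnable, 0, 5, TimeUnit.SECONDS);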
Here you can see the advantages of ThreadPoolExecutor and whether it fits your requirements. I also recommend the question Java Timer vs ExecutorService? to help you make a good decision.
Keeping the while(true) in runInBG() is a bad idea. You'd better remove that. Instead you can have a scheduler/timer (use Timer and TimerTask) that calls runInBG() periodically and checks for updates in the DB.
You could use a Timer:
import java.util.Timer;

Timer timer = new Timer("runInBG");
// MyClass must extend java.util.TimerTask; its run() method contains the repeated logic.
MyClass t = new MyClass();
timer.schedule(t, 0, 2000); // start immediately, repeat every 2000 ms
As you said in the comment above, if the application controls the updates and inserts, then you can create a framework that notifies the 'BG' thread or process about changes in the database. Notification can be over the network via JMS, or intra-VM using the observer pattern, or both local and remote.
You can have a generic notification message like this (it can be a class for local notification or a text message for remote notifications):
<Notification>
  <Type>update/insert</Type>
  <Entity>
    <Name>Account/Customer</Name>
    <Id>id</Id>
  </Entity>
</Notification>
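For the intra-VM case, a minimal observer-pattern sketch (all class names here are assumptions): the data-access layer publishes a notification after each successful insert or update, and the 'BG' worker subscribes instead of polling.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ChangeNotification {
    final String type;   // "update" or "insert"
    final String entity; // e.g. "Account" or "Customer"
    final String id;
    ChangeNotification(String type, String entity, String id) {
        this.type = type;
        this.entity = entity;
        this.id = id;
    }
}

interface ChangeListener {
    void onChange(ChangeNotification n);
}

class ChangePublisher {
    private final List<ChangeListener> listeners = new CopyOnWriteArrayList<>();

    void subscribe(ChangeListener l) { listeners.add(l); }

    void publish(ChangeNotification n) {
        for (ChangeListener l : listeners) {
            l.onChange(n); // deliver to every subscriber, e.g. the 'BG' worker
        }
    }
}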
To avoid a 'busy loop', I would try to use triggers. H2 also supports a DatabaseEventListener API; that way you wouldn't have to create a trigger for each table.
This may not always work, for example if you use a remote connection.
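For example, a trigger sketch might look like the following (class and table names are assumptions; the body of fire() would wake your worker through whatever notification mechanism you use). It would be registered with: CREATE TRIGGER TEST_NOTIFY AFTER INSERT, UPDATE ON TEST FOR EACH ROW CALL "com.example.TestChangeTrigger";

import java.sql.Connection;
import java.sql.SQLException;
import org.h2.api.Trigger;

public class TestChangeTrigger implements Trigger {
    @Override
    public void init(Connection conn, String schemaName, String triggerName,
                     String tableName, boolean before, int type) {
        // nothing to set up in this sketch
    }

    @Override
    public void fire(Connection conn, Object[] oldRow, Object[] newRow) throws SQLException {
        // Called once per modified row: notify the background worker here
        // instead of letting it busy-poll the table.
    }

    @Override
    public void close() {}

    @Override
    public void remove() {}
}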
Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
- The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
- The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
- The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
- The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 second, 1 minute or 1 hour). The order of event processing has to be assured: I need to process by timestamp order, and the timestamp is generated by an embedded device inside the vehicle.
What do I have now?
A REST servlet that is called by the consumer and creates a Task (the Event data is in the payload).
After this, a worker servlet gets this Task and:
1. Deserializes the Event data;
2. Puts the Event in the Datastore;
3. Updates the Vehicle in the Datastore.
So, again, is there any way to assure strict FIFO behavior? Or how can I improve this solution to achieve it?
You need to approach this with three separate steps:
1. Implement a sharding counter to generate a monotonically increasing ID. As much as I like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than your requirement allows.
2. Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from step #1 as a tag. After adding the task, store the ID somewhere in your datastore.
3. Have your worker servlet lease tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use that ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you have finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumption on a Backend.
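A sketch of steps 2 and 3 with the GAE task-queue API (the queue name and the tag scheme are assumptions):

import java.util.List;
import java.util.concurrent.TimeUnit;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;

public class FifoQueueSketch {
    private static final Queue PULL_QUEUE = QueueFactory.getQueue("vehicle-events");

    // Step 2: enqueue the event as a pull task, tagged with the ordering ID.
    public static void enqueue(long orderingId, byte[] eventPayload) {
        PULL_QUEUE.add(TaskOptions.Builder
                .withMethod(TaskOptions.Method.PULL)
                .tag(Long.toString(orderingId))
                .payload(eventPayload));
    }

    // Step 3: lease exactly the task whose ID the datastore says comes next.
    public static List<TaskHandle> leaseNext(long nextId) {
        return PULL_QUEUE.leaseTasksByTag(60, TimeUnit.SECONDS, 1, Long.toString(nextId));
    }
}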
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem viable for generating monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use those in lieu of a monotonically increasing ID?
Regardless of how the IDs are generated, the basic idea is to use the datastore's query mechanism to produce a FIFO ordering of tasks, and to use the task tag to pull a specific task from the TaskQueue.
There is a caveat, though. Due to the eventually consistent read policy on high-replication datastores, if you choose HRD as your datastore (and you should; M/S is deprecated as of April 4th, 2012), the query in step #2 might return some stale data.
I think the simple answer is "no"; however, partly in order to improve the situation, I am using a pull queue, pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore, then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and end of each batch, because they might be out of order with interleaving tasks in other batches.
Ok. This is how I've done it.
1) REST servlet that is called by the consumer:
if the Event sequence doesn't match the Vehicle sequence (from the datastore)
    create a task on a "wait" queue to call me again
else
    validate state
    create a task on the "regular" queue (the Event data is in the payload)
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudocode as before)
This way I can pause the "regular" queue in order to do data maintenance without losing events.
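A sketch of the "wait" branch (the queue name, servlet URL and delay are assumptions): an out-of-sequence event is re-enqueued with a countdown, so the REST servlet is effectively called again a few seconds later.

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class WaitQueueSketch {
    static void requeueForLater(byte[] eventPayload) {
        Queue waitQueue = QueueFactory.getQueue("wait");
        waitQueue.add(TaskOptions.Builder
                .withUrl("/rest/event")   // hypothetical path of the REST servlet
                .payload(eventPayload)    // same Event data as the original request
                .countdownMillis(5000));  // retry the sequence check in ~5 seconds
    }
}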
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in rows in the datastore with a creation timestamp and then fetch work tasks ordered by that timestamp, but if your tasks are created too quickly you will run into latency issues.
I don't know the answer myself, but it may be possible that tasks enqueued using a deferred function execute in the order submitted. You will likely need an engineer from G. to get an answer. Pull queues, as suggested, seem a good alternative; they would also allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way", is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
When creating the task, don't put the data in the payload but in the datastore, in a Kind ordered by timestamp and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
Run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction.
Following this thread, I am unclear as to whether the strict FIFO requirement applies to all transactions received or on a per-vehicle basis. The latter leaves more options than the former.
I want to iterate over records in the database and update them. However, since the updating both takes some time and is prone to errors, I need to a) not keep the db waiting (as, e.g., with a ScrollableResults) and b) commit after each update.
The second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B gets another one.
How can I implement this sensibly with Hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while(iter.hasNext()){
Record rec = iter.next();
// do something lengthy here
db.save(rec);
}
So my question is how to implement the RecordIterator. If I perform a query on every next(), how do I ensure that I don't return the same record twice? If I don't query each time, which query should I use to return detached objects? Is there a flaw in the general approach (e.g. use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, the status of records can change while it runs. As a result, the ordering of a query's results can change between calls. I guess that to solve this problem I will have to mark records in the database once I return them for processing...
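A sketch of that marking idea (the status values, the claimNext helper and the Record mapping details are assumptions; LockMode.PESSIMISTIC_WRITE assumes Hibernate 3.6 or later): each thread claims one open record inside a short transaction and commits the mark before doing the lengthy work, so no two threads ever get the same record.

import org.hibernate.LockMode;
import org.hibernate.Session;
import org.hibernate.Transaction;

public class RecordClaimer {
    Record claimNext(Session session) {
        Transaction tx = session.beginTransaction();
        Record rec = (Record) session.createQuery(
                "from Record r where r.status = :open order by r.id")
                .setParameter("open", Status.OPEN)
                .setMaxResults(1)
                .setLockMode("r", LockMode.PESSIMISTIC_WRITE) // block competing claimers
                .uniqueResult();
        if (rec != null) {
            rec.setStatus(Status.IN_PROGRESS); // the mark mentioned in the update above
        }
        tx.commit(); // from here on, other threads will skip this record
        return rec;
    }
}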
Hmmm, what about pushing your objects from a reader thread into some bounded blocking queue, and letting your updater threads read from that queue?
In your reader, do some paging with setFirstResult/setMaxResults. E.g., if you have at most 1000 elements in your queue, fill it up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
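A sketch of that reader (the Record entity name and the page size are assumptions): it pages through the table and feeds a bounded queue, and put() blocks while the queue is full, which throttles the reader to the updaters' pace.

import java.util.List;
import java.util.concurrent.BlockingQueue;
import org.hibernate.Session;

public class PagingReader {
    void readPages(Session session, BlockingQueue<Record> queue) throws InterruptedException {
        final int PAGE = 500;
        for (int first = 0; ; first += PAGE) {
            @SuppressWarnings("unchecked")
            List<Record> page = session.createQuery("from Record order by id")
                    .setFirstResult(first)
                    .setMaxResults(PAGE)
                    .list();
            if (page.isEmpty()) {
                return; // no more records to hand out
            }
            for (Record r : page) {
                queue.put(r); // blocks while the queue is full
            }
        }
    }
}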
My suggestion, since you're sharing an instance of the master iterator, would be to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. Load all of your data into a single Set, which your threads can iterate over (be careful of locking; you might want to split off a section for each thread, or somehow manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction; they are held in Hibernate's cache and written back to the database all at once at the end. This saves on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running or long-running process, then my advice for a Hibernate solution would be to work in smaller sets and, yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up the ones that haven't been touched.