Hazelcast best practices to failover in parallel processing

Hazelcast best practices to failover in parallel processing - java

I'm new to Hazelcast. So have a question about best failure handling practices during parallel processing:
Mastering Hazelcast, section 6.6, p. 96:
Work-queue has no high availability: Each member will create one or
more local ThreadPoolExecutors with ordinary work-queues that do the
real work. When a task is submitted, it will be put on the work-queue
of that ThreadPoolExecutor and will not be backed up by Hazelcast. If
something would happen with that member, all unprocessed work will be
lost.
Task:
Suppose I've got 1 master node and 2 slaves. I launch time consuming task with
executor.submitToAllMembers (new TimeConsumingTask())
So each node is processing something. And while they all are processing something one of the slaves fails
Questions:
That's not possible to rerun the failed member work on another node, right?
Is there any other (preferably better) approach than rerun the whole job set across the whole cluster? (In case if TimeConsumingTask is Runnable)
Is there any other (preferably better) approach than rerun the whole job set across the whole cluster? (In case if TimeConsumingTask is Callable and I want to get a Future as a cluster computation result)

I'm assuming by 'failure handling' you're talking about the scenario where a node in the cluster goes down....
Question 1 Not automatically. You are right in assuming that Hazelcast's execution tasks are not fault tolerant. However, if you were able to handle the failure of a task, I can't see a reason why you couldn't resubmit the work to another member in the cluster.
Question 2 It's difficult to know what your TimeConsumingTask is actually doing - as with any distributed execution engine, it's generally better to compose the long running task as a series of smaller tasks. If you can't compose your task as smaller elements, then no - there's not really a better approach than resubmitting the whole job again
Question 3 The same thing applies to this question as question 2. Returning a Future from a task submission is not going to help you massively if a node fails. Futures provide you with the ability to wait (optionally for a specified timeout period) on the result and provide the possibility of cancelling the task.
Generally, for handling a node failing I would take a look to see whether an ExecutionCallback would help - in this case you get notified on a failure, which I am currently assuming that a node failure falls under this. When your callback is notified of the failure, you could resubmit the job.
You might also want to look at some other approaches that exist outside of the core Hazelcast API. Hazeltask is a project on GitHub that promises failover handling and task resubmission - so that might be worth a look?

Related

JVM: is it possible to manipulate frame stack?

Suppose I need to execute N tasks in the same thread. The tasks may sometimes need some values from an external storage. I have no idea in advance which task may need such a value and when. It is much faster to fetch M values in one go rather than the same M values in M queries to the external storage.
Note that I cannot expect cooperation from tasks themselves, they can be concidered as nothing more than java.lang.Runnable objects.
Now, the ideal procedure, as I see it, would look like
Execute all tasks in a loop. If a task requests an external value, remember this, suspend the task and switch to the next one.
Fetch the values requested at the previous step, all at once.
Remove all completed task (suspended ones don't count as completed).
If there are still tasks left, go to step 1, but instead of executing a task, continue its execution from the suspended state.
As far as I see, the only way to "suspend" and "resume" something would be to remove its related frames from JVM stack, store them somewhere, and later push them back onto the stack and let JVM continue.
Is there any standard (not involving hacking at lower level than JVM bytecode) way to do this?
Or can you maybe suggest another possible way to achieve this (other than starting N threads or making tasks cooperate in some way)?

It's possible using something like quasar that does stack-slicing via an agent. Some degree of cooperation from the tasks is helpful, but it is possible to use AOP to insert suspension points from outside.
(IMO it's better to be explicit about what's going on (using e.g. Future and ForkJoinPool). If some plain code runs on one thread for a while and is then "magically" suspended and jumps to another thread, this can be very confusing to debug or reason about. With modern languages and libraries the overhead of being explicit about the asynchronicity boundaries should not be overwhelming. If your tasks are written in terms of generic types then it's fairly easy to pass-through something like scalaz Future. But that wouldn't meet your requirements as given).

As mentioned, Quasar does exactly that (it usually schedules N fibers on M threads, but you can set M to 1), using bytecode transformations. It even gives each task (AKA "fiber") its own stack trace, so you can dump it and get a complete stack trace without any interference from any other task sharing the thread.

Well you could try this
you need
A mechanism to save the current state of the task because when the task returns its frame would be popped from the call stack. Based on the return value or something like that you can determine weather it completed or not since you would need to re-execute it from the point where it left thus u need to preserve the state information.
Create a Request Data structure for each task. When ever a task wants to request something it logs it there , The data structure should support all the possible request a task can make.
Store these DS in a Map. At the end of the loop you can query this DS to determine the kind of resource required by each task.
get the resource put it in the DS . Start the task from the state when it returned.
The task queries the DS gets the resource.
The task should use this DS when ever it wants to use an external resource.
you would need to design the method in which resource is requested with special consideration since when you will re-execute the task again you would need to call this method yourself so that the task can execute from where it left.
*DS -> Data Structure
hope it helps.

zookeeper example - distributed math calculation

I am trying to learn zookeeper but all the tutorials and explanations are way too abstract for me have a clear view of how it could benefit my life, or, it is just another tool that is "awesome" but no one will ever use directly in real life.
I understand zookeeper is a "coordination tool", master, worker, assign task and a bunch of failure prevention staff. However, I have a very 'naive' real-world problem and wondering if zookeeper itself will help me solve it.
Say I have a big big file with many many lines of numbers, like below:
1000
23213
3231
4213
..
And the goal/output is to come up with another file, which contains the square of corresponding line.
1000^2
23213^2
...
I actually have a real-life use case that I implemented using python-flask server to distribute the work based on request from the workers but it is too fragile. Also I cannot easily track the failures. And I am wondering will zookeeper be the solution.
Can any zookeeper expert help me write some example code to distribute this work to maybe 3 computers. And in the end, send the data back to the master.
I totally understand it must be super easy to use map reduce or multithreading to make it happen but I am wondering would it be possible to just use zookeeper to show the idea that "zookeeper is a coordination tool".

One common way to use Zookeeper is by taking advantage of ephemeral nodes as locks so as to create a distributed work queue.
Workers browse a list inside ZK and attempt to create an ephemeral "lock" node.
If the attempt fails it means another worker has locked the node.
If the attempt succeeds the worker can do the operation (in your case do the math), and then write a new node and delete the old node.
The power of the ephemeral lock is that if the worker dies for whatever reason the connection is broken and ZK guarantees that the lock goes away automatically.

Is there an API that allows ordering event in clustered application?

Given the following facts, is there a existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment:
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with
its source ID, makes it uniquely identifiable.
3) The ids can be used to sort the events.
4) There are M application servers receiving those events.
M is normally 3.
5) The events can arrive at any one or more of the application
servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree for each batch on the list of events
to process.
8) The event each have earliest and latest batch ID in which they
must be processed.
9) They must not be processed earlier, and are "failed" if they
cannot be processed before the deadline.
10) The batches are based on the real clock time. For example,
one batch per second.
11) The events of a batch are processed when 2 of the 3 servers
agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the
required events before it can process that batch too.
13) Once an event was processed or failed, the source has to be
informed.
14) [EDIT] Events from one source must be processed (or failed) in
the order of their ID/timestamp, but there is no causality
between different sources.
Less formally, I have those servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline. But I'm not sure if that actually make any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.

Looking at the question and your follow-up there probably "wasn't" an API to satisfy your requirements. To day you could take a look at the Kafka (from LinkedIn)
Apache Kafka
And the general concept of "a log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
Actually for your question, I'd begin with the blog about "the log". In my terms the way it works -- And Kafka isn't the only package out doing log handling -- Works as follows:
Instead of a queue based message-passing / publish-subscribe
Kafka uses a "log" of messages
Subscribers (or end-points) can consume the log
The log guarantees to be "in-order"; it handles giga-data, is fast
Double check on the guarantee, there's usually a trade-off for reliability
You just read the log, I think reads are destructive as the default.
If there's a subscriber group, everyone can 'read' before the log-entry dies.
The basic handling (compute) process for the log, is a Map-Reduce-Filter model so you read-everything really fast; keep only stuff you want; process it (reduce) produce outcome(s).
The downside seems to be you need clusters and stuff to make it really shine. Since different servers or sites was mentioned I think we are still on track. I found it a finicky to get up-and-running with the Apache downloads because the tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
Which would need you to do the plumbing for connecting different servers. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper organised message/log software(s). Good luck and Enjoy!

So far I haven't got a clear answer, but I think it would be useful anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurent process wait for each other, which I could use to synchronize the "batches" is a distributed barrier. One Java implementation seems to be available on top of Hazelcast and another uses ZooKeeper
One simpler alternative I found is to use a DB. Every process inserts all events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, like in VoltDB, for example. Then at regular interval of one second, some "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first to run the job for one batches fixes the set of events, so that the others just get to use the list that the first one defined. Like that we have a guarantee that all batches contain the same set of event on all servers. And if we can use a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
Docs says:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO > (first in, first out) order. In general, tasks are inserted into the end of a queue, and
executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency
possible for any given task via specially optimized notifications to the scheduler.
Thus, in the case that a queue has a large backlog of tasks, the
system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the
earliest time that a task can execute. App Engine always waits until
after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum
number of seconds to wait before executing a task. Countdown and eta
are mutually exclusive; if you specify one, do not specify the other.
What I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of the event processing has to be assured. I need process by timestamp order, which is generated on a embedded device inside the vehicle.
What I have now?
A Rest servlet that is called by the consumer and creates a Task (Event data is on payload).
After this, a worker servlet get this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?

You need to approach this with three separate steps:
Implement a Sharding Counter to generate a monotonically
increasing ID. As much as I like to use the timestamp from
Google's server to indicate task ordering, it appears that timestamps
between GAE servers might vary more than what your requirement is.
Add your tasks to a Pull Queue instead of a Push Queue. When
constructing your TaskOption, add the ID obtained from Step #1 as a tag.
After adding the task, store the ID somewhere on your datastore.
Have your worker servlet lease Tasks by a certain tag from the Pull Queue.
Query the datastore to get the earliest ID that you need to fetch, and use the ID as
the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumptions on the Backend.
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be viable to generate a monotonically increasing IDs, and other sources of IDs would then be needed. I seem to recall you were using timestamps generated by your vehicles, would it be possible to use this in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.

I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.

Ok. This is how I've done it.
1) Rest servlet that is called from the consumer:
If Event sequence doesn't match Vehicle sequence (from datastore)
Creates a task on a "wait" queue to call me again
else
State validation
Creates a task on the "regular" queue (Event data is on payload).
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do a data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.

You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.

Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.

The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follow:
when creating the task don't put the data on payload but in the datastore, in a Kind with an ordering on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicule?). The timestamps should come from the client, not the server, to guarantee the same ordering.
run a generic task that fetch the data for the first timestamp, process it, and then delete it, inside a transaction.

Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. Latter has more options vs. former.

Techniques for modeling a dynamic dataflow with Java concurrency API

EDIT: This is basically a "how to properly implement a data flow engine in Java" question, and I feel this cannot be adequately answered in a single answer (it's like asking, "how to properly implement an ORM layer" and getting someone to write out the details of Hibernate or something), so consider this question "closed".
Is there an elegant way to model a dynamic dataflow in Java? By dataflow, I mean there are various types of tasks, and these tasks can be "connected" arbitrarily, such that when a task finishes, successor tasks are executed in parallel using the finished tasks output as input, or when multiple tasks finish, their output is aggregated in a successor task (see flow-based programming). By dynamic, I mean that the type and number of successors tasks when a task finishes depends on the output of that finished task, so for example, task A may spawn task B if it has a certain output, but may spawn task C if has a different output. Another way of putting it is that each task (or set of tasks) is responsible for determining what the next tasks are.
Sample dataflow for rendering a webpage: I have as task types: file downloader, HTML/CSS renderer, HTML parser/DOM builder, image renderer, JavaScript parser, JavaScript interpreter.
File downloader task for HTML file
HTML parser/DOM builder task
File downloader task for each embedded file/link
If image, image renderer
If external JavaScript, JavaScript parser
JavaScript interpreter
Otherwise, just store in some var/field in HTML parser task
JavaScript parser for each embedded script
JavaScript interpreter
Wait for above tasks to finish, then HTML/CSS renderer (obviously not optimal or perfectly correct, but this is simple)
I'm not saying the solution needs to be some comprehensive framework (in fact, the closer to the JDK API, the better), and I absolutely don't want something as heavyweight is say Spring Web Flow or some declarative markup or other DSL.
To be more specific, I'm trying to think of a good way to model this in Java with Callables, Executors, ExecutorCompletionServices, and perhaps various synchronizer classes (like Semaphore or CountDownLatch). There are a couple use cases and requirements:
Don't make any assumptions on what executor(s) the tasks will run on. In fact, to simplify, just assume there's only one executor. It can be a fixed thread pool executor, so a naive implementation can result in deadlocks (e.g. imagine a task that submits another task and then blocks until that subtask is finished, and now imagine several of these tasks using up all the threads).
To simplify, assume that the data is not streamed between tasks (task output->succeeding task input) - the finishing task and succeeding task don't have to exist together, so the input data to the succeeding task will not be changed by the preceeding task (since it's already done).
There are only a couple operations that the dataflow "engine" should be able to handle:
A mechanism where a task can queue more tasks
A mechanism whereby a successor task is not queued until all the required input tasks are finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until the flow is finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until certain tasks have finished
Since the dataflow is dynamic (depends on input/state of the task), the activation of these mechanisms should occur within the task code, e.g. the code in a Callable is itself responsible for queueing more Callables.
The dataflow "internals" should not be exposed to the tasks (Callables) themselves - only the operations listed above should be available to the task.
Note that the type of the data is not necessarily the same for all tasks, e.g. a file download task may accept a File as input but will output a String.
If a task throws an uncaught exception (indicating some fatal error requiring all dataflow processing to stop), it must propagate up to the thread that initiated the dataflow as quickly as possible and cancel all tasks (or something fancier like a fatal error handler).
Tasks should be launched as soon as possible. This along with the previous requirement should preclude simple Future polling + Thread.sleep().
As a bonus, I would like to dataflow engine itself to perform some action (like logging) every time task is finished or when no has finished in X time since last task has finished. Something like: ExecutorCompletionService<T> ecs; while (hasTasks()) { Future<T> future = ecs.poll(1 minute); some_action_like_logging(); if (future != null) { future.get() ... } ... }
Are there straightforward ways to do all this with Java concurrency API? Or if it's going to complex no matter what with what's available in the JDK, is there a lightweight library that satisfies the requirements? I already have a partial solution that fits my particular use case (it cheats in a way, since I'm using two executors, and just so you know, it's not related at all to the web browser example I gave above), but I'd like to see a more general purpose and elegant solution.

How about defining interface such as:
interface Task extends Callable {
boolean isReady();
}
Your "dataflow engine" would then just need to manage a collection of Task objects i.e. allow new Task objects to be queued for excecution and allow queries as to the status of a given task (so maybe the interface above needs extending to include id and/or type). When a task completes (and when the engine starts of course) the engine must just query any unstarted tasks to see if they are now ready, and if so pass them to be run on the executor. As you mention, any logging, etc. could also be done then.
One other thing that may help is to use Guice (http://code.google.com/p/google-guice/) or a similar lightweight DI framework to help wire up all the objects correctly (e.g. to ensure that the correct executor type is created, and to make sure that Tasks that need access to the dataflow engine (either for their isReady method or for queuing other tasks, say) can be provided with an instance without introducing complex circular relationships.
HTH, but please do comment if I've missed any key aspects...
Paul.

Look at https://github.com/rfqu/df4j — a simple but powerful dataflow library. If it lacks some desired features, they can be added easily.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.