Techniques for modeling a dynamic dataflow with Java concurrency API

Techniques for modeling a dynamic dataflow with Java concurrency API - java

EDIT: This is basically a "how to properly implement a data flow engine in Java" question, and I feel this cannot be adequately answered in a single answer (it's like asking, "how to properly implement an ORM layer" and getting someone to write out the details of Hibernate or something), so consider this question "closed".
Is there an elegant way to model a dynamic dataflow in Java? By dataflow, I mean there are various types of tasks, and these tasks can be "connected" arbitrarily, such that when a task finishes, successor tasks are executed in parallel using the finished tasks output as input, or when multiple tasks finish, their output is aggregated in a successor task (see flow-based programming). By dynamic, I mean that the type and number of successors tasks when a task finishes depends on the output of that finished task, so for example, task A may spawn task B if it has a certain output, but may spawn task C if has a different output. Another way of putting it is that each task (or set of tasks) is responsible for determining what the next tasks are.
Sample dataflow for rendering a webpage: I have as task types: file downloader, HTML/CSS renderer, HTML parser/DOM builder, image renderer, JavaScript parser, JavaScript interpreter.
File downloader task for HTML file
HTML parser/DOM builder task
File downloader task for each embedded file/link
If image, image renderer
If external JavaScript, JavaScript parser
JavaScript interpreter
Otherwise, just store in some var/field in HTML parser task
JavaScript parser for each embedded script
JavaScript interpreter
Wait for above tasks to finish, then HTML/CSS renderer (obviously not optimal or perfectly correct, but this is simple)
I'm not saying the solution needs to be some comprehensive framework (in fact, the closer to the JDK API, the better), and I absolutely don't want something as heavyweight is say Spring Web Flow or some declarative markup or other DSL.
To be more specific, I'm trying to think of a good way to model this in Java with Callables, Executors, ExecutorCompletionServices, and perhaps various synchronizer classes (like Semaphore or CountDownLatch). There are a couple use cases and requirements:
Don't make any assumptions on what executor(s) the tasks will run on. In fact, to simplify, just assume there's only one executor. It can be a fixed thread pool executor, so a naive implementation can result in deadlocks (e.g. imagine a task that submits another task and then blocks until that subtask is finished, and now imagine several of these tasks using up all the threads).
To simplify, assume that the data is not streamed between tasks (task output->succeeding task input) - the finishing task and succeeding task don't have to exist together, so the input data to the succeeding task will not be changed by the preceeding task (since it's already done).
There are only a couple operations that the dataflow "engine" should be able to handle:
A mechanism where a task can queue more tasks
A mechanism whereby a successor task is not queued until all the required input tasks are finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until the flow is finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until certain tasks have finished
Since the dataflow is dynamic (depends on input/state of the task), the activation of these mechanisms should occur within the task code, e.g. the code in a Callable is itself responsible for queueing more Callables.
The dataflow "internals" should not be exposed to the tasks (Callables) themselves - only the operations listed above should be available to the task.
Note that the type of the data is not necessarily the same for all tasks, e.g. a file download task may accept a File as input but will output a String.
If a task throws an uncaught exception (indicating some fatal error requiring all dataflow processing to stop), it must propagate up to the thread that initiated the dataflow as quickly as possible and cancel all tasks (or something fancier like a fatal error handler).
Tasks should be launched as soon as possible. This along with the previous requirement should preclude simple Future polling + Thread.sleep().
As a bonus, I would like to dataflow engine itself to perform some action (like logging) every time task is finished or when no has finished in X time since last task has finished. Something like: ExecutorCompletionService<T> ecs; while (hasTasks()) { Future<T> future = ecs.poll(1 minute); some_action_like_logging(); if (future != null) { future.get() ... } ... }
Are there straightforward ways to do all this with Java concurrency API? Or if it's going to complex no matter what with what's available in the JDK, is there a lightweight library that satisfies the requirements? I already have a partial solution that fits my particular use case (it cheats in a way, since I'm using two executors, and just so you know, it's not related at all to the web browser example I gave above), but I'd like to see a more general purpose and elegant solution.

How about defining interface such as:
interface Task extends Callable {
boolean isReady();
}
Your "dataflow engine" would then just need to manage a collection of Task objects i.e. allow new Task objects to be queued for excecution and allow queries as to the status of a given task (so maybe the interface above needs extending to include id and/or type). When a task completes (and when the engine starts of course) the engine must just query any unstarted tasks to see if they are now ready, and if so pass them to be run on the executor. As you mention, any logging, etc. could also be done then.
One other thing that may help is to use Guice (http://code.google.com/p/google-guice/) or a similar lightweight DI framework to help wire up all the objects correctly (e.g. to ensure that the correct executor type is created, and to make sure that Tasks that need access to the dataflow engine (either for their isReady method or for queuing other tasks, say) can be provided with an instance without introducing complex circular relationships.
HTH, but please do comment if I've missed any key aspects...
Paul.

Look at https://github.com/rfqu/df4j — a simple but powerful dataflow library. If it lacks some desired features, they can be added easily.

Related

Reactive Programming vs Thread Based Programming

I am new to this concept and want to have a great understanding of this topic.
To make my point clear I want to take an analogy.
Let's take a scenario of Node JS which is single-threaded and provide fast IO operation using an event loop. Now that makes sense since It is single-threaded and is not blocked for any task.
While studying reactive programming in Java using reactor. I came to a situation where the main thread is blocked when an object subscribes and some delay event took place.
Then I came to know the concept of subscribeOn.boundedElastic and many more pipelines like this.
I got it that they are trying to make it asynchronous by moving those subscribers to other threads.
But if it occurs like this then why is the asynchronous. Is it not thread-based programming?
If we are trying to achieve the async behaviour of Node JS then according to my view it should be in a single thread.
Summary of my question is:
So I don't get the fact of using or calling reactive programming as asynchronous or functional programming because of two reason
Main thread is blocked
We can manage the thread and can run it in another pool. Runnable service/ callable we can also define.

First of all you can't compare asynchronous with functional programming. Its like comparing a rock with a banana. Its two separate things.
Functional programming is compared to other types of programming, like object oriented programming or procedural programming etc. etc.
Reactor is a java library, and java is an object oriented programming language with functional features.
Asynchronous i will explain with what wikipedia says
Asynchrony, in computer programming, refers to the occurrence of events independent of the main program flow and ways to deal with such events.
So basically how to handle stuff "around" your application, that is not a part of the main flow of your program.
In comparison to Blocking, wikipedia again:
A process that is blocked is one that is waiting for some event, such as a resource becoming available or the completion of an I/O operation.
A traditional servlet application works by assigning one thread per request.
So every time a request comes in, a thread is spawned, and this thread follows along the request until the request returns. If there is something blocking during this request, for instance reading a file from the operating system, or making a request to another service. The assigned thread will block and wait until the reading of the file is completed, or the request has returned etc.
Reactive works with subscribers and producers and makes heavy use of the observer pattern. Which means that as soon as some thing blocks, reactor can take that thread and use it for something else. And then it is un-blocked any thread can pick up where it left off. This makes sure that every thread is always in use, and utilized at 100%.
All things processed in reactor is done by the event loop the event loop is a single threaded loop that just processes events as quick as possible. Schedulers schedule things to be processed on the event loop, and after they are processed a scheduler picks up the result and carries on.
If you just run reactor you get a default scheduler that will schedule things for you completely automatically.
But lets say you have something blocking. Well then you will stop the event loop. And everything needs to wait for that thing to finish.
When you run a fully reactive application you usually get one event loop per core during startup. Which means lets say you have 4 cores, you get 4 event loops and you block one, then during that period of blockages your application runs 25% slower.
25% slower is a lot!
Well sometimes you have something that is blocking that you can't avoid. For instance an old database that doesn't have a non-blocking driver. Or you need to read files from the operating system in a blocking manor. How do you do then?
Well the reactor team built in a fallback, so that if you use onSubscribe in combination with its own elastic thread pool, then you will get the old servlet behaviour back for that single subscriber to a specific say endpoint etc.
This makes sure that you can run fully reactive stuff side by side with old legacy blocking things. So that maybe some reaquests usese the old servlet behaviour, while other requests are fully non-blocking.
You question is not very clear so i am giving you a very unclear answer. I suggest you read the reactor documentation and try out all their examples, as most of this information comes from there.

Java support for three different concurrency models

I am going through different concurrency model in multi-threading environment (http://tutorials.jenkov.com/java-concurrency/concurrency-models.html)
The article highlights about three concurrency models.
Parallel Workers
The first concurrency model is what I call the parallel worker model. Incoming jobs are assigned to different workers.
Assembly Line
The workers are organized like workers at an assembly line in a factory. Each worker only performs a part of the full job. When that part is finished the worker forwards the job to the next worker.
Each worker is running in its own thread, and shares no state with other workers. This is also sometimes referred to as a shared nothing concurrency model.
Functional Parallelism
The basic idea of functional parallelism is that you implement your program using function calls. Functions can be seen as "agents" or "actors" that send messages to each other, just like in the assembly line concurrency model (AKA reactive or event driven systems). When one function calls another, that is similar to sending a message.
Now I want to map java API support for these three concepts
Parallel Workers : Is it ExecutorService,ThreadPoolExecutor, CountDownLatch API?
Assembly Line : Sending an event to messaging system like JMS & using messaging concepts of Queues & Topics.
Functional Parallelism: ForkJoinPool to some extent & java 8 streams. ForkJoin pool is easy to understand compared to streams.
Am I correct in mapping these concurrency models? If not please correct me.

Each of those models says how the work is done/splitted from a general point of view, but when it comes to implementation, it really depends on your exact problem. Generally I see it like this:
Parallel Workers: a producer creates new jobs somewhere (e.g in a BlockingQueue) and many threads (via an ExecutorService) process those jobs in parallel. Of course, you could also use a CountDownLatch, but that means you want to trigger an action after exactly N subproblems have been processed (e.g you know your big problem may be split in N smaller problems, check the second example here).
Assembly Line: for every intermediate step, you have a BlockingQueue and one Thread or an ExecutorService. On each step the jobs are taken from one BlickingQueue and put in the next one, to be processed further. To your idea with JMS: JMS is there to connect distributed components and is part of the Java EE and was not thought to be used in a high concurrent context (messages are kept usually on the hard disk, before being processed).
Functional Parallelism: ForkJoinPool is a good example on how you could implement this.

An excellent question to which the answer might not be quite as satisfying. The concurrency models listed show some of the ways you might want to go about implementing an concurrent system. The API provides tools used to implementing any of these models.
Lets start with ExecutorService. It allows you to submit tasks to be executed in a non-blocking way. The ThreadPoolExecutor implementation then limits the maximum number of threads available. The ExecutorService does not require the task to perform the complete process as you might expect of a parallel worker. The task may be limited to specific part of the process and send a message upon completion that starts the next step in an assembly line.
The CountDownLatch and the ExecutorService provide a means to block until all workers have completed that may come in handy if a certain process has been divided to different concurrent sub-tasks.
The point of JMS is to provide a means for messaging between components. It does not enforce a specific model for concurrency. Queues and topics denote how a message is sent from a publisher to a subscriber. When you use queues the message is sent to exactly one subscriber. Topics on the other hand broadcast the message to all subscribers of the topic.
Similar behavior could be achieved within a single component by for example using the observer pattern.
ForkJoinPool is actually one implementation of ExecutorService (which might highlight the difficulty of matching a model and an implementation detail). It just happens to be optimized for working with large amount of small tasks.
Summary: There are multiple ways to implement a certain concurrency model in the Java environment. The interfaces, classes and frameworks used in implementing a program may vary regardless of the concurrency model chosen.

Actor model is another example for an Assembly line. Ex: akka

JVM: is it possible to manipulate frame stack?

Suppose I need to execute N tasks in the same thread. The tasks may sometimes need some values from an external storage. I have no idea in advance which task may need such a value and when. It is much faster to fetch M values in one go rather than the same M values in M queries to the external storage.
Note that I cannot expect cooperation from tasks themselves, they can be concidered as nothing more than java.lang.Runnable objects.
Now, the ideal procedure, as I see it, would look like
Execute all tasks in a loop. If a task requests an external value, remember this, suspend the task and switch to the next one.
Fetch the values requested at the previous step, all at once.
Remove all completed task (suspended ones don't count as completed).
If there are still tasks left, go to step 1, but instead of executing a task, continue its execution from the suspended state.
As far as I see, the only way to "suspend" and "resume" something would be to remove its related frames from JVM stack, store them somewhere, and later push them back onto the stack and let JVM continue.
Is there any standard (not involving hacking at lower level than JVM bytecode) way to do this?
Or can you maybe suggest another possible way to achieve this (other than starting N threads or making tasks cooperate in some way)?

It's possible using something like quasar that does stack-slicing via an agent. Some degree of cooperation from the tasks is helpful, but it is possible to use AOP to insert suspension points from outside.
(IMO it's better to be explicit about what's going on (using e.g. Future and ForkJoinPool). If some plain code runs on one thread for a while and is then "magically" suspended and jumps to another thread, this can be very confusing to debug or reason about. With modern languages and libraries the overhead of being explicit about the asynchronicity boundaries should not be overwhelming. If your tasks are written in terms of generic types then it's fairly easy to pass-through something like scalaz Future. But that wouldn't meet your requirements as given).

As mentioned, Quasar does exactly that (it usually schedules N fibers on M threads, but you can set M to 1), using bytecode transformations. It even gives each task (AKA "fiber") its own stack trace, so you can dump it and get a complete stack trace without any interference from any other task sharing the thread.

Well you could try this
you need
A mechanism to save the current state of the task because when the task returns its frame would be popped from the call stack. Based on the return value or something like that you can determine weather it completed or not since you would need to re-execute it from the point where it left thus u need to preserve the state information.
Create a Request Data structure for each task. When ever a task wants to request something it logs it there , The data structure should support all the possible request a task can make.
Store these DS in a Map. At the end of the loop you can query this DS to determine the kind of resource required by each task.
get the resource put it in the DS . Start the task from the state when it returned.
The task queries the DS gets the resource.
The task should use this DS when ever it wants to use an external resource.
you would need to design the method in which resource is requested with special consideration since when you will re-execute the task again you would need to call this method yourself so that the task can execute from where it left.
*DS -> Data Structure
hope it helps.

Java Framework for managing Tasks

my question is, whether there exists a framework in Java for managing and concurrently running Tasks that have logical dependencies.
My Task is as follows:
I have a lot of independent tasks (Let's say A,B,C,D...), They are implemented as Commands (like in Command pattern). I would like to have a kind of executor which will accept all these tasks and execute them in a parallel manner.
The tasks can be dependent one on another (For example, I can't run C, Before I run A), synchronous or asynchronous.
I would also like to incorporate the custom heuristics to affect the scheduler execution, for example if tasks A and B are CPU-intensive and C is, say, has high Memory consumption, It makes sense to run A and C in parallel, rather than running A and B.
Before diving into building this stuff by myself (i'm thinking about java.util.concurrent + annotation based constraints/rules), I was wondering, if someone could point me on some project that could suit my needs.
Thanks a lot in advance

I don't think that a there is a framework for managing tasks that could fulfill your requirements. You are on the right path using the Command pattern. You could take a look at the Akka framework for a simplified concurrency model. Akka is based on the Actor model:
The actor model is another very simple
high level concurrency model: actors
can’t respond to more than one message
at a time (messages are queued into
mailboxes) and can only communicate by
sending messages, not sharing
variables. As long as the messages are
immutable data structures (which is
always true in Erlang, but has to be a
convention in languages without means
of ensuring this property), everything
is thread-safe, without need for any
other mechanism. This is very similar
to request cycle found in web
development MVC frameworks.
http://metaphysicaldeveloper.wordpress.com/2010/12/16/high-level-concurrency-with-jruby-and-akka-actors/
Akka is written in Scala but it exposes clean Java API.

I'd recommend you to examine possibility to use ant for this purpose. Although ant is known as a popular build tool it actually the XML controlled engine that runs various tasks. I think that its flag fork=true does exactly what you need: runs tasks concurrently. As any java application ant can be executed from other java application: just call its main method. In this case you can wrap your tasks using ant API, i.e. implement them as Ant tasks.
I have never try this approach but I believe it should work. I thought about it several years ago and suggested it to my management as a possible solution for problem similar to yours.

Eclipse's job scheduling module is able to handle interdependent tasks. Take a look at http://www.eclipse.org/articles/Article-Concurrency/jobs-api.html.

There is a framework specifically for this purpose called dexecutor (Disclaimer : I am the owner)
Dexecutor is a very light weight framework to execute dependent/independent tasks in a reliable way, to do this it provides the minimal API.
An API to add nodes in the graph (addDependency, addIndependent, addAsDependentOnAllLeafNodes, addAsDependencyToAllInitialNodes Later two are the hybrid version of the first two)
and the other to execute the nodes in order.
Here is the simplest example :
DefaultDependentTasksExecutor<Integer, Integer> executor = newTaskExecutor();
executor.addDependency(1, 2);
executor.addDependency(1, 2);
executor.addDependency(1, 3);
executor.addDependency(3, 4);
executor.addDependency(3, 5);
executor.addDependency(3, 6);
//executor.addDependency(10, 2); // cycle
executor.addDependency(2, 7);
executor.addDependency(2, 9);
executor.addDependency(2, 8);
executor.addDependency(9, 10);
executor.addDependency(12, 13);
executor.addDependency(13, 4);
executor.addDependency(13, 14);
executor.addIndependent(11);
executor.execute(ExecutionBehavior.RETRY_ONCE_TERMINATING);
Here is how the dependency graph would be constructed
Tasks 1,12,11 would run in parallel, once on of these tasks finishes dependent tasks would run, for example, lets say task 1 finishes, tasks 2 and 3 would run similarly once task 12, finishes task 13 would run and so on.

Implementing java FixedTreadPool status listener

It's about an application which is supposed to process (VAD, Loudness, Clipping) a lot of soundfiles (e.g. 100k). At this time, I create as many worker threads (callables) as I can put into memory, and then run all with a threadPool.invokeAll(), write results to file system, unload processed files and continue at step 1. Due to the fact it's an app with a GUI, i don't want to user to feel like the app "is not responding" while processing all soundfiles. (which it does at this time cause invokeAll is blocking). I'm not sure what is a "good" way to fix this. It shall not be possible for the user to do other things while processing, but I'd like to show a progress bar like "10 of 100000 soundfiles are done". So how do I get there? Do I have to create a "watcher thread", so that every worker hold a callback on it? I'm quite new to multi threading, and don't get the idea of such a mechanism.
If you need to know: I'm using SWT/JFace.

You could use an ExecutorCompletionService for this purpose; if you submit each of the Callable tasks in a loop, you can then call the take method of the completion service - receiving tasks one at a time as they finish. Every time you take a task, you can update your GUI.
As another option, you could implement your own ExecutorService that is also an Observable, allowing the publication of updates to subscribing Observers whenever a task is completed.

You should have a look at SwingWorker. It's a good class for doing lengthy operations whilst reporting back progress to the gui and maintaining a responsive gui.
Using a Swing Worker Thread provides some good information.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.