Java Framework for managing Tasks

My question is whether there exists a Java framework for managing and concurrently running tasks that have logical dependencies.
My task is as follows:
I have a lot of independent tasks (let's say A, B, C, D...), implemented as commands (as in the Command pattern). I would like a kind of executor which will accept all these tasks and execute them in parallel.
The tasks can depend on one another (for example, I can't run C before I run A) and can be synchronous or asynchronous.
I would also like to incorporate custom heuristics to affect the scheduler's execution; for example, if tasks A and B are CPU-intensive and C has high memory consumption, it makes sense to run A and C in parallel rather than A and B.
Before diving into building this myself (I'm thinking java.util.concurrent plus annotation-based constraints/rules), I was wondering if someone could point me to a project that suits my needs.
Thanks a lot in advance

I don't think there is an existing framework for managing tasks that fulfills your requirements. You are on the right path using the Command pattern. You could take a look at the Akka framework for a simplified concurrency model. Akka is based on the Actor model:
The actor model is another very simple high level concurrency model: actors can't respond to more than one message at a time (messages are queued into mailboxes) and can only communicate by sending messages, not sharing variables. As long as the messages are immutable data structures (which is always true in Erlang, but has to be a convention in languages without means of ensuring this property), everything is thread-safe, without need for any other mechanism. This is very similar to the request cycle found in web development MVC frameworks.
http://metaphysicaldeveloper.wordpress.com/2010/12/16/high-level-concurrency-with-jruby-and-akka-actors/
Akka is written in Scala, but it exposes a clean Java API.
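For illustration, here is a minimal sketch of a task wrapped in an actor using Akka's classic Java API (the class name, actor name and message handling are placeholders; a real setup would send completion messages to the actors of dependent tasks):
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

// Each task becomes an actor; a dependency "C needs A" is expressed by
// having A's actor send a message to C's actor when A completes.
public class TaskActor extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, taskName -> {
                    System.out.println("Running task " + taskName);
                    // on completion: tell the actors of dependent tasks to start
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("tasks");
        ActorRef taskA = system.actorOf(Props.create(TaskActor.class), "taskA");
        taskA.tell("A", ActorRef.noSender());
    }
}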

I'd recommend you examine the possibility of using Ant for this purpose. Although Ant is known as a popular build tool, it is actually an XML-controlled engine that runs various tasks. I think its <parallel> container (possibly combined with fork=true) does what you need: it runs nested tasks concurrently. Like any Java application, Ant can be executed from another Java application: just call its main method. In this case you can wrap your tasks using the Ant API, i.e., implement them as Ant tasks.
I have never tried this approach, but I believe it should work. I thought about it several years ago and suggested it to my management as a possible solution for a problem similar to yours.
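As a rough illustration, here is a minimal sketch of driving Ant from Java (it assumes ant.jar is on the classpath and that a build.xml wraps your tasks as Ant targets; Ant then resolves the targets' depends attributes itself):
import java.io.File;
import org.apache.tools.ant.Project;
import org.apache.tools.ant.ProjectHelper;

public class AntRunner {
    public static void main(String[] args) {
        // Load build.xml and run its default target.
        Project project = new Project();
        project.init();
        ProjectHelper.configureProject(project, new File("build.xml"));
        project.executeTarget(project.getDefaultTarget());
    }
}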

Eclipse's job scheduling module is able to handle interdependent tasks. Take a look at http://www.eclipse.org/articles/Article-Concurrency/jobs-api.html.

There is a framework specifically for this purpose called dexecutor (disclaimer: I am the owner).
Dexecutor is a very lightweight framework to execute dependent/independent tasks in a reliable way; to do this it provides a minimal API:
an API to add nodes to the graph (addDependency, addIndependent, addAsDependentOnAllLeafNodes, addAsDependencyToAllInitialNodes; the latter two are hybrid versions of the first two),
and another to execute the nodes in order.
Here is the simplest example:
DefaultDependentTasksExecutor<Integer, Integer> executor = newTaskExecutor();
executor.addDependency(1, 2);
executor.addDependency(1, 3);
executor.addDependency(3, 4);
executor.addDependency(3, 5);
executor.addDependency(3, 6);
//executor.addDependency(10, 2); // cycle
executor.addDependency(2, 7);
executor.addDependency(2, 9);
executor.addDependency(2, 8);
executor.addDependency(9, 10);
executor.addDependency(12, 13);
executor.addDependency(13, 4);
executor.addDependency(13, 14);
executor.addIndependent(11);
executor.execute(ExecutionBehavior.RETRY_ONCE_TERMINATING);
Here is how the dependency graph would be constructed:
Tasks 1, 12 and 11 would run in parallel; once one of these tasks finishes, its dependent tasks would run. For example, when task 1 finishes, tasks 2 and 3 would run; similarly, once task 12 finishes, task 13 would run, and so on.

Related

Creating Threads with Java in App Engine Standard Environment

I'm new to Google Cloud Platform. I'm using the App Engine standard environment. I need to create threads in Java, but I think it's not possible, is it?
Here is the situation:
I need to create Feeds for users.
There are three databases with names d1, d2, d3.
Whenever a user sends a request for feeds, Java creates three threads, one for each database: for example, t1 for d1, t2 for d2 and t3 for d3. These threads must run asynchronously for better performance, and afterwards the data from the three threads is combined and sent back to the user in the response.
I know how to write the code for this, but as you know, I need threads for this to work. If the App Engine standard environment doesn't allow them, then what can I do? Is there any other way?
In the GCP documentation they said:
To avoid using threads, consider Task Queues
I read about Task Queues. There are two types of queues: push and pull. Both run asynchronously, but they do not send a response back to the user. I think they are only designed to complete tasks in the background.
Can you please let me know how can I achieve my goal? What things I need to learn for this?
Note: this answer is based solely on the documentation; I'm not a Java user.
Threads are supported by the standard environment, but with restrictions. From Threads:
Caution: Threads are a powerful feature that are full of surprises. To learn more about using threads with Java, we recommend Goetz, Java Concurrency in Practice.
A Java application can create a new thread, but there are some restrictions on how to do it. These threads can't "outlive" the request that creates them.
An application can:
Implement java.lang.Runnable.
Create a thread factory by calling com.google.appengine.api.ThreadManager.currentRequestThreadFactory().
Call the factory's newRequestThread method, passing in the Runnable, newRequestThread(runnable), or use the factory object returned by com.google.appengine.api.ThreadManager.currentRequestThreadFactory() with an ExecutorService (e.g., call Executors.newCachedThreadPool(factory)).
However, you must use one of the methods on ThreadManager to create your threads. You cannot invoke new Thread() yourself or use the default thread factory.
An application can perform operations against the current thread, such as thread.interrupt().
Each request is limited to 50 concurrent request threads. The Java runtime will throw a java.lang.IllegalStateException if you try to create more than 50 threads in a single request.
When using threads, use high level concurrency objects, such as Executor and Runnable. Those take care of many of the subtle but important details of concurrency like Interrupts and scheduling and bookkeeping.
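Putting the quoted rules together, a minimal sketch of fanning out the three database reads within a single request might look like this (queryFeed is a hypothetical stand-in for your real database call):
import com.google.appengine.api.ThreadManager;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

public class FeedService {
    // Query d1, d2, d3 in parallel on App Engine request threads,
    // then combine the results for the response.
    String buildFeed() throws InterruptedException, ExecutionException {
        ThreadFactory factory = ThreadManager.currentRequestThreadFactory();
        ExecutorService pool = Executors.newFixedThreadPool(3, factory);
        List<Future<String>> parts = Arrays.asList(
                pool.submit(() -> queryFeed("d1")),
                pool.submit(() -> queryFeed("d2")),
                pool.submit(() -> queryFeed("d3")));
        StringBuilder combined = new StringBuilder();
        for (Future<String> part : parts) {
            combined.append(part.get()); // blocks until that thread is done
        }
        pool.shutdown();
        return combined.toString();
    }

    private String queryFeed(String db) {
        return ""; // placeholder for the real database call
    }
}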
An elegant way to implement what you need would be to create a parameterized endpoint in your application:
/runFeed?db=d1
From your "main" application code you can then perform a fetchAsync call via URLFetchService, which returns a java.util.concurrent.Future<HTTPResponse>.
This will allow you better monitoring of what your application does.
Note that this adds network latency to your application and increases its cost, since URLFetchService is not free.
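A minimal sketch of that fan-out, assuming the endpoint above (the host name is a placeholder):
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
import java.net.URL;
import java.util.concurrent.Future;

public class FeedFanOut {
    byte[][] fetchAll() throws Exception {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        // Kick off all three requests without blocking...
        Future<HTTPResponse> r1 = fetcher.fetchAsync(new URL("https://my-app.appspot.com/runFeed?db=d1"));
        Future<HTTPResponse> r2 = fetcher.fetchAsync(new URL("https://my-app.appspot.com/runFeed?db=d2"));
        Future<HTTPResponse> r3 = fetcher.fetchAsync(new URL("https://my-app.appspot.com/runFeed?db=d3"));
        // ...then block for each body; combine them however the feed needs.
        return new byte[][] { r1.get().getContent(), r2.get().getContent(), r3.get().getContent() };
    }
}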

Java support for three different concurrency models

I am going through the different concurrency models in a multi-threaded environment (http://tutorials.jenkov.com/java-concurrency/concurrency-models.html).
The article highlights three concurrency models.
Parallel Workers
The first concurrency model is what I call the parallel worker model. Incoming jobs are assigned to different workers.
Assembly Line
The workers are organized like workers at an assembly line in a factory. Each worker only performs a part of the full job. When that part is finished the worker forwards the job to the next worker.
Each worker is running in its own thread, and shares no state with other workers. This is also sometimes referred to as a shared nothing concurrency model.
Functional Parallelism
The basic idea of functional parallelism is that you implement your program using function calls. Functions can be seen as "agents" or "actors" that send messages to each other, just like in the assembly line concurrency model (AKA reactive or event driven systems). When one function calls another, that is similar to sending a message.
Now I want to map java API support for these three concepts
Parallel Workers: is this the ExecutorService, ThreadPoolExecutor, CountDownLatch API?
Assembly Line: sending an event to a messaging system like JMS and using messaging concepts of queues and topics.
Functional Parallelism: ForkJoinPool to some extent, and Java 8 streams. The fork/join pool is easier to understand than streams.
Am I correct in mapping these concurrency models? If not please correct me.
Each of those models describes how work is split up and performed from a general point of view, but when it comes to implementation, it really depends on your exact problem. Generally I see it like this:
Parallel Workers: a producer creates new jobs somewhere (e.g., in a BlockingQueue) and many threads (via an ExecutorService) process those jobs in parallel. Of course, you could also use a CountDownLatch, but that means you want to trigger an action after exactly N subproblems have been processed (e.g., you know your big problem can be split into N smaller problems; check the second example here).
Assembly Line: for every intermediate step, you have a BlockingQueue and one Thread or an ExecutorService. At each step the jobs are taken from one BlockingQueue and put into the next one, to be processed further; a sketch follows below. Regarding your idea with JMS: JMS is meant to connect distributed components and is part of Java EE; it was not designed for highly concurrent in-process use (messages are usually kept on disk before being processed).
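Here is a minimal sketch of such a pipeline with two trivial stages (everything here is illustrative):
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AssemblyLine {
    public static void main(String[] args) {
        BlockingQueue<String> toTrim = new LinkedBlockingQueue<>();
        BlockingQueue<String> toPrint = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        pool.execute(() -> { // stage 1: take from the first queue, trim, pass on
            try {
                while (true) toPrint.put(toTrim.take().trim());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pool.execute(() -> { // stage 2: take from the second queue and print
            try {
                while (true) System.out.println("done: " + toPrint.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        toTrim.add("  hello  "); // feed a job into the line
    }
}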
Functional Parallelism: ForkJoinPool is a good example of how you could implement this; see the sketch below.
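For instance, a minimal fork/join sketch that sums a range by recursively splitting it:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Functional parallelism: the problem is split into subtasks whose
// results are combined, with no shared mutable state.
class SumTask extends RecursiveTask<Long> {
    private final long from, to;

    SumTask(long from, long to) {
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= 1_000) { // small enough: compute directly
            long sum = 0;
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        long mid = (from + to) / 2;
        SumTask left = new SumTask(from, mid);
        SumTask right = new SumTask(mid, to);
        left.fork(); // run the left half asynchronously
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(0, 1_000_000));
        System.out.println(sum); // 499999500000
    }
}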
An excellent question, to which the answer might not be quite as satisfying. The concurrency models listed show some of the ways you might want to go about implementing a concurrent system. The API provides the tools for implementing any of these models.
Let's start with ExecutorService. It allows you to submit tasks to be executed in a non-blocking way. The ThreadPoolExecutor implementation then limits the maximum number of threads available. The ExecutorService does not require the task to perform the complete process, as you might expect of a parallel worker: the task may be limited to a specific part of the process and send a message upon completion that starts the next step in an assembly line.
The CountDownLatch and the ExecutorService provide a means to block until all workers have completed, which may come in handy if a certain process has been divided into different concurrent sub-tasks.
The point of JMS is to provide a means for messaging between components. It does not enforce a specific concurrency model. Queues and topics denote how a message is sent from a publisher to subscribers: when you use queues, the message is sent to exactly one subscriber; topics, on the other hand, broadcast the message to all subscribers of the topic.
Similar behavior could be achieved within a single component by for example using the observer pattern.
ForkJoinPool is actually one implementation of ExecutorService (which might highlight the difficulty of matching a model to an implementation detail). It just happens to be optimized for working with a large number of small tasks.
Summary: There are multiple ways to implement a certain concurrency model in the Java environment. The interfaces, classes and frameworks used in implementing a program may vary regardless of the concurrency model chosen.
The actor model is another example of an assembly line, e.g., Akka.

Trouble understanding Java threads

I learned about multiprocessing from Python and I'm having a bit of trouble understanding Java's approach. In Python, I can say I want a pool of 4 processes and then send a bunch of work to my program and it'll work on 4 items at a time. I realized, with Java, I need to use threads to achieve this same task and it seems to be working really really well so far.
But unlike in Python, my CPUs aren't getting 100% utilization (they are at about 70-80%), and I suspect it's the way I'm creating threads (the code is the same between Python and Java, and the work items are independent). In Java, I'm not sure how to cap the number of threads, so I create a thread for every item in the list I want to process, like this:
for (int i = 0; i < 500; i++) {
    Runnable task = new MyRunnable(10000000L + i);
    Thread worker = new Thread(task);
    // We can set the name of the thread
    worker.setName(String.valueOf(i));
    // Start the thread, never call method run() direct
    worker.start();
    // Remember the thread for later usage
    threads.add(worker);
}
I took it from here. My question is: is this the correct way to launch threads, or is there a way to have Java itself manage the number of threads so it's optimal? I want my code to run as fast as possible, and I'm trying to understand how to identify and resolve any issues that may arise from too many threads being created.
This is not a major issue, just curious to how it works under the Java hood.
You use an Executor, the implementation of which handles a pool of threads, decides how many, and so forth. See the Java tutorial for lots of examples.
In general, bare threads aren’t used in Java except for very simple things. Instead, there will be some higher-level API that receives your Runnable or Task and knows what to do.
Take a look at the Java Executor API. See this article, for example.
Although creating threads is much 'cheaper' than it used to be, creating large numbers of them (one per Runnable, as in your example) isn't the way to go: there's still an overhead in creating them, and you'll end up with too much context switching.
The Executor API allows you to create various types of thread pool for executing Runnable tasks, so you can reuse threads, flexibly manage the number that are created, and avoid the overhead of thread-per-runnable; see the sketch below.
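For example, a minimal sketch of the questioner's loop rewritten with a fixed-size pool (MyRunnable is the class from the question):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledVersion {
    public static void main(String[] args) throws Exception {
        // One thread per core instead of 500 threads.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            futures.add(pool.submit(new MyRunnable(10000000L + i)));
        }
        for (Future<?> future : futures) {
            future.get(); // wait for each task to finish
        }
        pool.shutdown();
    }
}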
The Java threading model and the Python threading model (not multiprocessing) are really quite similar, incidentally. There isn't a Global Interpreter Lock as in Python, so there's usually less need to fork off multiple processes.
Thread is a "low level" API.
Depending on what you want to do, and the version of Java you use, there are better solutions.
If you use Java 7, and if your task allows it, you can use the fork/join framework: http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
In any case, take a look at the Java concurrency tutorial: http://docs.oracle.com/javase/tutorial/essential/concurrency/executors.html

Techniques for modeling a dynamic dataflow with Java concurrency API

EDIT: This is basically a "how to properly implement a data flow engine in Java" question, and I feel this cannot be adequately answered in a single answer (it's like asking, "how to properly implement an ORM layer" and getting someone to write out the details of Hibernate or something), so consider this question "closed".
Is there an elegant way to model a dynamic dataflow in Java? By dataflow, I mean there are various types of tasks, and these tasks can be "connected" arbitrarily, such that when a task finishes, successor tasks are executed in parallel using the finished task's output as input, or when multiple tasks finish, their output is aggregated in a successor task (see flow-based programming). By dynamic, I mean that the type and number of successor tasks spawned when a task finishes depends on the output of that finished task, so for example, task A may spawn task B if it has a certain output, but may spawn task C if it has a different output. Another way of putting it is that each task (or set of tasks) is responsible for determining what the next tasks are.
Sample dataflow for rendering a webpage. I have as task types: file downloader, HTML/CSS renderer, HTML parser/DOM builder, image renderer, JavaScript parser, JavaScript interpreter.
File downloader task for the HTML file
  HTML parser/DOM builder task
    File downloader task for each embedded file/link
      If an image: image renderer
      If external JavaScript: JavaScript parser
        JavaScript interpreter
      Otherwise: just store in some var/field in the HTML parser task
    JavaScript parser for each embedded script
      JavaScript interpreter
  Wait for the above tasks to finish, then run the HTML/CSS renderer (obviously not optimal or perfectly correct, but this is simple)
I'm not saying the solution needs to be a comprehensive framework (in fact, the closer to the JDK API, the better), and I absolutely don't want something as heavyweight as, say, Spring Web Flow or some declarative markup or other DSL.
To be more specific, I'm trying to think of a good way to model this in Java with Callables, Executors, ExecutorCompletionServices, and perhaps various synchronizer classes (like Semaphore or CountDownLatch). There are a couple use cases and requirements:
Don't make any assumptions on what executor(s) the tasks will run on. In fact, to simplify, just assume there's only one executor. It can be a fixed thread pool executor, so a naive implementation can result in deadlocks (e.g. imagine a task that submits another task and then blocks until that subtask is finished, and now imagine several of these tasks using up all the threads).
To simplify, assume that data is not streamed between tasks (task output -> succeeding task input): the finishing task and the succeeding task don't have to exist together, so the input data to the succeeding task will not be changed by the preceding task (since it's already done).
There are only a couple operations that the dataflow "engine" should be able to handle:
A mechanism where a task can queue more tasks
A mechanism whereby a successor task is not queued until all the required input tasks are finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until the flow is finished
A mechanism whereby the main thread (or other threads not managed by the executor) blocks until certain tasks have finished
Since the dataflow is dynamic (depends on input/state of the task), the activation of these mechanisms should occur within the task code, e.g. the code in a Callable is itself responsible for queueing more Callables.
The dataflow "internals" should not be exposed to the tasks (Callables) themselves - only the operations listed above should be available to the task.
Note that the type of the data is not necessarily the same for all tasks, e.g. a file download task may accept a File as input but will output a String.
If a task throws an uncaught exception (indicating some fatal error requiring all dataflow processing to stop), it must propagate up to the thread that initiated the dataflow as quickly as possible and cancel all tasks (or something fancier like a fatal error handler).
Tasks should be launched as soon as possible. This along with the previous requirement should preclude simple Future polling + Thread.sleep().
As a bonus, I would like the dataflow engine itself to perform some action (like logging) every time a task finishes, or when no task has finished within X time of the last one. Something like:
ExecutorCompletionService<T> ecs;
while (hasTasks()) {
    Future<T> future = ecs.poll(1, TimeUnit.MINUTES);
    some_action_like_logging();
    if (future != null) {
        future.get();
        ...
    }
    ...
}
Are there straightforward ways to do all this with the Java concurrency API? Or, if it's going to be complex no matter what with what's available in the JDK, is there a lightweight library that satisfies the requirements? I already have a partial solution that fits my particular use case (it cheats in a way, since I'm using two executors, and just so you know, it's not related at all to the web browser example I gave above), but I'd like to see a more general-purpose and elegant solution.
How about defining an interface such as:
interface Task<T> extends Callable<T> {
    boolean isReady();
}
Your "dataflow engine" would then just need to manage a collection of Task objects i.e. allow new Task objects to be queued for excecution and allow queries as to the status of a given task (so maybe the interface above needs extending to include id and/or type). When a task completes (and when the engine starts of course) the engine must just query any unstarted tasks to see if they are now ready, and if so pass them to be run on the executor. As you mention, any logging, etc. could also be done then.
One other thing that may help is to use Guice (http://code.google.com/p/google-guice/) or a similar lightweight DI framework to help wire up all the objects correctly: e.g., to ensure that the correct executor type is created, and to make sure that Tasks that need access to the dataflow engine (either for their isReady method or for queuing other tasks, say) can be provided with an instance without introducing complex circular relationships.
HTH, but please do comment if I've missed any key aspects...
Paul.
Look at https://github.com/rfqu/df4j — a simple but powerful dataflow library. If it lacks some desired features, they can be added easily.

What are some strategies to unit test a scheduler?

This post started out as "What are some common patterns in unit testing multi-threaded code?", but I found some other discussions on SO that generally agreed that "It is Hard (TM)" and "It Depends (TM)". So I thought that reducing the scope of the question would be more useful.
Background: We are implementing a simple scheduler that gives you a way to register callbacks for job start and stop and, of course, to configure the scheduling frequency. Currently, we're building a lightweight wrapper around java.util.Timer.
Aspects:
I haven't found a way to test this scheduler relying only on its public interface (something like addJob(jobSchedule, jobArgs, jobListener), removeJob(jobId)).
How do I verify that a job was called according to the specified schedule?
You could use a recorder object that records the order, timings and other useful information in each unit test of your scheduler. The test is simple:
create a recorder object
configure the schedule
execute a unit test
check that the recorder object is "compatible" with the schedule
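A minimal sketch of such a recorder, assuming the jobListener passed to addJob gets a start callback (all names here are illustrative):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Records when each job ran so the test can compare against the schedule.
class RecordingListener {
    private final List<Long> startTimes =
            Collections.synchronizedList(new ArrayList<>());

    // Called by the scheduler when a job starts.
    public void jobStarted(String jobId) {
        startTimes.add(System.nanoTime());
    }

    // Assert that consecutive runs are roughly periodMillis apart.
    public void assertPeriodAbout(long periodMillis, long toleranceMillis) {
        for (int i = 1; i < startTimes.size(); i++) {
            long delta = (startTimes.get(i) - startTimes.get(i - 1)) / 1_000_000;
            if (Math.abs(delta - periodMillis) > toleranceMillis) {
                throw new AssertionError(
                        "run " + i + " came " + delta + " ms after the previous run");
            }
        }
    }
}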
One thing also to remember is that you don't need to test that Timer works. You can write a mock version of Timer (by extending the class or using EasyMock) that simply checks that you are calling it correctly, possibly even replacing enough that you don't need threads at all. That might be more work than needed if your job listener already has enough callbacks to track the scheduler, though.
The other important thing to remember is that when testing the scheduler, you should use custom jobs that track how the scheduler is working, and when testing scheduled jobs, you should call the callbacks directly rather than through the scheduler. You may also have a higher-level integration test that checks both together, depending on the system.
There are many failure modes that such a scheduler could exhibit, and each would most likely require its own test case. These test cases are likely to be very different, so "it depends."
For testing concurrent software in Java in general, I recommend this presentation from JavaOne 2007: Testing Concurrent Software.
For testing that a scheduler executes jobs in accurate accordance with their schedule, I'd create an abstraction of time itself. I've done something similar in one of my projects, where I have a Time or Clock interface. The default implementation is MillisecondTime, but during testing I switch it out for a TickTime. This implementation allows my unit test to control when the time advances and by how much.
This way, you could write a test where a job is scheduled to run once every 10 ticks. Then your test just advances the tick counter and checks that the jobs run at the correct ticks.
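A minimal sketch of that time abstraction (the names mirror the description above; the exact shape is a guess):
// The scheduler asks a Clock for the current time instead of calling
// System.currentTimeMillis() directly.
interface Clock {
    long now();
}

// Production implementation: real wall-clock milliseconds.
class MillisecondTime implements Clock {
    @Override
    public long now() {
        return System.currentTimeMillis();
    }
}

// Test implementation: time advances only when the test says so.
class TickTime implements Clock {
    private long ticks = 0;

    @Override
    public long now() {
        return ticks;
    }

    public void advance(long n) {
        ticks += n;
    }
}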
A couple of ways to test concurrent code.
Run the same code many times under load; some bugs appear only occasionally, but can show up consistently if the test is performed repeatedly.
Store the results of different threads/jobs in a collection such as a BlockingQueue. This will allow you to check the results in the current thread and finish in a timely manner (without ugly arbitrary sleep statements)
If you are finding testing concurrency difficult consider refactoring your objects/components to make them easier to test.
If the scheduler delegates to an Executor or ExecutorService to run the tasks, you could use dependency injection to remove the direct dependency on the type of Executor, and use a simple single-threaded Executor to test much of your scheduler's functionality without the complication of truly multi-threaded code. Once you've got those tests debugged, you can move on to the harder, but now substantially reduced, task of testing thread-safety. A sketch of the injection point follows below.
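A minimal sketch of that injection point (the Scheduler shape is illustrative):
import java.util.concurrent.Executor;

// The scheduler accepts any Executor, so tests can run jobs synchronously.
class Scheduler {
    private final Executor executor;

    Scheduler(Executor executor) {
        this.executor = executor;
    }

    void runJob(Runnable job) {
        executor.execute(job);
    }
}

// In production: new Scheduler(Executors.newSingleThreadExecutor());
// In a test, a same-thread executor makes execution deterministic:
// Scheduler scheduler = new Scheduler(Runnable::run);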
