Clustering using Threads in Java

I have a job in Java that takes too long. I want to divide this job among threads and run them. After the threads finish their jobs, they return to my service and the service gives them new jobs. Is ThreadGroup suitable for this, or is there any other recommendation?

First of all, you need threads if either:
a) You have a multiprocessor machine
or b) You have a single processor but your jobs are IO-intensive (and not CPU-intensive)
Otherwise, you will gain nothing when using threads.
What you need here is a thread pool.
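As a rough illustration of the thread-pool suggestion above, the sketch below splits one long job into independent chunks, runs them on a fixed pool, and adds up the partial results. The class name ChunkedJob, the sumRange work function, and the chunk count are hypothetical stand-ins for the real job, and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: sums 1..n by splitting the range into chunks.
class ChunkedJob {

    // Stand-in for one piece of the long-running job.
    static long sumRange(long fromInclusive, long toExclusive) {
        long total = 0;
        for (long i = fromInclusive; i < toExclusive; i++) {
            total += i;
        }
        return total;
    }

    static long runInPool(long n, int chunks) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            long chunkSize = (n + chunks - 1) / chunks; // round up
            for (long start = 1; start <= n; start += chunkSize) {
                final long from = start;
                final long to = Math.min(start + chunkSize, n + 1);
                tasks.add(() -> sumRange(from, to));
            }
            long total = 0;
            // invokeAll blocks until every chunk has finished.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown(); // pool threads would otherwise keep the JVM alive
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runInPool(1_000_000L, 8));
    }
}
```

The pool is sized to the number of available processors, which matches the point above that extra threads only help when there are spare cores or the work is I/O-bound.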

Check out the ExecutorCompletionService - it does exactly this.
Example: [pulled from Java 6 API JavaDocs]
void solve(Executor e, Collection<Callable<Result>> solvers)
        throws InterruptedException, ExecutionException {
    CompletionService<Result> ecs
        = new ExecutorCompletionService<Result>(e);
    for (Callable<Result> s : solvers)
        ecs.submit(s);
    int n = solvers.size();
    for (int i = 0; i < n; ++i) {
        Result r = ecs.take().get();
        if (r != null)
            use(r);
    }
}

I'm not sure what state of development your project is currently in, since your problem statement is quite limited, but you might want to consider having a look at the fork-join project coming in JDK7: http://www.ibm.com/developerworks/java/library/j-jtp11137.html
There's a lot to gain and learn from looking at that, and since it's all open source you can already download the code as a patch and have a go at working with it.
(It might not be applicable to anything you have to implement right now, but it is worth a look nonetheless if you intend to develop or maintain your application for some time in the future.)

Take a look at the java.util.concurrent package.
There's a tutorial where you should find everything you need to know here:
http://java.sun.com/docs/books/tutorial/essential/concurrency/
Focus on the High Level Concurrency Objects in particular.

ThreadGroup isn't generally of much use to application code. It's not a great deal of use to container code either. The Java PlugIn uses ThreadGroup to distinguish which applet a thread belongs to.
java.util.concurrent, in particular ExecutorService, provides amongst other things handy utilities for handling threads and concurrency.
For computationally intensive fine-grained tasks, the fork-join framework in JDK7 will be useful.
Before starting on this difficult code, you might want to consider whether it is worth it. Can you do other optimisations that don't require large-scale thread use? Is it I/O latency you are trying to deal with? If the work is CPU-intensive, there is not a great deal of point in using many more threads than you have in hardware.

Related

Multiprocessing in Java with Killable thread

I have a scenario in which I am running unreliable code in Java (the scenario is not unlike this). I am providing the framework classes, and the intent is for the third party to override a base class method called doWork(). However, if the client doWork() enters a funked state (such as an infinite loop), I need to be able to terminate the worker.
Most of the solutions (I've found this example and this example) revolve around a loop check for a volatile boolean:
while (keepRunning) {
//some code
}
or checking the interrupted status:
while (!isInterrupted()) {
//some code
}
However, neither of these solutions deals with the following in the '//some code' section:
for (int i = 0; i < 10; i++) {
i = i - 1;
}
I understand the reasons Thread.stop() was deprecated, and obviously, running faulty code isn't desirable, though my situation forces me to run code I can't verify myself. But I find it hard to believe Java doesn't have some mechanism for handling threads which get into an unacceptable state. So, I have two questions:
Is it possible to launch a Thread or Runnable in Java which can be reliably killed? Or does Java require cooperative multithreading to the point where a thread can effectively hose the system?
If not, what steps can be taken to pass live objects (such as active network connections) so that the code can be run in a Process instead, where I can actually kill it?
If you really don't want to spawn new processes (or probably cannot, because of the requirement to pass network connections), you can try to instrument the code of this 'plugin' when you load its class. I mean change its bytecode so that it includes static calls to some utility method (e.g. ClientMentalHealthChecker.isInterrupted()). It's actually not that hard to do. Here you can find some tools that might help: https://java-source.net/open-source/bytecode-libraries. It won't be bulletproof, because there are other ways of blocking execution. Also keep in mind that the client's code can catch InterruptedException.
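To make the instrumentation idea more concrete, here is a hand-written sketch of what the client's loop might look like after a static check has been inserted into the loop body. No bytecode library is used; the check is typed in manually. WorkerGuard, checkpoint() and WorkerKilledError are hypothetical names invented for this example. The check throws an Error rather than an Exception so that a client's catch (Exception e) does not swallow it, although a catch (Throwable t) still could.

```java
// Hypothetical helper that instrumented client code would call.
class WorkerGuard {

    // An Error, so a plain catch (Exception e) in client code does not swallow it.
    static class WorkerKilledError extends Error {
        WorkerKilledError() {
            super("worker was asked to stop");
        }
    }

    // The call that instrumentation would insert into every loop body.
    static void checkpoint() {
        if (Thread.currentThread().isInterrupted()) {
            throw new WorkerKilledError();
        }
    }

    // Stand-in for a misbehaving doWork(): the loop from the question plus the inserted check.
    static void runawayWork() {
        for (int i = 0; i < 10; i++) {
            i = i - 1;
            checkpoint(); // inserted by instrumentation
        }
    }

    // Starts the worker, lets it spin, interrupts it, and reports whether it stopped.
    static boolean startAndKill(long runMillis) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                runawayWork();
            } catch (WorkerKilledError expected) {
                // the framework would log and clean up here
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if the kill fails
        worker.start();
        Thread.sleep(runMillis);
        worker.interrupt();
        worker.join(2000);
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("stopped: " + startAndKill(100));
    }
}
```

This only covers code paths that pass through an inserted check; a thread blocked in a native call or in uninstrumented library code would still not be stopped.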

Trouble understanding Java threads

I learned about multiprocessing from Python and I'm having a bit of trouble understanding Java's approach. In Python, I can say I want a pool of 4 processes and then send a bunch of work to my program and it'll work on 4 items at a time. I realized, with Java, I need to use threads to achieve this same task and it seems to be working really really well so far.
But.. unlike in Python, my cpu(s) aren't getting 100% utilization (they are about 70-80%) and I suspect it's the way I'm creating threads (code is the same between Python/Java and processes are independent). In Java, I'm not sure how to create one thread so I create a thread for every item in a list I want to process, like this:
for (int i = 0; i < 500; i++) {
Runnable task = new MyRunnable(10000000L + i);
Thread worker = new Thread(task);
// We can set the name of the thread
worker.setName(String.valueOf(i));
// Start the thread; never call run() directly
worker.start();
// Remember the thread for later usage
threads.add(worker);
}
I took it from here. My question is: is this the correct way to launch threads, or is there a way to have Java itself manage the number of threads so it's optimal? I want my code to run as fast as possible and I'm trying to understand how to identify and resolve any issues that may be arising from too many threads being created.
This is not a major issue, just curious to how it works under the Java hood.
You use an Executor, the implementation of which handles a pool of threads, decides how many, and so forth. See the Java tutorial for lots of examples.
In general, bare threads aren’t used in Java except for very simple things. Instead, there will be some higher-level API that receives your Runnable or Task and knows what to do.
Take a look at the Java Executor API. See this article, for example.
Although creating Threads is much 'cheaper' than it used to be, creating large numbers of threads (one per runnable as in your example) isn't the way to go - there's still an overhead in creating them, and you'll end up with too much context switching.
The Executor API allows you to create various types of thread pool for executing Runnable tasks, so you can reuse threads, flexibly manage the number that are created, and avoid the overhead of thread-per-runnable.
The Java threading model and the Python threading model (not multiprocessing) are really quite similar, incidentally. There isn't a Global Interpreter Lock as in Python, so there's usually less need to fork off multiple processes.
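As a sketch of the Executor approach recommended above, the loop from the question can be rewritten so that the 500 tasks share a pool sized to the number of cores instead of getting one thread each. PooledRunner and the summing loop are hypothetical stand-ins for the asker's MyRunnable, and the one-minute timeout is an arbitrary choice; Java 8 or later is assumed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PooledRunner {

    // Runs taskCount tasks on a pool sized to the number of cores and
    // returns how many of them completed.
    static int runAll(int taskCount) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < taskCount; i++) {
            final long limit = 100_000L + i; // stand-in for MyRunnable's argument
            pool.execute(() -> {
                long sum = 0;
                for (long n = 0; n < limit; n++) {
                    sum += n; // placeholder CPU work
                }
                completed.incrementAndGet();
            });
        }

        pool.shutdown(); // accept no new tasks; queued tasks still run
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow(); // give up on anything still running
        }
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(500) + " tasks completed");
    }
}
```

Tasks that do not fit on a free thread wait in the pool's queue, so only as many threads as there are cores ever exist, which avoids the creation and context-switching overhead mentioned above.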
Thread is a "low level" API.
Depending on what you want to do, and the version of Java you use, there is a better solution.
If you use Java 7, and if your task allows it, you can use the fork/join framework: http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
However, take a look at the java concurrency tutorial : http://docs.oracle.com/javase/tutorial/essential/concurrency/executors.html

Java aims to be ‘Threaded’

I need help to understand threads in Java.
A thread is a thread of execution in a program. The Java Virtual Machine allows an application to have multiple threads of execution running concurrently.
What do we mean when we say that Java aims to be ‘Threaded’
This means that various operations can and should be executed concurrently. This can be achieved by using threads. You can either use the "low level" thread API (Thread, Runnable) or a higher level API (Timer, Executors).
I hope this is enough to start googling and learning. I'd recommend starting with the low level threading API to understand how to work with threads and synchronization. Then go on and learn the facilities of the concurrency package introduced in Java 1.5. Do not start with the higher level API. You need the low level to understand later what happens behind the scenes when you submit a task to an executor.
threads are a popular way to implement concurrency in languages. java has them. that's what it means.
"Java is threaded" means that Java could execute two or more jobs at the same time.
If you want to learn more about that look at Oracle Java concurrency tutorial: http://docs.oracle.com/javase/tutorial/essential/concurrency/
What do we mean when we say that Java aims to be ‘Threaded’
Well, literally we don't say that, because calling a runtime environment "threaded" means something rather different; see http://en.wikipedia.org/wiki/Threaded_code. (And note that that page takes care to distinguish between "threaded" and "multi-threaded"!)
In fact, we describe Java as being a language that supports "Multi-threaded" programming. The quotation in your question is a succinct description of what that means. A more long-winded description is as follows.
A program normally executes statements in sequence. So for example:
int i = 1;
i = i + j;
if (i < 10) {
...
}
In the above, the statements are executed one after another in sequence.
A thing that controls the execution of statements like that is called a "thread of control" or (more commonly) a thread. You can think of it as an automaton that executes statements one after another, and that is only capable of doing one at a time. It keeps a record of the state of the local variables and the procedure calls. (It typically uses a stack and a set of private registers to do this ... but that's an implementation detail.)
In a multi-threaded program, there are potentially many of these automatons, each executing a different sequence of statements (using its own stack and registers). Each thread is potentially able to communicate with other threads (by observing shared objects, etc.) and can synchronize with them in various ways and for various reasons.
Depending on the hardware (and the operating system), the threads may either all run on the same processor, or they may (at different times) run on different processors. It is typically a combination of the two, and it is typically up to the operating system to decide which of the threads that can do work is allowed to run. (This is handled by the thread scheduler.)
From a Java perspective, multi-threaded programming is implemented at the low level using the Thread class, synchronized methods and blocks, and the Object level wait and notify methods. Higher level APIs provide standard building blocks for solving common problems.
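A minimal sketch of the low-level building blocks just mentioned: several Thread objects share one object and use synchronized methods so that their updates do not interfere. SharedCounter and its method names are made up for this illustration, and the lambda syntax assumes Java 8 or later.

```java
class SharedCounter {
    private int count = 0;

    // synchronized: only one thread at a time may run these methods on a given instance.
    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }

    // Starts the given number of threads, each incrementing the same counter.
    static int countWithThreads(int threads, int incrementsPerThread) throws InterruptedException {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join(); // wait for each thread of control to finish
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWithThreads(4, 10_000));
    }
}
```

Without the synchronized keyword, two threads could read the same old value of count and both write back the same new value, so some increments would be lost.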

Java Framework for managing Tasks

my question is, whether there exists a framework in Java for managing and concurrently running Tasks that have logical dependencies.
My Task is as follows:
I have a lot of independent tasks (Let's say A,B,C,D...), They are implemented as Commands (like in Command pattern). I would like to have a kind of executor which will accept all these tasks and execute them in a parallel manner.
The tasks can be dependent one on another (For example, I can't run C, Before I run A), synchronous or asynchronous.
I would also like to incorporate custom heuristics to affect the scheduler's execution. For example, if tasks A and B are CPU-intensive and C has, say, high memory consumption, it makes sense to run A and C in parallel rather than A and B.
Before diving into building this stuff by myself (i'm thinking about java.util.concurrent + annotation based constraints/rules), I was wondering, if someone could point me on some project that could suit my needs.
Thanks a lot in advance
I don't think that there is a framework for managing tasks that could fulfill your requirements. You are on the right path using the Command pattern. You could take a look at the Akka framework for a simplified concurrency model. Akka is based on the Actor model:
The actor model is another very simple high level concurrency model: actors can’t respond to more than one message at a time (messages are queued into mailboxes) and can only communicate by sending messages, not sharing variables. As long as the messages are immutable data structures (which is always true in Erlang, but has to be a convention in languages without means of ensuring this property), everything is thread-safe, without need for any other mechanism. This is very similar to request cycle found in web development MVC frameworks.
http://metaphysicaldeveloper.wordpress.com/2010/12/16/high-level-concurrency-with-jruby-and-akka-actors/
Akka is written in Scala but it exposes clean Java API.
I'd recommend examining the possibility of using Ant for this purpose. Although Ant is known as a popular build tool, it is actually an XML-controlled engine that runs various tasks. I think that its flag fork=true does exactly what you need: runs tasks concurrently. Like any Java application, Ant can be executed from another Java application: just call its main method. In this case you can wrap your tasks using the Ant API, i.e. implement them as Ant tasks.
I have never tried this approach, but I believe it should work. I thought about it several years ago and suggested it to my management as a possible solution for a problem similar to yours.
Eclipse's job scheduling module is able to handle interdependent tasks. Take a look at http://www.eclipse.org/articles/Article-Concurrency/jobs-api.html.
There is a framework specifically for this purpose called dexecutor (Disclaimer : I am the owner)
Dexecutor is a very lightweight framework for executing dependent/independent tasks in a reliable way. To do this it provides a minimal API:
An API to add nodes to the graph (addDependency, addIndependent, addAsDependentOnAllLeafNodes, addAsDependencyToAllInitialNodes; the latter two are hybrid versions of the first two)
and another to execute the nodes in order.
Here is the simplest example :
DefaultDependentTasksExecutor<Integer, Integer> executor = newTaskExecutor();
executor.addDependency(1, 2);
executor.addDependency(1, 2);
executor.addDependency(1, 3);
executor.addDependency(3, 4);
executor.addDependency(3, 5);
executor.addDependency(3, 6);
//executor.addDependency(10, 2); // cycle
executor.addDependency(2, 7);
executor.addDependency(2, 9);
executor.addDependency(2, 8);
executor.addDependency(9, 10);
executor.addDependency(12, 13);
executor.addDependency(13, 4);
executor.addDependency(13, 14);
executor.addIndependent(11);
executor.execute(ExecutionBehavior.RETRY_ONCE_TERMINATING);
Here is how the dependency graph would be constructed
Tasks 1, 12 and 11 would run in parallel. Once one of these tasks finishes, its dependent tasks would run. For example, once task 1 finishes, tasks 2 and 3 would run; similarly, once task 12 finishes, task 13 would run, and so on.
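For comparison, the same "run a task once its dependencies have finished" idea can be sketched with the JDK alone, using CompletableFuture from Java 8. This is not the Dexecutor API; it is a swapped-in technique, and the task names A to D and the class DependentTasks are hypothetical. It also does not cover the custom scheduling heuristics asked about in the question.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class DependentTasks {

    // Runs A and B in parallel, C after A, and D after both B and C.
    // Returns the order in which the tasks actually ran.
    static List<String> runGraph() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> log = new CopyOnWriteArrayList<>();
        try {
            CompletableFuture<Void> a = CompletableFuture.runAsync(() -> log.add("A"), pool);
            CompletableFuture<Void> b = CompletableFuture.runAsync(() -> log.add("B"), pool);
            CompletableFuture<Void> c = a.thenRunAsync(() -> log.add("C"), pool);
            CompletableFuture<Void> d =
                CompletableFuture.allOf(b, c).thenRunAsync(() -> log.add("D"), pool);
            d.join(); // wait for the last task in the graph
            return log;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runGraph());
    }
}
```

The relative order of A and B (and of B and C) can differ between runs, but C always follows A and D always comes last.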

Forcing multiple threads to use multiple CPUs when they are available

I'm writing a Java program which uses a lot of CPU because of the nature of what it does. However, lots of it can run in parallel, and I have made my program multi-threaded. When I run it, it only seems to use one CPU until it needs more then it uses another CPU - is there anything I can do in Java to force different threads to run on different cores/CPUs?
There are two basic ways to multi-thread in Java. Each logical task you create with these methods should run on a fresh core when needed and available.
Method one: define a Runnable or Thread object (which can take a Runnable in the constructor) and start it running with the Thread.start() method. It will execute on whatever core the OS gives it -- generally the less loaded one.
Tutorial: Defining and Starting Threads
Method two: define objects implementing the Runnable (if they don't return values) or Callable (if they do) interface, which contain your processing code. Pass these as tasks to an ExecutorService from the java.util.concurrent package. The java.util.concurrent.Executors class has a bunch of methods to create standard, useful kinds of ExecutorServices. Link to Executors tutorial.
From personal experience, the Executors fixed & cached thread pools are very good, although you'll want to tweak thread counts. Runtime.getRuntime().availableProcessors() can be used at run-time to count available cores. You'll need to shut down thread pools when your application is done, otherwise the application won't exit because the ThreadPool threads stay running.
Getting good multicore performance is sometimes tricky, and full of gotchas:
Disk I/O slows down a LOT when run in parallel. Only one thread should do disk read/write at a time.
Synchronization of objects provides safety to multi-threaded operations, but slows down work.
If tasks are too trivial (small work bits, execute fast) the overhead of managing them in an ExecutorService costs more than you gain from multiple cores.
Creating new Thread objects is slow. The ExecutorServices will try to re-use existing threads if possible.
All sorts of crazy stuff can happen when multiple threads work on something. Keep your system simple and try to make tasks logically distinct and non-interacting.
One other problem: controlling work is hard! A good practice is to have one manager thread that creates and submits tasks, and then a couple working threads with work queues (using an ExecutorService).
I'm just touching on key points here -- multithreaded programming is considered one of the hardest programming subjects by many experts. It's non-intuitive, complex, and the abstractions are often weak.
Edit -- Example using ExecutorService:
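import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Abstract because the three steps are left to a subclass.
public abstract class TaskThreader {
    // One task: runs the three processing steps on a single input.
    class DoStuff implements Callable<Object> {
        private Object in;
        DoStuff(Object input) {
            in = input;
        }
        public Object call() {
            in = doStep1(in);
            in = doStep2(in);
            in = doStep3(in);
            return in;
        }
    }

    public abstract Object doStep1(Object input);
    public abstract Object doStep2(Object input);
    public abstract Object doStep3(Object input);

    // Runs every input through the steps in parallel; results keep the input order.
    public List<Object> processAll(List<?> inputs) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
            for (Object input : inputs) {
                tasks.add(new DoStuff(input));
            }
            List<Object> results = new ArrayList<Object>();
            for (Future<Object> f : exec.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            exec.shutdown(); // otherwise the pool threads keep the application alive
        }
    }
}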
When I run it, it only seems to use one CPU until it needs more then it uses another CPU - is there anything I can do in Java to force different threads to run on different cores/CPUs?
I interpret this part of your question as meaning that you have already addressed the problem of making your application multi-thread capable. And despite that, it doesn't immediately start using multiple cores.
The answer to "is there any way to force ..." is (AFAIK) not directly. Your JVM and/or the host OS decide how many 'native' threads to use, and how those threads are mapped to physical processors. You do have some options for tuning. For example, I found this page which talks about how to tune Java threading on Solaris. And this page talks about other things that can slow down a multi-threaded application.
First, you should prove to yourself that your program would run faster on multiple cores. Many operating systems put effort into running program threads on the same core whenever possible.
Running on the same core has many advantages. The CPU cache is hot, meaning that data for that program is loaded into the CPU. The lock/monitor/synchronization objects are in CPU cache which means that other CPUs do not need to do cache synchronization operations across the bus (expensive!).
One thing that can very easily make your program run on the same CPU all the time is over-use of locks and shared memory. Your threads should not talk to each other. The less often your threads use the same objects in the same memory, the more often they will run on different CPUs. The more often they use the same memory, the more often they must block waiting for the other thread.
Whenever the OS sees one thread block for another thread, it will run that thread on the same CPU whenever it can. It reduces the amount of memory that moves over the inter-CPU bus. That is what I guess is causing what you see in your program.
First, I'd suggest reading "Java Concurrency in Practice" by Brian Goetz.
This is by far the best book describing concurrent Java programming.
Concurrency is 'easy to learn, difficult to master'. I'd suggest reading plenty about the subject before attempting it. It's very easy to get a multi-threaded program to work correctly 99.9% of the time, and fail 0.1%. However, here are some tips to get you started:
There are two common ways to make a program use more than one core:
Make the program run using multiple processes. An example is Apache compiled with the Pre-Fork MPM, which assigns requests to child processes. In a multi-process program, memory is not shared by default. However, you can map sections of shared memory across processes. Apache does this with its 'scoreboard'.
Make the program multi-threaded. In a multi-threaded program, all heap memory is shared by default. Each thread still has its own stack, but can access any part of the heap. Typically, most Java programs are multi-threaded, and not multi-process.
At the lowest level, one can create and destroy threads. Java makes it easy to create threads in a portable cross platform manner.
As it tends to get expensive to create and destroy threads all the time, Java now includes Executors to create re-usable thread pools. Tasks can be assigned to the executors, and the result can be retrieved via a Future object.
Typically, one has a task which can be divided into smaller tasks, but the end results need to be brought back together. For example, with a merge sort, one can divide the list into smaller and smaller parts, until one has every core doing the sorting. However, as each sublist is sorted, it needs to be merged in order to get the final sorted list. Since this "divide-and-conquer" problem is fairly common, there is a JSR framework which can handle the underlying distribution and joining. This framework will likely be included in Java 7.
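The merge sort described above can be sketched with the fork/join classes that ended up in java.util.concurrent (ForkJoinPool and RecursiveAction). The class name ParallelMergeSort and the THRESHOLD value are assumptions made for this example, and ForkJoinPool.commonPool() requires Java 8 or later.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 1_000; // below this size, sort sequentially (arbitrary)

    private final int[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    ParallelMergeSort(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            Arrays.sort(data, from, to);
            return;
        }
        int mid = (from + to) >>> 1;
        // Divide: sort both halves, possibly on different cores.
        invokeAll(new ParallelMergeSort(data, from, mid),
                  new ParallelMergeSort(data, mid, to));
        // Conquer: merge the two sorted halves.
        merge(mid);
    }

    private void merge(int mid) {
        int[] left = Arrays.copyOfRange(data, from, mid);
        int i = 0, j = mid, k = from;
        while (i < left.length && j < to) {
            data[k++] = (left[i] <= data[j]) ? left[i++] : data[j++];
        }
        while (i < left.length) {
            data[k++] = left[i++];
        }
        // Any remaining right-half elements are already in their final place.
    }

    static void sort(int[] data) {
        ForkJoinPool.commonPool().invoke(new ParallelMergeSort(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] sample = {5, 3, 9, 1, 7};
        sort(sample);
        System.out.println(Arrays.toString(sample));
    }
}
```

The threshold keeps the tasks from becoming too small, since very fine-grained tasks spend more time on scheduling than on sorting.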
There is no way to set CPU affinity in Java. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4234402
If you have to do it, use JNI to create native threads and set their affinity.
You should write your program to do its work in the form of a lot of Callables handed to an ExecutorService and executed with invokeAll(...).
You can then choose a suitable implementation at runtime from the Executors class. A suggestion would be to call Executors.newFixedThreadPool() with a number roughly corresponding to the number of cpu cores to keep busy.
The easiest thing to do is break your program into multiple processes. The OS will allocate them across the cores.
Somewhat harder is to break your program into multiple threads and trust the JVM to allocate them properly. This is -- generally -- what people do to make use of available hardware.
Edit
How can a multi-processing program be "easier"? Here's a step in a pipeline.
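import java.io.*;

public class SomeStep {
    public static void main(String[] args) throws IOException {
        // System.in and System.out are byte streams, so wrap them in a Reader/Writer first
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out));
        String line = stdin.readLine();
        while (line != null) {
            // process line, writing to stdout
            line = stdin.readLine();
        }
        stdout.flush();
    }
}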
Each step in the pipeline is similarly structured. 9 lines of overhead for whatever processing is included.
This may not be the absolute most efficient. But it's very easy.
The overall structure of your concurrent processes is not a JVM problem. It's an OS problem, so use the shell.
java -cp pipline.jar FirstStep | java -cp pipline.jar SomeStep | java -cp pipline.jar LastStep
The only thing left is to work out some serialization for your data objects in the pipeline.
Standard Serialization works well. Read http://java.sun.com/developer/technicalArticles/Programming/serialization/ for hints on how to serialize. You can replace the BufferedReader and BufferedWriter with ObjectInputStream and ObjectOutputStream to accomplish this.
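A small sketch of the ObjectOutputStream/ObjectInputStream swap suggested above. The Item class and the method names are hypothetical; a real pipeline step would pass System.out to writeAll and System.in to readAll, while the main method here uses an in-memory buffer to stand in for the pipe.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ObjectPipe {

    // Hypothetical record passed between pipeline steps.
    static class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int value;

        Item(String name, int value) {
            this.name = name;
            this.value = value;
        }
    }

    // What an upstream step would do with System.out.
    static void writeAll(List<Item> items, OutputStream rawOut) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(rawOut);
        for (Item item : items) {
            out.writeObject(item);
        }
        out.flush();
    }

    // What a downstream step would do with System.in.
    static List<Item> readAll(InputStream rawIn) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(rawIn);
        List<Item> items = new ArrayList<>();
        try {
            while (true) {
                items.add((Item) in.readObject());
            }
        } catch (EOFException endOfStream) {
            // the upstream step closed its end of the pipe
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream pipe = new ByteArrayOutputStream();
        writeAll(Arrays.asList(new Item("a", 1), new Item("b", 2)), pipe);
        List<Item> received = readAll(new ByteArrayInputStream(pipe.toByteArray()));
        System.out.println(received.size() + " items received");
    }
}
```

Both sides of the pipe need the same Item class on their classpath, which is the case in the shell example because every step runs from the same jar.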
I think this issue is related to the Java Parallel Processing Framework (JPPF). Using it, you can run different jobs on different processors.
JVM performance tuning has been mentioned before in Why does this Java code not utilize all CPU cores?. Note that this only applies to the JVM, so your application must already be using threads (and more or less "correctly" at that):
http://ch.sun.com/sunnews/events/2009/apr/adworkshop/pdf/5-1-Java-Performance.pdf
With Java 8 you can use the following API from Executors:
public static ExecutorService newWorkStealingPool()
Creates a work-stealing thread pool using all available processors as its target parallelism level.
Thanks to the work-stealing mechanism, idle threads steal tasks from the task queues of busy threads, which tends to increase overall throughput.
From grepcode, implementation of newWorkStealingPool is as follows
/**
 * Creates a work-stealing thread pool using all
 * {@link Runtime#availableProcessors available processors}
 * as its target parallelism level.
 * @return the newly created thread pool
 * @see #newWorkStealingPool(int)
 * @since 1.8
 */
public static ExecutorService newWorkStealingPool() {
    return new ForkJoinPool
        (Runtime.getRuntime().availableProcessors(),
         ForkJoinPool.defaultForkJoinWorkerThreadFactory,
         null, true);
}
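A short usage sketch for the method shown above. WorkStealingDemo and the squaring tasks are hypothetical placeholders for real CPU-bound work; the only library calls used are Executors.newWorkStealingPool(), invokeAll and Future.get.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkStealingDemo {

    // Computes 1^2 + 2^2 + ... + n^2 with one small task per term.
    static long sumOfSquares(int n) throws Exception {
        ExecutorService pool = Executors.newWorkStealingPool();
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final long x = i;
                tasks.add(() -> x * x);
            }
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(10));
    }
}
```

The caller does not choose a thread count here; the pool targets the number of available processors, as the Javadoc above states.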
